Memory Copies from Device Significantly Slower than CUDA #420

@zfergus

Description

Thank you for providing this library. I have found it immensely helpful when using Vulkan compute shaders.

I am profiling my application, and I found that copying data back to the CPU takes significantly longer than in CUDA. Specifically, this single line accounts for 95% of my runtime:

std::vector<std::pair<uint32_t, uint32_t>> candidates = 
    t_candidates->vector<std::pair<uint32_t, uint32_t>>();

where t_candidates is created as

auto t_candidates = mgr->tensor(
    candidates_size, 2 * sizeof(uint32_t),
    kp::Memory::DataTypes::eCustom, kp::Memory::MemoryTypes::eDevice);

I tried to reproduce this same effect in as simple an example as possible using the Python bindings for Kompute and PyCUDA. I have attached a PDF of the notebook. You can see that copying 51 MiB from the GPU to the CPU takes 7.3 ms ± 399 µs with CUDA but 292 ms ± 160 µs with Kompute.

Is this an inherent limitation of Vulkan, or is there a way to speed up this copy?


Here are my GPU specs:

Device: NVIDIA GeForce RTX 5080
Compute Capability: (12, 0)
Total Memory: 15817 MiB
Max threads per block: 1024
Total number of SMs: 84

Attachments:
cuda_vs_kompute.pdf
