Thank you for providing this library. I have found it immensely helpful when using Vulkan compute shaders.
I am profiling my application, and I found that copying back to the CPU takes significantly more time than in CUDA. Specifically, this line accounts for 95% of my run time:

```cpp
std::vector<std::pair<uint32_t, uint32_t>> candidates =
    t_candidates->vector<std::pair<uint32_t, uint32_t>>();
```

where `t_candidates` is created as

```cpp
auto t_candidates = mgr->tensor(
    candidates_size, 2 * sizeof(uint32_t),
    kp::Memory::DataTypes::eCustom, kp::Memory::MemoryTypes::eDevice);
```

I tried to reproduce this same effect in an as-simple-as-possible example using the Python bindings for Kompute and PyCUDA. I have attached a PDF of the notebook. You can see that copying 51 MiB from the GPU to the CPU costs 7.3 ms ± 399 µs with CUDA but 292 ms ± 160 µs with Kompute.
Is this an inherent limitation of Vulkan, or is there a way to speed up this copy?
Here are my GPU specs:
Device: NVIDIA GeForce RTX 5080
Compute Capability: (12, 0)
Total Memory: 15817 MiB
Max threads per block: 1024
Total number of SMs: 84
Attachments:
- cuda_vs_kompute.pdf