Is this a duplicate?
Type of Bug
Performance
Component
CUB
Describe the bug
As I learned from #26, the CCCL team plans to refactor thrust::reduce_by_key to use cub::DeviceReduce::ReduceByKey.
However, I observed that the CUDA kernel launched by cub::DeviceReduce::ReduceByKey appears to be slower than the one launched by thrust::reduce_by_key, as shown below.
I am aware that the wall-clock time of thrust::reduce_by_key is longer than that of cub::DeviceReduce::ReduceByKey because of an additional device-to-host data transfer. My concern is about kernel time: if thrust::reduce_by_key is migrated to cub::DeviceReduce::ReduceByKey, Thrust may end up with a slower kernel, which would affect CCCL users.
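For anyone who wants to compare the two paths, the following is a minimal sketch and not the benchmark behind the numbers in this report. The input size, the run length of 8, the `int` key and value types, and the helper names `time_on_gpu`, `run_cub`, and `run_thrust` are all assumptions chosen for illustration. It times each call with CUDA events after one warm-up run and checks that both paths find the same number of key runs.

```cuda
#include <cub/device/device_reduce.cuh>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/functional.h>
#include <thrust/reduce.h>
#include <cuda_runtime.h>
#include <cassert>
#include <cstdio>
#include <vector>

struct Timing { float ms; int num_runs; };

// Hypothetical helper: GPU-timeline milliseconds spent on whatever `launch`
// enqueues on the default stream (this includes any copies it issues).
template <typename F>
float time_on_gpu(F&& launch) {
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  cudaEventRecord(start);
  launch();
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return ms;
}

Timing run_cub(const thrust::device_vector<int>& keys,
               const thrust::device_vector<int>& vals) {
  const int n = static_cast<int>(keys.size());
  thrust::device_vector<int> out_keys(n), out_vals(n), num_runs(1);
  const int* k = thrust::raw_pointer_cast(keys.data());
  const int* v = thrust::raw_pointer_cast(vals.data());
  int* ok = thrust::raw_pointer_cast(out_keys.data());
  int* ov = thrust::raw_pointer_cast(out_vals.data());
  int* nr = thrust::raw_pointer_cast(num_runs.data());

  // First call with a null buffer only queries the temporary storage size.
  size_t temp_bytes = 0;
  cub::DeviceReduce::ReduceByKey(nullptr, temp_bytes, k, ok, v, ov, nr,
                                 thrust::plus<int>(), n);
  thrust::device_vector<unsigned char> temp(temp_bytes);
  void* t = thrust::raw_pointer_cast(temp.data());

  auto launch = [&] {
    cub::DeviceReduce::ReduceByKey(t, temp_bytes, k, ok, v, ov, nr,
                                   thrust::plus<int>(), n);
  };
  launch();  // warm-up, excluded from the measurement
  const float ms = time_on_gpu(launch);
  const int runs = num_runs[0];  // device-to-host read, outside the timed region
  return {ms, runs};
}

Timing run_thrust(const thrust::device_vector<int>& keys,
                  const thrust::device_vector<int>& vals) {
  thrust::device_vector<int> out_keys(keys.size()), out_vals(keys.size());
  int runs = 0;
  auto launch = [&] {
    auto ends = thrust::reduce_by_key(thrust::device, keys.begin(), keys.end(),
                                      vals.begin(), out_keys.begin(),
                                      out_vals.begin());
    runs = static_cast<int>(ends.first - out_keys.begin());
  };
  launch();  // warm-up, excluded from the measurement
  const float ms = time_on_gpu(launch);
  return {ms, runs};
}

int main() {
  const int n = 1 << 24;     // assumed input size
  const int run_length = 8;  // assumed: every key repeats 8 times in a row
  std::vector<int> h_keys(n);
  for (int i = 0; i < n; ++i) h_keys[i] = i / run_length;
  thrust::device_vector<int> keys = h_keys;
  thrust::device_vector<int> vals(n, 1);

  const Timing cub_t = run_cub(keys, vals);
  const Timing thrust_t = run_thrust(keys, vals);
  assert(cub_t.num_runs == thrust_t.num_runs);  // both paths must agree
  std::printf("cub:    %.3f ms (%d runs)\n", cub_t.ms, cub_t.num_runs);
  std::printf("thrust: %.3f ms (%d runs)\n", thrust_t.ms, thrust_t.num_runs);
  return 0;
}
```

Event timing brackets the whole call, so the Thrust figure includes the device-to-host transfer mentioned above. To compare kernel time alone, the same binary would need to be inspected with a profiler such as Nsight Systems.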
How to Reproduce
Expected behavior
The CUDA kernel of cub::DeviceReduce::ReduceByKey should match the performance of the thrust::reduce_by_key kernel. The CCCL team should ensure that refactoring Thrust APIs to use CUB APIs does not cause a performance regression, regardless of input size and type.
Reproduction link
No response
Operating System
SUSE Linux Enterprise Server 15 SP5
nvidia-smi output
NVCC version
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Jan_15_19:20:09_PST_2025
Cuda compilation tools, release 12.8, V12.8.61
Build cuda_12.8.r12.8/compiler.35404655_0