
[BUG]: The CUDA kernel of cub::DeviceReduce::ReduceByKey is likely slower than thrust::reduce_by_key #3942

Open
lilohuang opened this issue Feb 26, 2025 · 0 comments
Labels
bug Something isn't working right.

Comments


lilohuang commented Feb 26, 2025

Is this a duplicate?

Type of Bug

Performance

Component

CUB

Describe the bug

As I learned from #26, the CCCL team plans to refactor thrust::reduce_by_key to use cub::DeviceReduce::ReduceByKey.

However, as shown below, I observed that the CUDA kernel launched by cub::DeviceReduce::ReduceByKey is slower than the one launched by thrust::reduce_by_key.

[Screenshot: Nsight Systems profile of the reproducer, in which the cub::DeviceReduce::ReduceByKey kernel shows a longer duration than the kernel launched by thrust::reduce_by_key.]

I know that the wall-clock elapsed time of thrust::reduce_by_key is longer than that of cub::DeviceReduce::ReduceByKey because of an additional device-to-host data transfer. My concern is that if thrust::reduce_by_key is migrated to use cub::DeviceReduce::ReduceByKey, it will end up running a slower kernel, which would affect CCCL users.

How to Reproduce

% nvcc -arch=sm_80 main.cu -lcudart
% nsys profile --stats=true --trace=cuda,nvtx -s cpu -b dwarf --cudabacktrace=true ./a.out
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <cub/cub.cuh>
#include <nvtx3/nvToolsExt.h>

int main() {
   auto const num_elements = 250000;

   // cub::DeviceReduce::ReduceByKey
   {
      thrust::device_vector<int> data(num_elements, 1);
      thrust::device_vector<int> keys(num_elements, 1);
      thrust::device_vector<int> keys_out(num_elements);
      thrust::device_vector<int> out(num_elements);
      thrust::device_vector<int> num_runs_out(1);
      size_t temp_storage_bytes = 0;
      // First call with a null temp-storage pointer only queries the required size.
      cub::DeviceReduce::ReduceByKey(
         nullptr, temp_storage_bytes,
         keys.begin(), keys_out.begin(),
         data.begin(), out.begin(),
         num_runs_out.begin(),
         thrust::plus<int>(), num_elements);
      thrust::device_vector<char> d_temp_storage(temp_storage_bytes);

      nvtxRangePushA("cub::DeviceReduce::ReduceByKey");
      // Second call performs the actual reduction using the allocated temporary storage.
      cub::DeviceReduce::ReduceByKey(
         d_temp_storage.data().get(), temp_storage_bytes,
         keys.begin(), keys_out.begin(),
         data.begin(), out.begin(),
         num_runs_out.begin(),
         thrust::plus<int>(), num_elements);
      nvtxRangePop();
   }

   // thrust::reduce_by_key
   {
      thrust::device_vector<int> data(num_elements, 1);
      thrust::device_vector<int> keys(num_elements, 1);
      thrust::device_vector<int> keys_out(num_elements);
      thrust::device_vector<int> out(num_elements);
      nvtxRangePushA("thrust::reduce_by_key");
      thrust::reduce_by_key(keys.begin(), keys.end(), data.begin(), keys_out.begin(), out.begin());
      nvtxRangePop();
   }

   return 0;
}
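
One way to make the kernel-level comparison in nsys less sensitive to single-launch noise is to repeat each call many times inside its NVTX range, so the reported kernel statistics are averaged over multiple launches. Below is a minimal sketch of such a variant of the reproducer; the iteration count of 100 is an arbitrary choice, not something from the original report.

#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/reduce.h>
#include <cub/cub.cuh>
#include <nvtx3/nvToolsExt.h>

int main() {
   auto const num_elements = 250000;
   auto const num_iterations = 100; // arbitrary repeat count, chosen only to average out noise

   thrust::device_vector<int> data(num_elements, 1);
   thrust::device_vector<int> keys(num_elements, 1);
   thrust::device_vector<int> keys_out(num_elements);
   thrust::device_vector<int> out(num_elements);
   thrust::device_vector<int> num_runs_out(1);

   // Query and allocate CUB's temporary storage once, outside the timed region.
   size_t temp_storage_bytes = 0;
   cub::DeviceReduce::ReduceByKey(
      nullptr, temp_storage_bytes,
      keys.begin(), keys_out.begin(),
      data.begin(), out.begin(),
      num_runs_out.begin(),
      thrust::plus<int>(), num_elements);
   thrust::device_vector<char> d_temp_storage(temp_storage_bytes);

   nvtxRangePushA("cub::DeviceReduce::ReduceByKey x100");
   for (int i = 0; i < num_iterations; ++i) {
      cub::DeviceReduce::ReduceByKey(
         d_temp_storage.data().get(), temp_storage_bytes,
         keys.begin(), keys_out.begin(),
         data.begin(), out.begin(),
         num_runs_out.begin(),
         thrust::plus<int>(), num_elements);
   }
   cudaDeviceSynchronize(); // CUB launches are asynchronous; wait before closing the range
   nvtxRangePop();

   nvtxRangePushA("thrust::reduce_by_key x100");
   for (int i = 0; i < num_iterations; ++i) {
      // thrust::reduce_by_key blocks the host, so no extra synchronization is needed here.
      thrust::reduce_by_key(keys.begin(), keys.end(), data.begin(),
                            keys_out.begin(), out.begin());
   }
   nvtxRangePop();

   return 0;
}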

Expected behavior

The CUDA kernel of cub::DeviceReduce::ReduceByKey should perform the same as the one used by thrust::reduce_by_key. The CCCL team should make every effort to ensure that refactoring the Thrust APIs on top of the CUB APIs does not cause a performance regression, regardless of input size and type.

Reproduction link

No response

Operating System

SUSE Linux Enterprise Server 15 SP5

nvidia-smi output

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.06             Driver Version: 535.183.06   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  | 00000000:01:00.0 Off |                    0 |
| N/A   33C    P0              74W / 400W |    923MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-40GB          On  | 00000000:41:00.0 Off |                    0 |
| N/A   33C    P0              56W / 400W |     39MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM4-40GB          On  | 00000000:81:00.0 Off |                    0 |
| N/A   31C    P0              60W / 400W |     39MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM4-40GB          On  | 00000000:C1:00.0 Off |                    0 |
| N/A   31C    P0              57W / 400W |     39MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

NVCC version

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Jan_15_19:20:09_PST_2025
Cuda compilation tools, release 12.8, V12.8.61
Build cuda_12.8.r12.8/compiler.35404655_0

@lilohuang lilohuang added the bug Something isn't working right. label Feb 26, 2025
@github-project-automation github-project-automation bot moved this to Todo in CCCL Feb 26, 2025