
[BUG]: The CUDA kernel of cub::DeviceReduce::ReduceByKey is likely slower than thrust::reduce_by_key #3942

Open
lilohuang opened this issue Feb 26, 2025 · 0 comments
Labels
bug Something isn't working right.

Comments


lilohuang commented Feb 26, 2025

Is this a duplicate?

Type of Bug

Performance

Component

CUB

Describe the bug

As I learned from #26, the CCCL team plans to refactor thrust::reduce_by_key to use cub::DeviceReduce::ReduceByKey.

However, as shown below, I observed that the CUDA kernel launched by cub::DeviceReduce::ReduceByKey is slower than the one launched by thrust::reduce_by_key.

[Screenshot: Nsight Systems profile of the reproducer, in which the cub::DeviceReduce::ReduceByKey kernel shows a longer duration than the kernel launched by thrust::reduce_by_key.]

I know that the wall-clock elapsed time of thrust::reduce_by_key is longer than that of cub::DeviceReduce::ReduceByKey because of an additional device-to-host data transfer. My concern is that if thrust::reduce_by_key is migrated to use cub::DeviceReduce::ReduceByKey, it will end up running a slower kernel, which would affect CCCL users.

How to Reproduce

% nvcc -arch=sm_80 main.cu -lcudart
% nsys profile --stats=true --trace=cuda,nvtx -s cpu -b dwarf --cudabacktrace=true ./a.out
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <cub/cub.cuh>
#include <nvtx3/nvToolsExt.h>

int main() {
   auto const num_elements = 250000;

   // cub::DeviceReduce::ReduceByKey
   {
      thrust::device_vector<int> data(num_elements, 1);
      thrust::device_vector<int> keys(num_elements, 1);
      thrust::device_vector<int> keys_out(num_elements);
      thrust::device_vector<int> out(num_elements);
      thrust::device_vector<int> num_runs_out(1);
      size_t temp_storage_bytes = 0;
      // First call with a null temp-storage pointer only queries the required size.
      cub::DeviceReduce::ReduceByKey(
         nullptr, temp_storage_bytes,
         keys.begin(), keys_out.begin(),
         data.begin(), out.begin(),
         num_runs_out.begin(),
         thrust::plus<int>(), num_elements);
      thrust::device_vector<char> d_temp_storage(temp_storage_bytes);

      nvtxRangePushA("cub::DeviceReduce::ReduceByKey");
      // Second call performs the actual reduction using the allocated temporary storage.
      cub::DeviceReduce::ReduceByKey(
         d_temp_storage.data().get(), temp_storage_bytes,
         keys.begin(), keys_out.begin(),
         data.begin(), out.begin(),
         num_runs_out.begin(),
         thrust::plus<int>(), num_elements);
      nvtxRangePop();
   }

   // thrust::reduce_by_key
   {
      thrust::device_vector<int> data(num_elements, 1);
      thrust::device_vector<int> keys(num_elements, 1);
      thrust::device_vector<int> keys_out(num_elements);
      thrust::device_vector<int> out(num_elements);
      nvtxRangePushA("thrust::reduce_by_key");
      thrust::reduce_by_key(keys.begin(), keys.end(), data.begin(), keys_out.begin(), out.begin());
      nvtxRangePop();
   }

   return 0;
}
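
One way to make the kernel-level comparison in nsys less sensitive to single-launch noise is to repeat each call many times inside its NVTX range, so the reported kernel statistics are averaged over multiple launches. Below is a minimal sketch of such a variant of the reproducer; the iteration count of 100 is an arbitrary choice, not something from the original report.

#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/reduce.h>
#include <cub/cub.cuh>
#include <nvtx3/nvToolsExt.h>

int main() {
   auto const num_elements = 250000;
   auto const num_iterations = 100; // arbitrary repeat count, chosen only to average out noise

   thrust::device_vector<int> data(num_elements, 1);
   thrust::device_vector<int> keys(num_elements, 1);
   thrust::device_vector<int> keys_out(num_elements);
   thrust::device_vector<int> out(num_elements);
   thrust::device_vector<int> num_runs_out(1);

   // Query and allocate CUB's temporary storage once, outside the timed region.
   size_t temp_storage_bytes = 0;
   cub::DeviceReduce::ReduceByKey(
      nullptr, temp_storage_bytes,
      keys.begin(), keys_out.begin(),
      data.begin(), out.begin(),
      num_runs_out.begin(),
      thrust::plus<int>(), num_elements);
   thrust::device_vector<char> d_temp_storage(temp_storage_bytes);

   nvtxRangePushA("cub::DeviceReduce::ReduceByKey x100");
   for (int i = 0; i < num_iterations; ++i) {
      cub::DeviceReduce::ReduceByKey(
         d_temp_storage.data().get(), temp_storage_bytes,
         keys.begin(), keys_out.begin(),
         data.begin(), out.begin(),
         num_runs_out.begin(),
         thrust::plus<int>(), num_elements);
   }
   cudaDeviceSynchronize(); // CUB launches are asynchronous; wait before closing the range
   nvtxRangePop();

   nvtxRangePushA("thrust::reduce_by_key x100");
   for (int i = 0; i < num_iterations; ++i) {
      // thrust::reduce_by_key blocks the host, so no extra synchronization is needed here.
      thrust::reduce_by_key(keys.begin(), keys.end(), data.begin(),
                            keys_out.begin(), out.begin());
   }
   nvtxRangePop();

   return 0;
}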

Expected behavior

The CUDA kernel of cub::DeviceReduce::ReduceByKey should perform the same as the one used by thrust::reduce_by_key. The CCCL team should make every effort to ensure that refactoring the Thrust APIs on top of the CUB APIs does not cause a performance regression, regardless of input size and type.

Reproduction link

No response

Operating System

SUSE Linux Enterprise Server 15 SP5

nvidia-smi output

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.06             Driver Version: 535.183.06   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  | 00000000:01:00.0 Off |                    0 |
| N/A   33C    P0              74W / 400W |    923MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-40GB          On  | 00000000:41:00.0 Off |                    0 |
| N/A   33C    P0              56W / 400W |     39MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM4-40GB          On  | 00000000:81:00.0 Off |                    0 |
| N/A   31C    P0              60W / 400W |     39MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM4-40GB          On  | 00000000:C1:00.0 Off |                    0 |
| N/A   31C    P0              57W / 400W |     39MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

NVCC version

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Jan_15_19:20:09_PST_2025
Cuda compilation tools, release 12.8, V12.8.61
Build cuda_12.8.r12.8/compiler.35404655_0

@lilohuang lilohuang added the bug Something isn't working right. label Feb 26, 2025
@github-project-automation github-project-automation bot moved this to Todo in CCCL Feb 26, 2025