[SYCL] Poor Performance - Very Low bandwidth of SYCL kernel for Shuffle  #8322

Open
@ManojkumarGawali

Describe the bug
The SYCL implementation of the Shuffle primitive achieves very low memory bandwidth on Nvidia GPUs. This reproducer was
created to help improve that performance.
It runs the Shuffle algorithm and then calculates the achieved memory bandwidth.
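
For readers without the attachments, the following is a minimal CPU reference sketch of a channel shuffle over a tensor viewed as [outer_size][axis_size][inner_size]. It assumes the common convention in which the axis is split into group_size groups and the (group, channel-in-group) pair is transposed; the attached kernel may order the indices differently. The name shuffle_reference is hypothetical and does not come from shuffle_kernels.hpp.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference for a channel shuffle on a tensor laid out as
// [outer_size][axis_size][inner_size].
// ASSUMPTION: the axis is viewed as [group_size][axis_size / group_size]
// and transposed. The attached SYCL kernel may use another convention.
std::vector<float> shuffle_reference(const std::vector<float>& src,
                                     std::size_t outer_size,
                                     std::size_t axis_size,
                                     std::size_t inner_size,
                                     std::size_t group_size) {
    assert(group_size > 0 && axis_size % group_size == 0);
    assert(src.size() == outer_size * axis_size * inner_size);

    const std::size_t per_group = axis_size / group_size;
    std::vector<float> dst(src.size());

    for (std::size_t ou = 0; ou < outer_size; ++ou) {
        for (std::size_t g = 0; g < group_size; ++g) {
            for (std::size_t k = 0; k < per_group; ++k) {
                const std::size_t c_src = g * per_group + k;   // group-major index
                const std::size_t c_dst = k * group_size + g;  // transposed index
                const std::size_t src_base = (ou * axis_size + c_src) * inner_size;
                const std::size_t dst_base = (ou * axis_size + c_dst) * inner_size;
                // Each channel is a contiguous run of inner_size elements.
                for (std::size_t in = 0; in < inner_size; ++in) {
                    dst[dst_base + in] = src[src_base + in];
                }
            }
        }
    }
    return dst;
}
```

With the sizes reported in this issue, the call would be shuffle_reference(src, 10, 40, 2240, 4). Each element is read once and written once, so the operation is bound by memory traffic rather than arithmetic.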

Reproducer

For the reproducer code, refer to the attachments setup.sh, shuffle.cpp and shuffle_kernels.hpp.

  1. Go to the directory containing the reproducer files and run the script below to set up the environment:
    source setup.sh

  2. To compile, run:
    clang++ -O3 -fsycl -fsycl-targets=nvptx64-nvidia-cuda shuffle.cpp

  3. The above generates the executable a.out. To see the measured bandwidth, run:
    ./a.out

Observed behavior

For axis = 1; axis_size = 40; ndims = 5; outer_size = 10; inner_size = 2240; group_size = 4; wg_size = 32; MB = 10;
C = 40; H = 20; W = 2; D = 56; HW = 40; SP = 2240;

Result:
Total time = 0.048315 sec
Total bandwidth = 0.148360 Gb/s
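
The reported figure can be cross-checked with a short calculation. The sketch below assumes that the bandwidth counts one read and one write per element, that each element is 4 bytes, and that 1 GB is 1e9 bytes; these are inferences from the numbers above, not taken from the attached source. The function name effective_bandwidth_gbs is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Effective bandwidth in GB/s (1 GB = 1e9 bytes).
// ASSUMPTION: every element is read once and written once.
double effective_bandwidth_gbs(std::size_t elements,
                               std::size_t bytes_per_element,
                               double seconds) {
    assert(seconds > 0.0);
    const double bytes_moved = 2.0 * static_cast<double>(elements) *
                               static_cast<double>(bytes_per_element);
    return bytes_moved / seconds / 1e9;
}
```

With outer_size * axis_size * inner_size = 10 * 40 * 2240 = 896000 elements, effective_bandwidth_gbs(896000, 4, 0.048315) gives approximately 0.14836, which agrees with the reported total bandwidth under these assumptions. The same traffic (7168000 bytes) at 100 GB/s would take roughly 72 microseconds.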

Expected behavior
Ideally, the kernel should attain the maximum bandwidth (i.e., the CLPeak value for Float16 = 251.94 GBPS) for any input size.
For the current reproducer, a bandwidth of at least 100 GBPS is expected.

Environment

-OS: Ubuntu 22.04.1 LTS
-Target device and vendor: Nvidia, Tesla T4
-DPC++ version: clang version 15.0.0 (https://github.com/intel/llvm.git 0c7a1e1)
-Dependencies version: Driver Version: 495.29.05 CUDA Version: 11.5

Additional context

Currently, axis, axis_size, ndims, outer_size, inner_size, group_size, wg_size, MB, C, H, W, D, HW, SP are hard-coded and can be changed as needed.
The source code of the reproducer is attached below:
Shuffle_Reproducer.zip

Labels: bug (Something isn't working), cuda (CUDA back-end), performance (Performance related issues)