
[SYCL] Low Bandwidth Obtained for Pooling Kernels in SYCL Implementation #8321

Open
@AJAYRATNAM


Describe the bug 
This reproducer exercises the pooling operation in SYCL. The goal is to improve its performance, i.e., to attain a higher memory bandwidth on an Nvidia device; the bandwidth currently obtained is very low.

To Reproduce 

  1. The source code is provided in setup.sh, ref_pooling.cpp and ref_pooling_kernels.hpp.
  2. Run the following to set up the environment:
        source setup.sh
    To compile:
        clang++ -O2 -fsycl -fsycl-targets=nvptx64-nvidia-cuda ref_pooling.cpp
  3. To check the output of the reproducer, run:
        ./a.out
  4. Currently, we observe very low bandwidth, as discussed below.

Observed behavior
The minimum bandwidth observed is:

Forward Bandwidth :29.908275 gb/sec
Backward Bandwidth :17.696751 gb/sec 

The maximum bandwidth observed is:

Forward Bandwidth :44.234640 gb/sec
Backward Bandwidth :25.833750 gb/sec 

Expected behavior

Ideally, the device should be fully utilized. For the current reproducer, at least 40-60% of the maximum bandwidth (i.e., the CLPeak value for Float16 = 251.94 GBPS) is expected,
i.e., a minimum of 100 GBPS is desired.
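
The arithmetic behind that target can be sketched as follows. This is only an illustration in plain C++, not part of the attached reproducer; the helper names `utilization` and `target_bandwidth` are hypothetical, and the only inputs are the CLPeak figure and the measured bandwidths quoted in this report.

```cpp
#include <cassert>
#include <cstdio>

// Fraction of the peak bandwidth that a measured value achieves.
double utilization(double measured_gbps, double peak_gbps) {
    assert(peak_gbps > 0.0);
    return measured_gbps / peak_gbps;
}

// Bandwidth that corresponds to a given fraction of the peak.
double target_bandwidth(double fraction, double peak_gbps) {
    assert(fraction >= 0.0 && fraction <= 1.0);
    return fraction * peak_gbps;
}

// Prints the 40-60% target range and the utilization of the best observed runs.
void print_summary() {
    const double peak = 251.94;  // CLPeak value for Float16 quoted in this report, GB/s
    std::printf("target range: %.1f to %.1f GB/s\n",
                target_bandwidth(0.40, peak), target_bandwidth(0.60, peak));
    std::printf("best forward:  %.1f%% of peak\n",
                100.0 * utilization(44.234640, peak));
    std::printf("best backward: %.1f%% of peak\n",
                100.0 * utilization(25.833750, peak));
}
```

With these inputs, 40% of the peak is about 100.8 GBPS, while the best observed forward and backward runs reach roughly 17.6% and 10.3% of the peak respectively.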

Environment 

  • OS: Ubuntu 22.04.1 LTS
  • Target device and vendor: Nvidia, Tesla T4
  • DPC++ version: clang version 15.0.0 (https://github.com/intel/llvm.git 0c7a1e1)                    
  • Dependencies version: Driver Version: 495.29.05, CUDA Version: 11.5

Additional context 

  1. The reported bandwidth varies between runs; run a.out multiple times to see a representative value.
  2. The inputs and algorithm names are hard-coded.
  3. The code files are attached here:
    Pooling.zip
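
Since the figure changes from run to run, one way to report it is to collect the bandwidth printed by several runs and summarize the spread. The following is a minimal sketch in plain C++ under that assumption; `BandwidthSummary` and `summarize` are hypothetical names and are not taken from the attached files.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Minimum, median and maximum of a set of bandwidth measurements (GB/s).
struct BandwidthSummary {
    double min;
    double median;
    double max;
};

// Takes the samples by value so the caller's vector is left unsorted.
BandwidthSummary summarize(std::vector<double> samples) {
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());
    const std::size_t n = samples.size();
    // For an even count, the median is the mean of the two middle values.
    const double median = (n % 2 == 1)
        ? samples[n / 2]
        : 0.5 * (samples[n / 2 - 1] + samples[n / 2]);
    return {samples.front(), median, samples.back()};
}
```

Passing the forward (or backward) bandwidth from each run to `summarize` gives the minimum and maximum in the same form as the "Observed behavior" section, plus a median that is less sensitive to a single outlier run.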

    Labels

    bug (Something isn't working), cuda (CUDA back-end), performance (Performance related issues)
