Description
Describe the bug
We created a reproducer for the pooling operation in SYCL to investigate its performance, specifically how to reach a higher memory bandwidth on an Nvidia device.
To Reproduce
- The source code consists of setup.sh, ref_pooling.cpp and ref_pooling_kernels.hpp.
- Run the following to set up the environment:
source setup.sh
- To compile:
clang++ -O2 -fsycl -fsycl-targets=nvptx64-nvidia-cuda ref_pooling.cpp
- To check the output of the reproducer, run:
./a.out
- Currently, we observe very low bandwidth, as detailed below.
Observed behavior
The minimum bandwidth observed is:
Forward Bandwidth :29.908275 gb/sec
Backward Bandwidth :17.696751 gb/sec
The maximum bandwidth observed is:
Forward Bandwidth :44.234640 gb/sec
Backward Bandwidth :25.833750 gb/sec
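The minimum and maximum figures above come from repeated runs. As a rough sketch of how such per-run values could be computed and summarized: the helper names `bandwidth_gbps` and `summarize` are hypothetical and are not taken from the attached reproducer, and the conversion assumes 1 GB = 1e9 bytes.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper (not from the reproducer): convert the number of
// bytes read plus written and the elapsed time into GB/s, assuming
// 1 GB = 1e9 bytes.
inline double bandwidth_gbps(double bytes_moved, double seconds) {
    assert(seconds > 0.0);
    return bytes_moved / seconds / 1e9;
}

struct BandwidthSummary {
    double min_gbps;
    double median_gbps;
    double max_gbps;
};

// Summarize the per-run bandwidth samples. For an even number of samples
// the upper of the two middle values is reported as the median.
inline BandwidthSummary summarize(std::vector<double> samples) {
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());
    return {samples.front(), samples[samples.size() / 2], samples.back()};
}
```

Reporting the median alongside the minimum and maximum makes it easier to tell whether a low figure is an outlier or the typical case.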
Expected behavior
Ideally, the reproducer should make full use of the device. We expect it to reach at least 40-60% of the peak bandwidth (the clpeak value for float16 is 251.94 GB/s), i.e., a minimum of about 100 GB/s.
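The target can be checked with a few lines of arithmetic. The sketch below uses only the peak value quoted in this report; the function names are hypothetical and chosen for illustration.

```cpp
#include <cassert>

// Peak bandwidth quoted in this report (clpeak, float16), in GB/s.
constexpr double kPeakGbps = 251.94;

// Fraction of the peak that an observed bandwidth corresponds to.
inline double fraction_of_peak(double observed_gbps,
                               double peak_gbps = kPeakGbps) {
    assert(peak_gbps > 0.0);
    return observed_gbps / peak_gbps;
}

// Bandwidth in GB/s that corresponds to a given fraction of the peak.
inline double target_gbps(double fraction, double peak_gbps = kPeakGbps) {
    return fraction * peak_gbps;
}
```

With these helpers, `target_gbps(0.40)` gives roughly 100.8 GB/s, which is where the 100 GB/s minimum comes from. `fraction_of_peak(44.234640)` shows that the best forward result is only about 17.6% of the peak.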
Environment
- OS: Ubuntu 22.04.1 LTS
- Target device and vendor: Nvidia, Tesla T4
- DPC++ version: clang version 15.0.0 (https://github.com/intel/llvm.git 0c7a1e1)
- Dependencies version: Driver Version: 495.29.05 CUDA Version: 11.5
Additional context
- Results vary between runs, so run a.out several times to get a representative value.
- The inputs and algorithm names are hard-coded.
- The code files are attached: Pooling.zip