[SYCL] Low Bandwidth Obtained for Pooling Kernels in SYCL Implementation

**Describe the bug** 
The attached reproducer runs the forward and backward Pooling kernels in SYCL on an Nvidia device and reports the memory bandwidth achieved. The measured bandwidth is far below the device peak, and we would like to improve it.

**To Reproduce** 
1. Obtain the source files _setup.sh_, _ref_pooling.cpp_ and _ref_pooling_kernels.hpp_ (attached under Additional context).
2. Set up the environment:
    `source setup.sh`
   Then compile:
    `clang++ -O2 -fsycl -fsycl-targets=nvptx64-nvidia-cuda ref_pooling.cpp`
3. Run the reproducer to see its output:
   `./a.out`
4. The reported bandwidth is very low, as described below.


**Observed behavior**
The minimum bandwidth observed is:
```
Forward Bandwidth :29.908275 gb/sec
Backward Bandwidth :17.696751 gb/sec 
```
The maximum bandwidth observed is:
```
Forward Bandwidth :44.234640 gb/sec
Backward Bandwidth :25.833750 gb/sec 
```

**Expected behavior**
Ideally the kernels should make full use of the device. For this reproducer we expect at least 40-60% of the peak bandwidth (clpeak reports 251.94 GB/s for float16), i.e., a minimum of about 100 GB/s.

**Environment** 
- OS: Ubuntu 22.04.1 LTS
- Target device and vendor: Nvidia, Tesla T4
- DPC++ version: clang version 15.0.0 (https://github.com/intel/llvm.git 0c7a1e18978754451f5c2c95129721297e2c2411)                    
- Dependencies version: Driver Version 495.29.05, CUDA Version 11.5

**Additional context** 
1. Run _a.out_ several times to get a representative value, since the results vary between runs.
2. The inputs and algorithm names are hard-coded.
3. The code files are attached here:
[Pooling.zip](https://github.com/intel/llvm/files/10719852/Pooling.zip)

[SYCL] Low Bandwidth Obtained for Pooling Kernels in SYCL Implementation #8321
