Description
For platform compatibility, we do not launch kernels with the device's maximum work-group size. Instead, we query the kernel-specific maximum work-group size through the SYCL API. The following is our code example:
auto kid = ::sycl::get_kernel_id<KernelClass>();
auto kbundle = ::sycl::get_kernel_bundle<::sycl::bundle_state::executable>(
ctx, {dev}, {kid});
::sycl::kernel k = kbundle.get_kernel(kid);
// The work_group_size descriptor returns size_t, so avoid narrowing to int.
size_t max_work_group_size = k.get_info<::sycl::info::kernel_device_specific::work_group_size>(dev);
We found that this usage adds significant host overhead in our application. We measured the CPU time of each call for one kernel; each API name in the table corresponds to a call in the example code above:
API | get_kernel_id | get_kernel_bundle | get_kernel | get_info |
---|---|---|---|---|
time (us) | 0.434 | 42.481 | 4.241 | 1.125 |
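For anyone trying to reproduce numbers like these, here is a minimal sketch of a host-side timing helper in standard C++. It is not necessarily how the measurements above were taken. The name `average_time_us` and the iteration count are hypothetical, and averaging over many calls hides any one-time cost on the first call, so the first call may need to be timed separately.

```cpp
#include <chrono>
#include <cstddef>
#include <utility>

// Hypothetical helper (not part of SYCL): returns the average wall-clock
// time of one call to `fn`, in microseconds, over `iterations` calls.
template <typename Fn>
double average_time_us(Fn&& fn, std::size_t iterations) {
  if (iterations == 0) {
    return 0.0;  // Nothing to measure; also avoids dividing by zero.
  }
  using clock = std::chrono::steady_clock;  // Monotonic, suited to intervals.
  const auto start = clock::now();
  for (std::size_t i = 0; i < iterations; ++i) {
    fn();
  }
  const auto stop = clock::now();
  // Convert the elapsed interval to floating-point microseconds.
  const std::chrono::duration<double, std::micro> total = stop - start;
  return total.count() / static_cast<double>(iterations);
}
```

Each of the four calls in the example code could be wrapped in its own lambda and passed to this helper, for instance a lambda that only calls `::sycl::get_kernel_bundle<::sycl::bundle_state::executable>(ctx, {dev}, {kid})`, so that the per-call cost can be compared across runtimes.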
We have also filed an internal Jira ticket to track this issue. Could you help evaluate this performance overhead?