Open
Description
Is your feature request related to a problem? Please describe
When in-order queues are used, some events returned from queue::submit
can be discarded. This leads to a repeated "createEvent/recordEvent/destroyEvent" around any submitted operation.
Here is an example from our application running with the CUDA backend, where we see the following sequence of operations:
- Submission of
BondedKernel
:cuEventCreate
- Enqueueing the kernel
cuEventRecord
cuEventDestroy
, when the returnsycl::event
is discarded
- Submission of
NbnxmKernel
:cuEventCreate
- Enqueueing the kernel
cuEventRecord
cuEventDestroy
, when the returnsycl::event
is again discarded.
This can be relevant in the cases when multiple small kernels are submitted to the device
I know of two existing mitigations:
sycl_ext_oneapi_discard_queue_events
: Does not allow any events, which is overly limiting. E.g., one might wish to submit a chain of kernels, then get an event for their completion.SYCL_PI_LEVEL_ZERO_REUSE_DISCARDED_EVENTS
: Level-Zero specific and requires the use of the previous one.
Describe the solution you would like
HIPSYCL_EXT_COARSE_GRAINED_EVENTS
seems to offer a good balance by letting the developers annotate when the recording of an event can be skipped.- For GROMACS, returning either an OpenSYCL-style coarse-grained event (which does a
queue::wait
when waited upon) or a DPC++-style invalid event (which throws when waited upon) is fine.
- For GROMACS, returning either an OpenSYCL-style coarse-grained event (which does a
- Caching and reusing events instead of creating and destroying them. This seems to be done in [SYCL] Reuse discarded L0 events in scope of command list #7256, but only for LevelZero and only when the whole queue is in "discard_events" mode.
- CUDA events can be re-used just fine. I strongly suspect the same for HIP.
Describe alternatives you have considered
- Using
sycl_ext_oneapi_discard_queue_events
and a host-task in a separate, non-discarding queue doingqueue::wait
whenever an event is needed. It seems overly complicated and highly likely inefficient when the event is needed for synchronizing between two GPU queues.