Optimization: Reduce the number of pi_event-related PI calls

**Is your feature request related to a problem? Please describe**

When in-order queues are used, some events returned from `queue::submit` can be discarded. This leads to a repeated "createEvent/recordEvent/destroyEvent" around any submitted operation.

Here is an example from our application running with the CUDA backend, where we see the following sequence of operations:

- Submission of `BondedKernel`:
  - `cuEventCreate`
  - Enqueueing the kernel
  - `cuEventRecord`
  - `cuEventDestroy`, when the return `sycl::event` is discarded
- Submission of `NbnxmKernel`:
  - `cuEventCreate`
  - Enqueueing the kernel
  - `cuEventRecord`
  - `cuEventDestroy`, when the return `sycl::event` is again discarded.

This can be relevant in the cases when multiple small kernels are submitted to the device

![Screenshot_20230217_150523](https://user-images.githubusercontent.com/933873/219677744-01fe3997-d1dc-4625-80db-5991accbf58e.png)

I know of two existing mitigations:

- [`sycl_ext_oneapi_discard_queue_events`](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/supported/sycl_ext_oneapi_discard_queue_events.asciidoc): Does not allow *any* events, which is overly limiting. E.g., one might wish to submit a chain of kernels, then get an event for their completion.
- `SYCL_PI_LEVEL_ZERO_REUSE_DISCARDED_EVENTS`: Level-Zero specific and requires the use of the previous one.

**Describe the solution you would like**

- [`HIPSYCL_EXT_COARSE_GRAINED_EVENTS`](https://github.com/OpenSYCL/OpenSYCL/blob/develop/doc/extensions.md#hipsycl_ext_coarse_grained_events) seems to offer a good balance by letting the developers annotate when the recording of an event can be skipped.
  - For GROMACS, returning either an OpenSYCL-style coarse-grained event (which does a `queue::wait` when waited upon) or a DPC++-style invalid event (which throws when waited upon) is fine.
- Caching and reusing events instead of creating and destroying them. This seems to be done in #7256, but only for LevelZero and only when the whole queue is in "discard_events" mode.
  - CUDA events can be re-used just fine. I strongly suspect the same for HIP.

**Describe alternatives you have considered**

- Using `sycl_ext_oneapi_discard_queue_events` and a host-task in a separate, non-discarding queue doing `queue::wait` whenever an event is needed. It seems overly complicated and highly likely inefficient when the event is needed for synchronizing between two GPU queues.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Optimization: Reduce the number of pi_event-related PI calls #8390

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Optimization: Reduce the number of pi_event-related PI calls #8390

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions