
Optimization: Reduce the number of pi_event-related PI calls #8390

Open
@al42and

Description


Is your feature request related to a problem? Please describe

When in-order queues are used, the events returned from queue::submit are often discarded. This leads to a repeated createEvent/recordEvent/destroyEvent sequence around every submitted operation.

Here is an example from our application running with the CUDA backend, where we see the following sequence of operations:

  • Submission of BondedKernel:
    • cuEventCreate
    • Enqueueing the kernel
    • cuEventRecord
    • cuEventDestroy, once the returned sycl::event is discarded
  • Submission of NbnxmKernel:
    • cuEventCreate
    • Enqueueing the kernel
    • cuEventRecord
    • cuEventDestroy, once the returned sycl::event is again discarded.

This overhead becomes relevant when many small kernels are submitted to the device (see the sketch below).

[Screenshot_20230217_150523: trace from the application illustrating the call sequence above]
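
For illustration, a minimal sketch of this pattern (the queue setup and kernels are placeholders, not the actual GROMACS code):

```cpp
#include <sycl/sycl.hpp>

void submit_chain(sycl::queue &q, float *data, size_t n) {
    // q is assumed to be an in-order queue, e.g. created with
    // sycl::queue{dev, sycl::property::queue::in_order{}}.

    // Each submit returns a sycl::event that is immediately discarded, yet the
    // runtime still does createEvent / recordEvent / destroyEvent per submission.
    q.submit([&](sycl::handler &cgh) {
        cgh.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) { data[i] *= 2.0f; });
    }); // returned event discarded -> cuEventCreate/cuEventRecord/cuEventDestroy

    q.submit([&](sycl::handler &cgh) {
        cgh.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) { data[i] += 1.0f; });
    }); // returned event discarded again
}
```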

I know of two existing mitigations:

  • sycl_ext_oneapi_discard_queue_events (sketched after this list): does not allow any events at all, which is overly limiting. E.g., one might wish to submit a chain of kernels and then get an event for their completion.
  • SYCL_PI_LEVEL_ZERO_REUSE_DISCARDED_EVENTS: Level-Zero specific and requires the use of the previous one.
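
For reference, a rough sketch of how the first mitigation is used, based on the queue property described in the extension document (treat the exact spelling as governed by that document):

```cpp
#include <sycl/sycl.hpp>

sycl::queue make_discarding_queue(const sycl::device &dev) {
    // With ext::oneapi::property::queue::discard_events, *every* event returned
    // from submit() on this queue is invalid; calling wait() on such an event throws.
    return sycl::queue{
        dev,
        sycl::property_list{sycl::property::queue::in_order{},
                            sycl::ext::oneapi::property::queue::discard_events{}}};
}
```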

Describe the solution you would like

  • HIPSYCL_EXT_COARSE_GRAINED_EVENTS seems to offer a good balance by letting the developers annotate when the recording of an event can be skipped.
    • For GROMACS, returning either an OpenSYCL-style coarse-grained event (which does a queue::wait when waited upon) or a DPC++-style invalid event (which throws when waited upon) is fine.
  • Caching and reusing events instead of creating and destroying them. This seems to be done in [SYCL] Reuse discarded L0 events in scope of command list #7256, but only for LevelZero and only when the whole queue is in "discard_events" mode.
    • CUDA events can be reused just fine (see the sketch after this list); I strongly suspect the same holds for HIP.
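
To illustrate the second bullet, a hypothetical event pool at the CUDA level (not the actual PI/plugin code, just a sketch showing that CUDA events can be recycled instead of created and destroyed around every submission):

```cpp
#include <cuda_runtime.h>
#include <vector>

// Hypothetical pool: acquire() hands out an existing event when possible;
// a previously recorded cudaEvent_t can simply be re-recorded.
class EventPool {
    std::vector<cudaEvent_t> free_;
public:
    cudaEvent_t acquire() {
        if (free_.empty()) {
            cudaEvent_t e;
            cudaEventCreateWithFlags(&e, cudaEventDisableTiming);
            return e;
        }
        cudaEvent_t e = free_.back();
        free_.pop_back();
        return e;
    }
    void release(cudaEvent_t e) { free_.push_back(e); }
    ~EventPool() {
        for (cudaEvent_t e : free_) cudaEventDestroy(e);
    }
};
```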

Describe alternatives you have considered

  • Using sycl_ext_oneapi_discard_queue_events together with a host task in a separate, non-discarding queue that calls queue::wait whenever an event is needed (sketched below). This seems overly complicated and likely inefficient when the event is needed to synchronize two GPU queues.
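
A rough sketch of that workaround (illustrative only; the queue names and helper function are placeholders):

```cpp
#include <sycl/sycl.hpp>

// work_q: in-order queue created with discard_events; helper_q: ordinary queue.
sycl::event get_completion_event(sycl::queue &work_q, sycl::queue &helper_q) {
    return helper_q.submit([&](sycl::handler &cgh) {
        // The host task blocks a host thread until work_q drains, and only then
        // does the returned event complete, which is clumsy when the event is
        // meant to synchronize two GPU queues without host involvement.
        cgh.host_task([q = work_q]() mutable { q.wait(); });
    });
}
```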
