From 94f409f38b7caebfe2c2d110f935ee9acd4ac4bf Mon Sep 17 00:00:00 2001 From: Ewan Crawford Date: Fri, 21 Oct 2022 13:31:01 +0100 Subject: [PATCH 01/11] Unify with Codeplay Graph extension. Merge [SYCL_EXT_CODEPLAY_GRAPHS](https://github.com/codeplaysoftware/standards-proposals/pull/135) into SYCL_EXT_ONEAPI_GRAPH. This is a first cut at merging and follow-up changes to reconcile some differences will likely need to be made, either as commits to this branch before merging, or as subsequent PRs. --- .../command_graph-state.svg | 4 + .../sycl_ext_oneapi_graph/queue-state.svg | 4 + .../proposed/sycl_ext_oneapi_graph.asciidoc | 792 ++++++++++++++++-- 3 files changed, 730 insertions(+), 70 deletions(-) create mode 100644 sycl/doc/extensions/proposed/images/sycl_ext_oneapi_graph/command_graph-state.svg create mode 100644 sycl/doc/extensions/proposed/images/sycl_ext_oneapi_graph/queue-state.svg diff --git a/sycl/doc/extensions/proposed/images/sycl_ext_oneapi_graph/command_graph-state.svg b/sycl/doc/extensions/proposed/images/sycl_ext_oneapi_graph/command_graph-state.svg new file mode 100644 index 0000000000000..f3ed6a15a1f7d --- /dev/null +++ b/sycl/doc/extensions/proposed/images/sycl_ext_oneapi_graph/command_graph-state.svg @@ -0,0 +1,4 @@ + + + +
Finalize
Finalize
Modifiable
Modifiable
Executable
Executable
Text is not SVG - cannot display
\ No newline at end of file diff --git a/sycl/doc/extensions/proposed/images/sycl_ext_oneapi_graph/queue-state.svg b/sycl/doc/extensions/proposed/images/sycl_ext_oneapi_graph/queue-state.svg new file mode 100644 index 0000000000000..d51956d613098 --- /dev/null +++ b/sycl/doc/extensions/proposed/images/sycl_ext_oneapi_graph/queue-state.svg @@ -0,0 +1,4 @@ + + + +

Begin Recording

Begin Recording
Executing
Executing
End Recording
End Recording
Recording
Recording
Text is not SVG - cannot display
\ No newline at end of file diff --git a/sycl/doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc b/sycl/doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc index 34355dde2caca..b1d3c70f47e6a 100644 --- a/sycl/doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc +++ b/sycl/doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc @@ -8,6 +8,7 @@ :toc: left :encoding: utf-8 :lang: en +:sectnums: :blank: pass:[ +] @@ -31,6 +32,23 @@ This extension is written against the SYCL 2020 revision 5 specification. All references below to the "core SYCL specification" or to section numbers in the SYCL specification refer to that revision. +== Contributors + +Pablo Reble, Intel + +Julian Miller, Intel + +John Pennycook, Intel + +Guo Yejun, Intel + +Ewan Crawford, Codeplay + +Ben Tracy, Codeplay + +Duncan McBain, Codeplay + +Peter Žužek, Codeplay + +Ruyman Reyes, Codeplay + +Gordon Brown, Codeplay + +Erik Tomusk, Codeplay + +Bjoern Knafla, Codeplay + +Lukas Sommer, Codeplay + +Ronan Keryell, AMD + + == Status This is a proposed extension specification, intended to gather community @@ -41,10 +59,104 @@ not rely on APIs defined in this specification.* == Introduction -This extension introduces an interface that enables a lazy execution and easy replay of a kernel graph by separating -its definition and execution. - -== Feature test macro +Through the use of command groups SYCL is already able to create a DAG of kernel +execution at runtime, as a command group object defines a set of requisites +(edges) which must be satisfied for kernels (nodes) to be executed. However, +because command-group submission is tied to execution on the queue, without +having a prior construction step before starting execution, optimization +opportunities are missed from the runtime not knowing the complete dependency +graph ahead of execution. + +The following benefits would become possible if the user could define a +dependency graph to the SYCL runtime prior to execution: + +* Reduction in runtime overhead by only submitting a single graph object, rather + than many individual commands. + +* Enable more work to be done offline, in particular producing a graph ahead of + time allows for improved performance at runtime from reduced overhead. + +* Unlock DMA hardware features through graph analysis by the runtime. + +* Whole graph optimizations become available, including but not limited to: +** Kernel fusion/fission. +** Inter-node memory reuse from data staying resident on device. +** Identification of the peak intermediate output memory requirement, used for + more optimal memory allocation. + +As well as benefits to the SYCL runtime, there are also advantages to the user +developing SYCL applications, as repetitive workloads no longer have to +redundantly issue the same sequence of commands. Instead, a graph is only +constructed once and submitted for execution as many times as is necessary, only +changing the data in input buffers or USM allocations. For machine learning +applications where the same command group pattern is run repeatedly for +different inputs, this is particularly useful. + +=== Requirements + +In order to achieve the goals described in previous sections, the following +requirements were considered: + +1. Ability to update inputs/outputs of the graph between submissions, without + changing the overall graph structure. +2. Enable low effort porting of existing applications to use the extension. +3. Profiling, debugging, and tracing functionality at the granularity of graph + nodes. +4. Integrate sub-graphs (previously constructed graphs) when constructing a new + graph. +5. Support the USM model of memory as well as buffer model. +6. Compatible with other SYCL extensions and features, e.g kernel fusion & + built-in kernels. +7. Ability to record a graph with commands submitted to different devices in the + same context. +8. A graph constructed using a device queue may be executed on another compatible + queue. +9. Capability to serialize graphs to a binary format which can then be + de-serialized and executed. This is helpful for offline cases where a graph + can be created by an offline tool to be loaded and run without the end-user + incurring the overheads of graph creation. +10. Backend interoperability, the ability to retrieve a native graph object from + the graph and use that in a native backend API. + +To allow for prototype implementations of this extension to be developed +quickly for evaluation the scope of this proposal was limited to a subset +of these requirements. In particular, the serialization functionality (9), +backend interoperability (10), and a profiling/debugging interface (3) were +omitted. As these are not easy to abstract over a number of backends without +significant investigation. It is also hoped these features can be exposed as +additive changes to the API, and so in introduced in future versions of the +extension. + +Another reason for deferring a serialize/deserialize API (9) is that its scope +could extend from emitting the graph in a binary format, to emitting a +standardized IR format that enables further device specific graph optimizations. + +Multi-device support (7) is something we are looking into introducing into +the extension, which may result in API changes. + +=== Graph Building Mechanisms + +This extension contains two different API mechanisms for constructing a graph +of commands: + +1. **Explicit graph building API** - Allows users to specify the exact nodes +and edges they want to add to the graph. + +2. **Queue recording API (aka "Record & Replay")** - Introduces state to a +`sycl::queue` such that rather than scheduling commands immediately for +execution, they are added to the graph object instead, with edges based on the +data dependencies of the command group. + +Each of these mechanisms for constructing a graph have their own advantages, so +having both APIs available allows the user to pick the one which is most +suitable for them. The queue recording API allows quicker porting of existing +applications, and can capture work done by a library in the graph. While the +explicit API can better express what data is internal to the graph for +optimization, and dependencies don't need to be inferred. + +== Specification + +=== Feature test macro This extension provides a feature-test macro as described in the core SYCL specification section 6.3.3 "Feature test macros". Therefore, an @@ -61,95 +173,203 @@ Table 1. Values of the `SYCL_EXT_ONEAPI_GRAPH` macro. |1 |Initial extension version. Base features are supported. |=== -== SYCL Graph Terminology +=== SYCL Graph Terminology + +:explicit-memory-ops: https://www.khronos.org/registry/SYCL/specs/sycl-2020/html/sycl-2020.html#subsec:explicitmemory Table 2. Terminology. [%header,cols="1,3"] |=== -|Concept|Description -|graph| Class that stores structured work units and their dependencies. -|node| The unit of work. Can have different attributes. -|edge| Dependency between work units. Happens-before relation. +| Concept | Description + +| Graph +| `command_graph` class that stores structured commands and their dependencies. + +A SYCL graph is a collection of commands (nodes) and their dependencies (edges). +From the SYCL perspective, this graph will be acyclic and directed (DAG) as +users cannot express a cycle in the core SYCL API. + +| Node +| A command, which can have different attributes. + +When recording a queue to construct a graph, nodes in a SYCL graph represent +each of the command group submissions of the program. Each submission +encompasses either one or both of a.) some data movement, b.) a single +asynchronous kernel launch. Nodes cannot define forward edges, only backwards +(i.e. kernels can only create dependencies on things that have already +happened). This means that transparently a node can depend on a previously +recorded graph (sub-graph), which works by creating edges to the individual nodes +in the old graph. Explicit memory operations without kernels, such as a memory +copy, are still classed as nodes under this definition, as the +{explicit-memory-ops}[SYCL 2020 specification states] that these can be seen as +specialized kernels executing on the device. + +In the explicit graph building API, nodes can also represent a memory allocation/free +operation on the device. + +| Edge +| Dependency between commands as a happens-before relationship. + +When recording a queue to construct a graph, an edge in the SYCL graph represents +a data dependency between two nodes. These dependencies are expressed by the user +code through buffer accessors. There is also the partial ability to track USM +data dependencies provided the pointers used in the graph nodes are the same. +With the limitation that a node taking an offset USM pointer input will not be +identified as having an edge to another node taking a pointer input to the base +address of the same USM allocation. + +In the explicit graph building API, `make_edge()` is used to define the dependency +rather than inferring them from data dependencies. |=== -== Node +=== API Modifications -Node is a class that encapsulates tasks like SYCL kernel functions or host tasks for deferred execution. -A graph has to be created first, the structure of a graph is defined second by adding nodes and edges. +[source, c++] +---- +namespace sycl { +namespace ext::oneapi::experimental { + +// State of a queue, returned by info::queue::state +enum class queue_state { + executing, + recording +}; + +class node {}; + +void make_edge(node sender, node receiver); + +// State of a graph +enum class graph_state { + modifiable, + executable +}; + +// New object representing graph +template +class command_graph {}; + +template<> +class command_graph { +public: + command_graph(const property_list &propList = {}); + command_graph finalize(context &syclContext) const; + + node add(const std::vector& dep = {}); + + template + node add(T cgf, const std::vector& dep = {}); + + node add_malloc_device(void *&data, size_t numBytes, const std::vector& dep = {}); + node add_free(void *data, const std::vector& dep = {}); +}; + +template<> +class command_graph { +public: + command_graph() = delete; + void update(const command_graph &graph); +}; +} // namespace ext::oneapi::experimental + +// New methods added to the sycl::queue class +using namespace ext::oneapi::experimental; +class queue { +public: + bool begin_recording(command_graph &graph); + bool end_recording(); + event submit(command_graph graph); +}; +} // namespace sycl +---- + +=== Node + +Node is a class that encapsulates tasks like SYCL kernel functions or host tasks +for deferred execution. A graph has to be created first, the structure of a +graph is defined second by adding nodes and edges. [source,c++] ---- namespace sycl::ext::oneapi::experimental { - - class node{ - }; + class node {}; } ---- -== Edge +=== Edge -A dependency between two nodes representing a happens-before relationship. `sender` and `receiver` may be associated to different graphs. +A dependency between two nodes representing a happens-before relationship. +`sender` and `receiver` may be associated to different graphs. [source,c++] ---- namespace sycl::ext::oneapi::experimental { - - // Adding dependency between two nodes. - void make_edge(node sender, node receiver); + void make_edge(node sender, node receiver); // Adds a dependency between two nodes } ---- -== Graph +Parameters: -Graph is a class that represents a directed acyclic graph of nodes. -A graph can have different states, can be nested, can have multiple root nodes that are scheduled for execution first and multiple leaf nodes that are scheduled for execution last. The execution of a graph has been completed when all leaf node tasks have been completed. -Member functions as listed in Table 3 to 6 can be used to add nodes to a graph. +* `sender` - Node which will be a dependency of `receiver`. -[source,c++] ----- -namespace sycl::ext::oneapi::experimental { +* `receiver` - Node which will be dependent on `sender`. - enum class graph_state{ - modifiable, - executable - }; - - template - class command_graph { - public: - operator command_graph(); - }; - - template<> - class command_graph{ - public: - command_graph() = delete; - }; - -} +Exceptions: ----- +* TODO - Throw if this introduces a cycle? -The following member functions are added to the queue class. +=== Graph -[source,c++] ----- +:crs: https://www.khronos.org/registry/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:reference-semantics -namespace sycl { +This extension adds a new `command_graph` object which follows the +{crs}[common reference semantics] of other SYCL runtime objects. -event queue::submit(const ext::oneapi::experimental::command_graph& my_graph); +`command_graph` is a class that represents a directed acyclic graph of nodes. A +graph can have different states, can be nested, can have multiple root nodes +that are scheduled for execution first and multiple leaf nodes that are +scheduled for execution last. The execution of a graph has been completed when +all leaf node tasks have been completed. -} // namespace sycl +A `command_graph` is built up by either recording queue submissions or +explicitly adding nodes, then once the user is happy that the graph is complete, +the graph instance is finalized into an executable variant which can have no +more nodes added to it. Finalization may be a computationally expensive +operation as the runtime is able to perform optimizations based on the graph +structure. After finalization the graph can be submitted for execution on a +queue one or more times with reduced overhead. ----- +==== Graph State + +An instance of a `command_graph` object can be in one of two states: + +* **Modifiable** - Graph is under construction and new nodes may be added to it. +* **Executable** - Graph topology is fixed after finalization and graph is ready to + be submitted for execution. -=== Executable Graph +A `command_graph` object is constructed in the _recording_ state and is made +_executable_ by the user invoking `command_graph::finalize()` to create a +new executable instance of the graph. An executable graph cannot be converted +to a modifiable graph. After finalizing a graph in the modifiable state it is +valid for a user to add additional nodes and finalize again to create subsequent +executable graphs. The state of a `command_graph` object is made explicit by +templating on state to make the class strongly typed, with the default template +argument being `graph_state::modifiable` to reduce code verbosity on +construction. -A `command_graph` object in `graph_state::executable` represents a user generated device and context specific execution object that is submitted to a queue for execution. -The structure of such a `command_graph` object in this state is immutable and cannot be changed, so are the tasks assigned with each node. -Support of submitting a graph for execution, before a previous execution has been completed is backend specific. The runtime may throw an error. +.Graph State Diagram +image::images/sycl_ext_oneapi_graph/command_graph-state.svg[] -=== Graph member and helper functions +==== Executable Graph Update + +A graph in the executable state can have each nodes inputs & outputs updated +using the `command_graph::update()` method. This takes a graph in the +modifiable state and updates the executable graph to use the node input & +outputs of the modifiable graph, a technique called _Whole Graph Update_. The +modifiable graph must have the same topology as the graph originally used to +create the executable graphs, with the nodes added in the same order. + +==== Graph Member Functions Table 3. Constructor of the `command_graph` class. [cols="2a,a"] @@ -159,10 +379,21 @@ Table 3. Constructor of the `command_graph` class. | [source,c++] ---- -/* available only when graph_state == modifiable */` -command_graph(); +command_graph(const property_list &propList = {}); ---- -|Creates a `command_graph` object. +|Creates a SYCL `command_graph` object in the modifiable state. +Zero or more properties can be provided to the constructed SYCL `command_graph` +via an instance of `property_list`. + +Preconditions: + +* This constructor is only available when the `command_graph` state is + `graph_state::modifiable`. + +Parameters: + +* `propList` - Optional parameter for passing properties. No new properties are + defined by this extension. |=== @@ -176,26 +407,65 @@ Table 4. Member functions of the `command_graph` class. ---- node add(const std::vector& dep = {}); ---- -|This creates an empty node which is associated to no task. Its intended use is either a connection point inside a graph between groups of nodes, and can significantly reduce the number of edges ( O(n) vs. O(n^2) ). Another use-case is building the structure of a graph first and adding tasks later. +|This creates an empty node which is associated to no task. Its intended use is +either a connection point inside a graph between groups of nodes, and can +significantly reduce the number of edges ( O(n) vs. O(n^2) ). Another use-case +is building the structure of a graph first and adding tasks later. + +Parameters: + +* `dep` - Nodes the created node will be dependent on. + +Returns: The empty node which has been added to the graph. | [source,c++] ---- template - node add(T cgf, const std::vector& dep = {}); +node add(T cgf, const std::vector& dep = {}); ---- -|This function adds a command group function object to a graph. The function object can contain single or multiple commands such as a host task which is scheduled by the SYCL runtime or a SYCL function for invoking kernels with all restrictions that apply as described in the spec. +|This function adds a command group function object to a graph. The function +object can contain single or multiple commands such as a host task which is +scheduled by the SYCL runtime or a SYCL function for invoking kernels with all +restrictions that apply as described in the core specification. + +Parameters: + +* `cgf` - Command group function object to be added as a node + +* `dep` - Nodes the created node will be dependent on. + +Returns: The command-group function object node which has been added to the graph. | [source,c++] ---- command_graph finalize(context &syclContext) const; ---- -| This function creates an executable graph object with an immutable topology that can be executed on a queue that matches the given context. +|Synchronous operation that creates a graph in the executable state with a +fixed topology that can be submitted for execution on any queue sharing the +supplied context. It is valid to call this method multiple times to create +subsequent executable graphs. It is also valid to continue to add new nodes to +the modifiable graph instance after calling this function. It is valid to +finalize an empty graph instance with no recorded commands. + +Preconditions: + +* This member function is only available when the `command_graph` state is + `graph_state::modifiable`. + +Parameters: + +* `syclContext` - The context asscociated with the queues to which the + executable graph will be able to be submitted. + +Returns: An executable graph object which can be submitted to a queue. |=== -Memory that is allocated by the following functions is owned by the specific graph. When freed inside the graph, the memory is only accessible before the `free` node is executed and after the `malloc` node is executed. +Memory that is allocated by the following functions is owned by the specific +graph. When freed inside the graph, the memory is only accessible before the +`free` node is executed and after the `malloc` node is executed. Table 5. Member functions of the `command_graph` class (memory operations). [cols="2a,a"] @@ -209,6 +479,16 @@ node add_malloc_device(void *&data, size_t numBytes, const std::vector& de ---- |Adding a node that encapsulates a `malloc` operation. +Parameters: + +* `data` - Return parameter set to the address of memory allocated. + +* `numBytes` - Size in bytes to allocate. + +* `dep` - Nodes the created node will be dependent on. + +Returns: The memory allocation node which has been added to the graph + | [source,c++] ---- @@ -216,13 +496,274 @@ node add_free(void *data, const std::vector& dep = {}); ---- |Adding a node that encapsulates a `free` operation. +Parameters: + +* `data` - Address of memory to free. + +* `dep` - Nodes the created node will be dependent on. + +Returns: The memory freeing node which has been added to the graph. + +Exceptions: + +* TODO - Throw if not allocated by `add_malloc_device`? +* TODO - Throw if already freed? +* TODO - Throw if not valid address? + +|=== + +Table 6. Member functions of the `command_graph` class (executable graph update). +[cols="2a,a"] |=== +|Member function|Description + +| +[source, c++] +---- +void command_graph update(const command_graph &graph); +---- + +|Updates the executable graph node inputs & outputs from a topologically +identical modifiable graph. The effects of the update will be visible +on the next submission of the executable graph without the need for additional +user synchronization. + +Parameters: + +* `graph` - Modifiable graph object to update graph node inputs & outputs with. + This graph must have the same topology as the original graph used on + executable graph creation. + +Preconditions: + +* This member function is only available when the `command_graph` state is + `graph_state::executable`. + +Exceptions: + +* Throws synchronously with error code `invalid` if the topology of `graph` is + not the same as the existing graph topology, or if the nodes were not added in + the same order. +|=== + +=== Queue Class Modifications + +:queue-class: https://www.khronos.org/registry/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:interface.queue.class + +This extension modifies the {queue-class}[SYCL queue class] such that +<> is introduced to queue objects, allowing an instance to be +put into a mode where command-groups are recorded to a graph rather than +submitted immediately for execution. + +<> are also added to the +`sycl::queue` class with this extension. Two functions for selecting the state +of the queue, and another function for submitting a graph to the queue. + +==== Queue State + +:queue-info-table: https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#table.queue.info + +The `sycl::queue` object can be in either of two states. The default +`queue_state::executing` state is where the queue has its normal semantics of +submitted command-groups being immediately scheduled for asynchronous execution. + +The alternative `queue_state::recording` state is used for graph construction. +Instead of being scheduled for execution, command-groups submitted to the queue +are recorded to a graph object as new nodes for each submission. After recording +has finished and the queue returns to the executing state, the recorded commands are +not then executed, they are transparent to any following queue operations. + +.Queue State +image::images/sycl_ext_oneapi_graph/queue-state.svg[] + +The state of a queue can be queried with `queue::get_info` using template +parameter `info::queue::state`. The following entry is added to the +{queue-info-table}[queue info table] to define this query: + +Table 7. Queue info query +[cols="2a,a,a"] +|=== +| Queue Descriptors | Return Type | Description + +| `info::queue::state` +| `ext::oneapi::experimental::queue_state` +| Returns the state of the queue + +|=== + +A default constructed event is returned when a user submits a command-group to +a queue in the recording state. These events have status +`info::event_command_status::complete` and a user waiting on them will return +immediately. + +==== Queue Properties + +:queue-properties: https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:queue-properties + +There are {queue-properties}[two properties] defined by the core SYCL +specification that can be passed to a `sycl::queue` on construction via the +property list parameter. They interact with this extension in the following +ways: + +1. `property::queue::in_order` - When a queue is created with the in-order + property, recording its operations results in a straight-line graph, as each + operation has an implicit dependency on the previous operation. However, + a graph submitted to an in-order queue will keep its existing structure such + that the complete graph executes in-order with respect to the other + command-groups submitted to the queue. + +2. `property::queue::enable_profiling` - This property has no effect on graph + recording. When set on the queue a graph is submitted to however, it allows + profiling information to be obtained from the event returned by a graph + submission. + +For any other queue property that is defined by an extension, it is the +responsibility of the extension to define the relationship between that queue +property and this graph extension. + +==== New Queue Member Functions + +Table 8. Additional member functions of the `sycl::queue` class. +[cols="2a,a"] +|=== +|Member function|Description + +| +[source, c++] +---- +using namespace ext::oneapi::experimental; +bool queue::begin_recording(command_graph &graph) +---- + +|Synchronously changes the state of the queue to the `queue_state::recording` +state. + +Parameters: + +* `graph` - Graph object to start recording commands to. + +Returns: `true` if the queue was previously in the `queue_state::executing` +state, `false` otherwise. + +Exceptions: + +* Throws synchronously with error code `invalid` if the queue is already + recording to a different graph. + +| +[source, c++] +---- +bool queue::end_recording() +---- + +|Synchronously changes the state of the queue to the `queue_state::executing` +state. + +Returns: `true` if the queue was previously in the `queue_state::recording` +state, `false` otherwise. + +| +[source,c++] +---- +using namespace ext::oneapi::experimental; +event queue::submit(command_graph graph) +---- + +|When invoked with the queue in the `queue_state::recording` state, a graph is +added as a subgraph node. When invoked with the queue in the default +`queue_state::executing` state, the graph is submitted for execution. Support +for submitting a graph for execution, before a previous execution has been +completed is backend specific. The runtime may throw an error. + +Parameters: + +* `graph` - Graph object to start recording commands to. + +When the queue is in the execution state, an `event` object used to synchronize +graph submission with other command-groups is returned. Otherwise the queue is +in the recording state, and a default event is returned. +|=== + +=== Thread Safety + +The new functions in this extension are thread-safe, the same as member +functions of classes in the base SYCL specification. If user code does +not perform synchronisation between two threads accessing the same queue, +there is no strong ordering between events on that queue, and the kernel +submissions, recording and finalization will happen in an undefined order. + +In particular, when one thread ends recording on a queue while another +thread is submitting work, which kernels will be part of the subsequent +graph is undefined. If user code enforces a total order on the queue +events, then the behaviour is well-defined, and will match the observable +total order. + +The returned value from the `info::queue::state` should be considered +immediately stale in multi-threaded usage, as another thread could have +preemptively changed the state of the queue. + +=== Error Handling + +Errors are reported through exceptions, as usual in the SYCL API. For new APIs, +submitting a graph for execution can generate unspecified asynchronous errors, +while `command_graph::finalize()` may throw unspecified synchronous exceptions. +Synchronous exception errors codes are defined for both +`queue::begin_recording()` and `command_graph::update()`. + +When a queue is in recording mode asynchronous exceptions will not be +generated, as no device execution is occuring. Synchronous errors specified as +being thrown in the default queue executing state, will still be thrown when a +queue is in the recording state. + +The `queue::begin_recording` and `queue::end_recording` entry-points return a +`bool` value informing the user whether a state change occurred. False is +returned rather than throwing an exception when state isn't changed. This design +is because the queue is already in the state the user desires, so if the +function threw an exception in this case, the application would likely swallow +it and then proceed. + +While a queue is in the recording state, methods performed on that queue which +are not command submissions behave as normal. This includes waits, throws, and +queries on the queue. These are all ignored by the graph system, as opposed to +throwing an exception when in queue recording mode. This is because otherwise +there would be no thread safe way for a user to check they could call these +functions without throwing, as a query about the state of the queue may be +immediately stale. + +* TODO - error on add_node while being recorded to a queue? or queue recording a + graph with explicitly build nodes? + +=== Storage Lifetimes + +The lifetime of any buffer recorded as part of a submission +to a command graph will be extended in keeping with the common reference +semantics and buffer synchronization rules in the SYCL specification. It will be +extended either for the lifetime of the graph (including both modifiable graphs +and the executable graphs created from them) or until the buffer is no longer +required by the graph (such as after being replaced through executable graph update). + +=== Host Tasks + +:host-task: https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#subsec:interfaces.hosttasks + +A {host-task}[host task] is a native C++ callable, scheduled according to SYCL +dependency rules. It is valid to record a host task as part of graph, though it +may lead to sub-optimal graph performance because a host task node may prevent +the SYCL runtime from submitting the whole graph to the device at once. + +Host tasks can be updated as part of <> +by replacing the whole node with the new callable. == Examples -// NOTE: The examples below demonstrate intended usage of the extension, but are not compatible with the proof-of-concept implementation. The proof-of-concept implementation currently requires different syntax, as described in the "Non-implemented features" section at the end of this document. +[NOTE] +==== +The examples below demonstrate intended usage of the extension, but may not be +compatible with the proof-of-concept implementation, as the proof-of-concept +implementation is currently under development. +==== -1. Dot product +=== Dot Product [source,c++] ---- @@ -241,7 +782,7 @@ int main() { sycl::ext::oneapi::experimental::command_graph g; float *x , *y, *z; - + float *dotp = sycl::malloc_shared(1, q); auto n_x = g.add_malloc_device(x, n); @@ -282,7 +823,7 @@ int main() { }); }, {node_a, node_b}); - + auto node_f1 = g.add_free(x, {node_c}); auto node_f2 = g.add_free(y, {node_b}); @@ -301,6 +842,85 @@ int main() { ... ---- +=== Diamond Dependency + +The following snippet of code shows how a SYCL `queue` can be put into a +recording state, which allows a `command_graph` object to be populated by the +command-groups submitted to the queue. Once the graph is complete, recording +finishes on the queue to put it back into the default executing state. The +graph is then finalized so that no more nodes can be added. Lastly, the graph is +submitted as a whole for execution via +`queue::submit(command_graph)`. + +[source, c++] +---- + queue q{default_selector{}}; + + // New object representing graph of command-groups + ext::oneapi::experimental::command_graph graph; + { + buffer bufferA{dataA.data(), range<1>{elements}}; + buffer bufferB{dataB.data(), range<1>{elements}}; + buffer bufferC{dataC.data(), range<1>{elements}}; + + // `q` will be put in the recording state where commands are recorded to + // `graph` rather than submitted for execution immediately. + q.begin_recording(graph); + + // Record commands to `graph` with the following topology. + // + // increment_kernel + // / \ + // A->/ A->\ + // / \ + // add_kernel subtract_kernel + // \ / + // B->\ C->/ + // \ / + // decrement_kernel + + q.submit([&](handler &cgh) { + auto pData = bufferA.get_access(cgh); + cgh.parallel_for(range<1>(elements), + [=](item<1> id) { pData[id]++; }); + }); + + q.submit([&](handler &cgh) { + auto pData1 = bufferA.get_access(cgh); + auto pData2 = bufferB.get_access(cgh); + cgh.parallel_for(range<1>(elements), + [=](item<1> id) { pData2[id] += pData1[id]; }); + }); + + q.submit([&](handler &cgh) { + auto pData1 = bufferA.get_access(cgh); + auto pData2 = bufferC.get_access(cgh); + cgh.parallel_for( + range<1>(elements), [=](item<1> id) { pData2[id] -= pData1[id]; }); + }); + + q.submit([&](handler &cgh) { + auto pData1 = bufferB.get_access(cgh); + auto pData2 = bufferC.get_access(cgh); + cgh.parallel_for(range<1>(elements), [=](item<1> id) { + pData1[id]--; + pData2[id]--; + }); + }); + + // queue will be returned to the executing state where commands are + // submitted immediately for extension. + q.end_recording(); + } + + // Finalize the modifiable graph to create an executable graph that can be + // submitted for execution. + auto exec_graph = graph.finalize(q.get_context()); + + // Execute graph + q.submit(exec_graph); +---- + // == Issues for later investigations // // . Explicit memory movement can cause POC to stall. @@ -314,6 +934,37 @@ int main() { // . Submit overload of a queue. `submit(graph)` Use a combination of `executable_graph::exec_and_wait()` and queue property `sycl::ext::oneapi::property::queue::lazy_execution{}` instead. // . `class graph` Use dedicated `class graph` (equivalent to `graph_state == modifiable`) and `class executable_graph` (equivalent to `graph_state == executable`) instead. +== Issues + +=== Multi Device Graph + +Allow an executable graph to contain nodes targeting different devices. + +**Outcome:** Under consideration + +=== Record & Replay: Mark Internal Memory + +When a graph is created by recording a queue there is no way to tag memory +objects internal to the graph, which would enable optimizations on the internal +memory. Do we need an interface record & replay can use to identify buffers and +USM allocations not used outside of the graph? + +**Outcome:** Unresolved + +=== Executable Graph Update + +Is there a ML usecase (e.g pytorch workload) which justifies the inclusion of +this feature in the extension. + +**Outcome:** Unresolved + +=== Graph Submimssion Synchronization + +Should we provide a mechanism for a graph submission to depend on other graph +submission events or any arbitrary sycl event? + +**Outcome:** Unresolved + == Revision History [cols="5,15,15,70"] @@ -325,4 +976,5 @@ int main() { |2|2022-03-11|Pablo Reble|Incorporate feedback from PR |3|2022-05-25|Pablo Reble|Extend API and Example |4|2022-08-10|Pablo Reble|Adding USM shortcuts +|5|2022-10-21|Ewan Crawford|Merge in Codeplay vendor extension |======================================== From 30858e9f606935896c07dd5bae184e3df8fda3d7 Mon Sep 17 00:00:00 2001 From: Ben Tracy Date: Tue, 25 Oct 2022 14:43:52 +0100 Subject: [PATCH 02/11] Update finalize() to use const ref context Makes the context accepted by the command_graph::finalize() method const, consistent with most other usages in the SYCL spec. --- sycl/doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/sycl/doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc b/sycl/doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc index b1d3c70f47e6a..56d3fa3e21b2c 100644 --- a/sycl/doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc +++ b/sycl/doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc @@ -253,7 +253,7 @@ template<> class command_graph { public: command_graph(const property_list &propList = {}); - command_graph finalize(context &syclContext) const; + command_graph finalize(const context &syclContext) const; node add(const std::vector& dep = {}); @@ -440,7 +440,7 @@ Returns: The command-group function object node which has been added to the grap | [source,c++] ---- -command_graph finalize(context &syclContext) const; +command_graph finalize(const context &syclContext) const; ---- |Synchronous operation that creates a graph in the executable state with a From 145d3dbd92fc4246d5c181d1fcb538d05642f41c Mon Sep 17 00:00:00 2001 From: Ewan Crawford Date: Tue, 1 Nov 2022 10:53:32 +0000 Subject: [PATCH 03/11] Remove TODOs These are now tracked in GitHub Issues: * [Error on cycle](https://github.com/reble/llvm/issues/12) * [Mixing graph building mechanisms](https://github.com/reble/llvm/issues/11) * [add_free errors](https://github.com/reble/llvm/issues/10) --- .../proposed/sycl_ext_oneapi_graph.asciidoc | 13 ------------- 1 file changed, 13 deletions(-) diff --git a/sycl/doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc b/sycl/doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc index 56d3fa3e21b2c..462e7c62b8b74 100644 --- a/sycl/doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc +++ b/sycl/doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc @@ -314,10 +314,6 @@ Parameters: * `receiver` - Node which will be dependent on `sender`. -Exceptions: - -* TODO - Throw if this introduces a cycle? - === Graph :crs: https://www.khronos.org/registry/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:reference-semantics @@ -504,12 +500,6 @@ Parameters: Returns: The memory freeing node which has been added to the graph. -Exceptions: - -* TODO - Throw if not allocated by `add_malloc_device`? -* TODO - Throw if already freed? -* TODO - Throw if not valid address? - |=== Table 6. Member functions of the `command_graph` class (executable graph update). @@ -730,9 +720,6 @@ there would be no thread safe way for a user to check they could call these functions without throwing, as a query about the state of the queue may be immediately stale. -* TODO - error on add_node while being recorded to a queue? or queue recording a - graph with explicitly build nodes? - === Storage Lifetimes The lifetime of any buffer recorded as part of a submission From f90877e28a1a559eead3be75bdfb549b9ebec0ad Mon Sep 17 00:00:00 2001 From: Ewan Crawford Date: Wed, 2 Nov 2022 09:56:50 +0000 Subject: [PATCH 04/11] Remove compatible queue requirement Remove the requirement for the graphs extension to support a graph constructed using a device queue being executed on another compatible queue. This may not be possible in the DPC++ prototype based on lazy queues. --- .../proposed/sycl_ext_oneapi_graph.asciidoc | 12 +++++------- 1 file changed, 5 insertions(+), 7 deletions(-) diff --git a/sycl/doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc b/sycl/doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc index 462e7c62b8b74..1ff1872279d63 100644 --- a/sycl/doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc +++ b/sycl/doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc @@ -109,25 +109,23 @@ requirements were considered: built-in kernels. 7. Ability to record a graph with commands submitted to different devices in the same context. -8. A graph constructed using a device queue may be executed on another compatible - queue. -9. Capability to serialize graphs to a binary format which can then be +8. Capability to serialize graphs to a binary format which can then be de-serialized and executed. This is helpful for offline cases where a graph can be created by an offline tool to be loaded and run without the end-user incurring the overheads of graph creation. -10. Backend interoperability, the ability to retrieve a native graph object from +9. Backend interoperability, the ability to retrieve a native graph object from the graph and use that in a native backend API. To allow for prototype implementations of this extension to be developed quickly for evaluation the scope of this proposal was limited to a subset -of these requirements. In particular, the serialization functionality (9), -backend interoperability (10), and a profiling/debugging interface (3) were +of these requirements. In particular, the serialization functionality (8), +backend interoperability (9), and a profiling/debugging interface (3) were omitted. As these are not easy to abstract over a number of backends without significant investigation. It is also hoped these features can be exposed as additive changes to the API, and so in introduced in future versions of the extension. -Another reason for deferring a serialize/deserialize API (9) is that its scope +Another reason for deferring a serialize/deserialize API (8) is that its scope could extend from emitting the graph in a binary format, to emitting a standardized IR format that enables further device specific graph optimizations. From 29a630d0c8e8586da3e7d02fd0a5a271cad318a7 Mon Sep 17 00:00:00 2001 From: Ewan Crawford Date: Thu, 3 Nov 2022 09:39:58 +0000 Subject: [PATCH 05/11] Make motivation wording more generic Co-authored-by: Pablo Reble --- sycl/doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/sycl/doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc b/sycl/doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc index 1ff1872279d63..2e54950df1044 100644 --- a/sycl/doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc +++ b/sycl/doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc @@ -88,8 +88,8 @@ As well as benefits to the SYCL runtime, there are also advantages to the user developing SYCL applications, as repetitive workloads no longer have to redundantly issue the same sequence of commands. Instead, a graph is only constructed once and submitted for execution as many times as is necessary, only -changing the data in input buffers or USM allocations. For machine learning -applications where the same command group pattern is run repeatedly for +changing the data in input buffers or USM allocations. For applications from specific domains, such as machine learning, +where the same command group pattern is run repeatedly for different inputs, this is particularly useful. === Requirements @@ -148,7 +148,7 @@ data dependencies of the command group. Each of these mechanisms for constructing a graph have their own advantages, so having both APIs available allows the user to pick the one which is most suitable for them. The queue recording API allows quicker porting of existing -applications, and can capture work done by a library in the graph. While the +applications, and can capture external work that is submitted to a queue, for example via library function calls. While the explicit API can better express what data is internal to the graph for optimization, and dependencies don't need to be inferred. From cf08b5333434230e64e1825d83b49a146accf177 Mon Sep 17 00:00:00 2001 From: Ewan Crawford Date: Thu, 3 Nov 2022 13:28:19 +0000 Subject: [PATCH 06/11] Split up graph terminology section Break the graph terminology section down into a generic description of a graph, nodes, and edges. Followed by subsections for how that is realized in the explicit API and record & replay API individually. --- .../proposed/sycl_ext_oneapi_graph.asciidoc | 106 +++++++++++------- 1 file changed, 66 insertions(+), 40 deletions(-) diff --git a/sycl/doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc b/sycl/doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc index 2e54950df1044..cee1737074aec 100644 --- a/sycl/doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc +++ b/sycl/doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc @@ -88,9 +88,9 @@ As well as benefits to the SYCL runtime, there are also advantages to the user developing SYCL applications, as repetitive workloads no longer have to redundantly issue the same sequence of commands. Instead, a graph is only constructed once and submitted for execution as many times as is necessary, only -changing the data in input buffers or USM allocations. For applications from specific domains, such as machine learning, -where the same command group pattern is run repeatedly for -different inputs, this is particularly useful. +changing the data in input buffers or USM allocations. For applications from +specific domains, such as machine learning, where the same command group pattern +is run repeatedly for different inputs, this is particularly useful. === Requirements @@ -148,9 +148,10 @@ data dependencies of the command group. Each of these mechanisms for constructing a graph have their own advantages, so having both APIs available allows the user to pick the one which is most suitable for them. The queue recording API allows quicker porting of existing -applications, and can capture external work that is submitted to a queue, for example via library function calls. While the -explicit API can better express what data is internal to the graph for -optimization, and dependencies don't need to be inferred. +applications, and can capture external work that is submitted to a queue, for +example via library function calls. While the explicit API can better express +what data is internal to the graph for optimization, and dependencies don't need +to be inferred. == Specification @@ -181,43 +182,68 @@ Table 2. Terminology. | Concept | Description | Graph -| `command_graph` class that stores structured commands and their dependencies. - -A SYCL graph is a collection of commands (nodes) and their dependencies (edges). -From the SYCL perspective, this graph will be acyclic and directed (DAG) as -users cannot express a cycle in the core SYCL API. +| A directed and acyclic graph (DAG) of commands (nodes) and their dependencies +(edges), represented by the `command_graph` class. | Node | A command, which can have different attributes. -When recording a queue to construct a graph, nodes in a SYCL graph represent -each of the command group submissions of the program. Each submission -encompasses either one or both of a.) some data movement, b.) a single -asynchronous kernel launch. Nodes cannot define forward edges, only backwards -(i.e. kernels can only create dependencies on things that have already -happened). This means that transparently a node can depend on a previously -recorded graph (sub-graph), which works by creating edges to the individual nodes -in the old graph. Explicit memory operations without kernels, such as a memory -copy, are still classed as nodes under this definition, as the -{explicit-memory-ops}[SYCL 2020 specification states] that these can be seen as -specialized kernels executing on the device. - -In the explicit graph building API, nodes can also represent a memory allocation/free -operation on the device. - | Edge | Dependency between commands as a happens-before relationship. -When recording a queue to construct a graph, an edge in the SYCL graph represents -a data dependency between two nodes. These dependencies are expressed by the user -code through buffer accessors. There is also the partial ability to track USM -data dependencies provided the pointers used in the graph nodes are the same. -With the limitation that a node taking an offset USM pointer input will not be -identified as having an edge to another node taking a pointer input to the base -address of the same USM allocation. +|=== + +==== Explicit Graph Building API + +When using the explicit graph building API to construct a graph, nodes and +edges are captured as follows. + +Table 3. Explicit Graph Definition. +[%header,cols="1,3"] +|=== +| Concept | Description -In the explicit graph building API, `make_edge()` is used to define the dependency -rather than inferring them from data dependencies. +| Node +| In the explicit graph building API nodes are created by the user invoking +methods on a modifiable graph. Each node represent either a command-group +function, empty operation, or device memory allocation/free. + +| Edge +| In the explicit graph building API edges are defined by the user. This is +either through buffer accessors, the `make_edge()` free function, or by passing +dependent nodes on creation of a new node. +|=== + +==== Queue Recording API + +When using the record & replay API to construct a graph by recording a queue, +nodes and edges are captured as follows. + +Table 4. Recorded Graph Definition. +[%header,cols="1,3"] +|=== +| Concept | Description + +| Node +| Nodes in a queue recorded graph represent each of the command group +submissions of the program. Each submission encompasses either one or both of +a.) some data movement, b.) a single asynchronous kernel launch. Nodes cannot +define forward edges, only backwards (i.e. kernels can only create dependencies +on things that have already happened). This means that transparently a node can +depend on a previously recorded graph (sub-graph), which works by creating edges +to the individual nodes in the old graph. Explicit memory operations without +kernels, such as a memory copy, are still classed as nodes under this +definition, as the {explicit-memory-ops}[SYCL 2020 specification states] that +these can be seen as specialized kernels executing on the device. + +| Edge +| An edge in a queue recorded graph represents a data dependency between two +nodes. These dependencies are expressed by the user code through buffer +accessors. There is also the partial ability to track USM data dependencies +provided the pointers used in the graph nodes are the same. With the limitation +that a node taking an offset USM pointer input will not be identified as having +an edge to another node taking a pointer input to the base address of the same +USM allocation. |=== === API Modifications @@ -365,7 +391,7 @@ create the executable graphs, with the nodes added in the same order. ==== Graph Member Functions -Table 3. Constructor of the `command_graph` class. +Table 5. Constructor of the `command_graph` class. [cols="2a,a"] |=== |Constructor|Description @@ -391,7 +417,7 @@ Parameters: |=== -Table 4. Member functions of the `command_graph` class. +Table 6. Member functions of the `command_graph` class. [cols="2a,a"] |=== |Member function|Description @@ -461,7 +487,7 @@ Memory that is allocated by the following functions is owned by the specific graph. When freed inside the graph, the memory is only accessible before the `free` node is executed and after the `malloc` node is executed. -Table 5. Member functions of the `command_graph` class (memory operations). +Table 7. Member functions of the `command_graph` class (memory operations). [cols="2a,a"] |=== |Member function|Description @@ -500,7 +526,7 @@ Returns: The memory freeing node which has been added to the graph. |=== -Table 6. Member functions of the `command_graph` class (executable graph update). +Table 8. Member functions of the `command_graph` class (executable graph update). [cols="2a,a"] |=== |Member function|Description @@ -568,7 +594,7 @@ The state of a queue can be queried with `queue::get_info` using template parameter `info::queue::state`. The following entry is added to the {queue-info-table}[queue info table] to define this query: -Table 7. Queue info query +Table 9. Queue info query [cols="2a,a,a"] |=== | Queue Descriptors | Return Type | Description From c416ab36e252297451704923243ff3530acf6ee5 Mon Sep 17 00:00:00 2001 From: Ewan Crawford Date: Mon, 7 Nov 2022 13:59:31 +0000 Subject: [PATCH 07/11] Specify behaviour when mixing APIs Add wording on whether a user can combine the explicit graph building API with the record & replay API on the same modifiable graph object. It is specified as being allowed for the user to mix mechanisms, so long as the two mechanisms are used sequentially. However, it is forbidden if the mechanisms are interleaved and an exception must be thrown by the implementation. We decided this because it is not specified in the record & replay API whether commands are added to the graph eagerly during recording, or on `queue::end_recording`. When each mechanism is used sequentially however, the ordering of nodes being added is well defined. See Issue https://github.com/reble/llvm/issues/11 --- .../proposed/sycl_ext_oneapi_graph.asciidoc | 30 +++++++++++++++++++ 1 file changed, 30 insertions(+) diff --git a/sycl/doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc b/sycl/doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc index cee1737074aec..afc5dd3c6eff1 100644 --- a/sycl/doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc +++ b/sycl/doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc @@ -153,6 +153,11 @@ example via library function calls. While the explicit API can better express what data is internal to the graph for optimization, and dependencies don't need to be inferred. +It is valid to combine these two mechanisms sequentially when constructing a +graph, however it is not valid to concurrently use them. An error will be thrown +if a user attempts to use the explicit API to add a node to a graph which is +being recorded to by a queue. + == Specification === Feature test macro @@ -338,6 +343,11 @@ Parameters: * `receiver` - Node which will be dependent on `sender`. +Exceptions: + +* Throws synchronously with error code `invalid` if a queue is recording + commands to any graph associated with `sender` or `receiver`. + === Graph :crs: https://www.khronos.org/registry/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:reference-semantics @@ -438,6 +448,11 @@ Parameters: Returns: The empty node which has been added to the graph. +Exceptions: + +* Throws synchronously with error code `invalid` if a queue is recording + commands to the graph. + | [source,c++] ---- @@ -457,6 +472,11 @@ Parameters: Returns: The command-group function object node which has been added to the graph. +Exceptions: + +* Throws synchronously with error code `invalid` if a queue is recording + commands to the graph. + | [source,c++] ---- @@ -509,6 +529,11 @@ Parameters: Returns: The memory allocation node which has been added to the graph +Exceptions: + +* Throws synchronously with error code `invalid` if a queue is recording + commands to the graph. + | [source,c++] ---- @@ -524,6 +549,11 @@ Parameters: Returns: The memory freeing node which has been added to the graph. +Exceptions: + +* Throws synchronously with error code `invalid` if a queue is recording + commands to the graph. + |=== Table 8. Member functions of the `command_graph` class (executable graph update). From 835065f3c052e2c142f4f8dc62a821cbcef5028a Mon Sep 17 00:00:00 2001 From: Ewan Crawford Date: Tue, 8 Nov 2022 11:23:02 +0000 Subject: [PATCH 08/11] Address PR feedback * Move `make_edge` from a free function to a member of `command_graph`. * Remove USM limitation from spec wording, as it was an implementation detail. --- .../proposed/sycl_ext_oneapi_graph.asciidoc | 83 ++++++++++--------- 1 file changed, 42 insertions(+), 41 deletions(-) diff --git a/sycl/doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc b/sycl/doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc index afc5dd3c6eff1..142c5149ac81e 100644 --- a/sycl/doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc +++ b/sycl/doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc @@ -154,7 +154,7 @@ what data is internal to the graph for optimization, and dependencies don't need to be inferred. It is valid to combine these two mechanisms sequentially when constructing a -graph, however it is not valid to concurrently use them. An error will be thrown +graph, however it is not valid to use them concurrently. An error will be thrown if a user attempts to use the explicit API to add a node to a graph which is being recorded to by a queue. @@ -215,7 +215,7 @@ function, empty operation, or device memory allocation/free. | Edge | In the explicit graph building API edges are defined by the user. This is -either through buffer accessors, the `make_edge()` free function, or by passing +either through buffer accessors, the `make_edge()` function, or by passing dependent nodes on creation of a new node. |=== @@ -243,12 +243,10 @@ these can be seen as specialized kernels executing on the device. | Edge | An edge in a queue recorded graph represents a data dependency between two -nodes. These dependencies are expressed by the user code through buffer -accessors. There is also the partial ability to track USM data dependencies -provided the pointers used in the graph nodes are the same. With the limitation -that a node taking an offset USM pointer input will not be identified as having -an edge to another node taking a pointer input to the base address of the same -USM allocation. +nodes. Data dependencies can naturally be expressed in user code through buffer +accessors. There is also the ability for the SYCL runtime to track USM +allocation dependencies based on the pointers used in recorded command groups, +and construct an edge between nodes where the same USM allocation is used. |=== === API Modifications @@ -266,8 +264,6 @@ enum class queue_state { class node {}; -void make_edge(node sender, node receiver); - // State of a graph enum class graph_state { modifiable, @@ -291,6 +287,8 @@ public: node add_malloc_device(void *&data, size_t numBytes, const std::vector& dep = {}); node add_free(void *data, const std::vector& dep = {}); + + void make_edge(node sender, node receiver); }; template<> @@ -314,9 +312,10 @@ public: === Node -Node is a class that encapsulates tasks like SYCL kernel functions or host tasks -for deferred execution. A graph has to be created first, the structure of a -graph is defined second by adding nodes and edges. +Node is a class that encapsulates tasks like SYCL kernel functions, device +memory allocations/frees, or host tasks for deferred execution. A graph has to +be created first, the structure of a graph is defined second by adding nodes and +edges. [source,c++] ---- @@ -325,29 +324,6 @@ namespace sycl::ext::oneapi::experimental { } ---- -=== Edge - -A dependency between two nodes representing a happens-before relationship. -`sender` and `receiver` may be associated to different graphs. - -[source,c++] ----- -namespace sycl::ext::oneapi::experimental { - void make_edge(node sender, node receiver); // Adds a dependency between two nodes -} ----- - -Parameters: - -* `sender` - Node which will be a dependency of `receiver`. - -* `receiver` - Node which will be dependent on `sender`. - -Exceptions: - -* Throws synchronously with error code `invalid` if a queue is recording - commands to any graph associated with `sender` or `receiver`. - === Graph :crs: https://www.khronos.org/registry/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:reference-semantics @@ -355,11 +331,9 @@ Exceptions: This extension adds a new `command_graph` object which follows the {crs}[common reference semantics] of other SYCL runtime objects. -`command_graph` is a class that represents a directed acyclic graph of nodes. A -graph can have different states, can be nested, can have multiple root nodes -that are scheduled for execution first and multiple leaf nodes that are -scheduled for execution last. The execution of a graph has been completed when -all leaf node tasks have been completed. +A `command_graph` represents a directed acyclic graph of nodes, where each node +represents a single command or a sub-graph. The execution of a graph completes +when all of its nodes have completed. A `command_graph` is built up by either recording queue submissions or explicitly adding nodes, then once the user is happy that the graph is complete, @@ -477,6 +451,33 @@ Exceptions: * Throws synchronously with error code `invalid` if a queue is recording commands to the graph. +| +[source,c++] +---- +void make_edge(node sender, node receiver); +---- + +|Creates a dependency between two nodes representing a happens-before relationship. + +Preconditions: + +* This member function is only available when the `command_graph` state is + `graph_state::modifiable`. + +Parameters: + +* `sender` - Node which will be a dependency of `receiver`. + +* `receiver` - Node which will be dependent on `sender`. + +Exceptions: + +* Throws synchronously with error code `invalid` if a queue is recording + commands to the graph. + +* Throws synchronously with error code `invalid` if `sender` or `receiver` + are not valid nodes created from the graph. + | [source,c++] ---- From 39258ba6d8fe3b326cbb8d7832d30a9299160528 Mon Sep 17 00:00:00 2001 From: Ewan Crawford Date: Wed, 9 Nov 2022 09:27:34 +0000 Subject: [PATCH 09/11] Add system allocation USM restriction * Abbreviate wording on record & replay edges around USM pointers, and add restriction that offsets into system allocations are not supported. * Fix typo --- .../doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/sycl/doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc b/sycl/doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc index 142c5149ac81e..dc5d95050ab38 100644 --- a/sycl/doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc +++ b/sycl/doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc @@ -244,9 +244,8 @@ these can be seen as specialized kernels executing on the device. | Edge | An edge in a queue recorded graph represents a data dependency between two nodes. Data dependencies can naturally be expressed in user code through buffer -accessors. There is also the ability for the SYCL runtime to track USM -allocation dependencies based on the pointers used in recorded command groups, -and construct an edge between nodes where the same USM allocation is used. +accessors. USM pointers also convey data dependencies, however offsets into +system allocations (`malloc`/`new`) are not supported. |=== === API Modifications @@ -1000,7 +999,7 @@ this feature in the extension. **Outcome:** Unresolved -=== Graph Submimssion Synchronization +=== Graph Submission Synchronization Should we provide a mechanism for a graph submission to depend on other graph submission events or any arbitrary sycl event? From f17605b1dbf8b112809975eacc98a783f02ed4ba Mon Sep 17 00:00:00 2001 From: Ewan Crawford Date: Wed, 9 Nov 2022 13:36:18 +0000 Subject: [PATCH 10/11] Fix graph parameter description in queue::submit --- sycl/doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sycl/doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc b/sycl/doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc index dc5d95050ab38..2cc12c370ed2f 100644 --- a/sycl/doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc +++ b/sycl/doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc @@ -721,7 +721,7 @@ completed is backend specific. The runtime may throw an error. Parameters: -* `graph` - Graph object to start recording commands to. +* `graph` - Graph object to execute. When the queue is in the execution state, an `event` object used to synchronize graph submission with other command-groups is returned. Otherwise the queue is From 87d0baf3162ec30f54ee45cce53875e4e83fb3b2 Mon Sep 17 00:00:00 2001 From: Ewan Crawford Date: Thu, 10 Nov 2022 09:45:42 +0000 Subject: [PATCH 11/11] Address Julian's feedback * Make the use of preconditions and namespaces on the definitions of new functions consistent * Add diagram for topology of dotp example * Tweak language --- .../proposed/sycl_ext_oneapi_graph.asciidoc | 105 +++++++++++------- 1 file changed, 67 insertions(+), 38 deletions(-) diff --git a/sycl/doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc b/sycl/doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc index 2cc12c370ed2f..ed8f4f7075662 100644 --- a/sycl/doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc +++ b/sycl/doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc @@ -59,13 +59,14 @@ not rely on APIs defined in this specification.* == Introduction -Through the use of command groups SYCL is already able to create a DAG of kernel -execution at runtime, as a command group object defines a set of requisites -(edges) which must be satisfied for kernels (nodes) to be executed. However, -because command-group submission is tied to execution on the queue, without -having a prior construction step before starting execution, optimization -opportunities are missed from the runtime not knowing the complete dependency -graph ahead of execution. +Through the use of command groups SYCL is already able to create a dependency +graph (in the form of a directed acyclic graph) of kernel execution at runtime, +as a command group object defines a set of requisites (edges) which must be +satisfied for kernels (nodes) to be executed. However, because command-group +submission is tied to execution on the queue, without having a prior +construction step before starting execution, optimization opportunities are +missed from the runtime not knowing the complete dependency graph ahead of +execution. The following benefits would become possible if the user could define a dependency graph to the SYCL runtime prior to execution: @@ -105,7 +106,7 @@ requirements were considered: 4. Integrate sub-graphs (previously constructed graphs) when constructing a new graph. 5. Support the USM model of memory as well as buffer model. -6. Compatible with other SYCL extensions and features, e.g kernel fusion & +6. Compatible with other SYCL extensions and features, e.g. kernel fusion & built-in kernels. 7. Ability to record a graph with commands submitted to different devices in the same context. @@ -122,7 +123,7 @@ of these requirements. In particular, the serialization functionality (8), backend interoperability (9), and a profiling/debugging interface (3) were omitted. As these are not easy to abstract over a number of backends without significant investigation. It is also hoped these features can be exposed as -additive changes to the API, and so in introduced in future versions of the +additive changes to the API, and thus introduced in future versions of the extension. Another reason for deferring a serialize/deserialize API (8) is that its scope @@ -382,6 +383,7 @@ Table 5. Constructor of the `command_graph` class. | [source,c++] ---- +using namespace ext::oneapi::experimental; command_graph(const property_list &propList = {}); ---- |Creates a SYCL `command_graph` object in the modifiable state. @@ -408,6 +410,7 @@ Table 6. Member functions of the `command_graph` class. | [source,c++] ---- +using namespace ext::oneapi::experimental; node add(const std::vector& dep = {}); ---- |This creates an empty node which is associated to no task. Its intended use is @@ -415,6 +418,11 @@ either a connection point inside a graph between groups of nodes, and can significantly reduce the number of edges ( O(n) vs. O(n^2) ). Another use-case is building the structure of a graph first and adding tasks later. +Preconditions: + +* This member function is only available when the `command_graph` state is + `graph_state::modifiable`. + Parameters: * `dep` - Nodes the created node will be dependent on. @@ -429,6 +437,7 @@ Exceptions: | [source,c++] ---- +using namespace ext::oneapi::experimental; template node add(T cgf, const std::vector& dep = {}); ---- @@ -437,6 +446,11 @@ object can contain single or multiple commands such as a host task which is scheduled by the SYCL runtime or a SYCL function for invoking kernels with all restrictions that apply as described in the core specification. +Preconditions: + +* This member function is only available when the `command_graph` state is + `graph_state::modifiable`. + Parameters: * `cgf` - Command group function object to be added as a node @@ -453,6 +467,7 @@ Exceptions: | [source,c++] ---- +using namespace ext::oneapi::experimental; void make_edge(node sender, node receiver); ---- @@ -480,6 +495,7 @@ Exceptions: | [source,c++] ---- +using namespace ext::oneapi::experimental; command_graph finalize(const context &syclContext) const; ---- @@ -515,9 +531,15 @@ Table 7. Member functions of the `command_graph` class (memory operations). | [source,c++] ---- +using namespace ext::oneapi::experimental; node add_malloc_device(void *&data, size_t numBytes, const std::vector& dep = {}); ---- -|Adding a node that encapsulates a `malloc` operation. +|Adding a node that encapsulates a memory allocation operation. + +Preconditions: + +* This member function is only available when the `command_graph` state is + `graph_state::modifiable`. Parameters: @@ -537,9 +559,15 @@ Exceptions: | [source,c++] ---- +using namespace ext::oneapi::experimental; node add_free(void *data, const std::vector& dep = {}); ---- -|Adding a node that encapsulates a `free` operation. +|Adding a node that encapsulates a memory free operation. + +Preconditions: + +* This member function is only available when the `command_graph` state is + `graph_state::modifiable`. Parameters: @@ -564,6 +592,7 @@ Table 8. Member functions of the `command_graph` class (executable graph update) | [source, c++] ---- +using namespace ext::oneapi::experimental; void command_graph update(const command_graph &graph); ---- @@ -572,17 +601,17 @@ identical modifiable graph. The effects of the update will be visible on the next submission of the executable graph without the need for additional user synchronization. +Preconditions: + +* This member function is only available when the `command_graph` state is + `graph_state::executable`. + Parameters: * `graph` - Modifiable graph object to update graph node inputs & outputs with. This graph must have the same topology as the original graph used on executable graph creation. -Preconditions: - -* This member function is only available when the `command_graph` state is - `graph_state::executable`. - Exceptions: * Throws synchronously with error code `invalid` if the topology of `graph` is @@ -697,6 +726,7 @@ Exceptions: | [source, c++] ---- +using namespace ext::oneapi::experimental; bool queue::end_recording() ---- @@ -826,33 +856,45 @@ int main() { float *dotp = sycl::malloc_shared(1, q); - auto n_x = g.add_malloc_device(x, n); - auto n_y = g.add_malloc_device(y, n); - auto n_z = g.add_malloc_device(z, n); + // Add commands to the graph to create the following topology. + // + // x y z + // \ | / + // i + // / \ + // a b + // \ / \ + // c fy + // | + // fx + + auto node_x = g.add_malloc_device(x, n * sizeof(float)); + auto node_y = g.add_malloc_device(y, n * sizeof(float)); + auto node_z = g.add_malloc_device(z, n * sizeof(float)); /* init data on the device */ - auto n_i = g.add([&](sycl::handler &h) { + auto node_i = g.add([&](sycl::handler &h) { h.parallel_for(n, [=](sycl::id<1> it){ const size_t i = it[0]; x[i] = 1.0f; y[i] = 2.0f; z[i] = 3.0f; }); - }, {n_x, n_y, n_z}); + }, {node_x, node_y, node_z}); auto node_a = g.add([&](sycl::handler &h) { h.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> it) { const size_t i = it[0]; x[i] = alpha * x[i] + beta * y[i]; }); - }, {n_i}); + }, {node_i}); auto node_b = g.add([&](sycl::handler &h) { h.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> it) { const size_t i = it[0]; z[i] = gamma * z[i] + beta * y[i]; }); - }, {n_i}); + }, {node_i}); auto node_c = g.add( [&](sycl::handler &h) { @@ -865,8 +907,8 @@ int main() { }, {node_a, node_b}); - auto node_f1 = g.add_free(x, {node_c}); - auto node_f2 = g.add_free(y, {node_b}); + auto node_fx = g.add_free(x, {node_c}); + auto node_fy = g.add_free(y, {node_b}); auto exec = g.finalize(q.get_context()); @@ -962,19 +1004,6 @@ submitted as a whole for execution via q.submit(exec_graph); ---- -// == Issues for later investigations -// -// . Explicit memory movement can cause POC to stall. -// -// == Non-implemented features -// Please, note that the following features are not yet implemented: -// -// . Level Zero backend only -// . Memory operation nodes not implemented -// . Host node not implemented -// . Submit overload of a queue. `submit(graph)` Use a combination of `executable_graph::exec_and_wait()` and queue property `sycl::ext::oneapi::property::queue::lazy_execution{}` instead. -// . `class graph` Use dedicated `class graph` (equivalent to `graph_state == modifiable`) and `class executable_graph` (equivalent to `graph_state == executable`) instead. - == Issues === Multi Device Graph