Skip to content

Handler-less kernel submit API #19294

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 28 commits into
base: sycl
Choose a base branch
from

Conversation

slawekptak
Copy link
Contributor

No description provided.

@@ -3680,6 +3743,21 @@ class __SYCL_EXPORT queue : public detail::OwnerLessBase<queue> {
const detail::code_location &CodeLoc,
bool IsTopCodeLoc) const;

event submit_with_event_impl(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about eventless? It is not done yet, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think it would be similar, so I've skipped it for now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the disadvantage of returning optional<event> and having somewhere (probably, in SubmissionInfo, as this is mode of submission) a flag, pointing out is it event or eventless mode? I think about bunch of functions that pass arguments by chain and about duplicating them (for event and for eventless) and this is not looks good. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if returning std::optional<event> is a good idea because of ABI concerns. It might not have a stable ABI across compiler versions or even different standard libraries (libstdc++ vs libc++).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if returning std::optional<event> is a good idea because of ABI concerns. It might not have a stable ABI across compiler versions or even different standard libraries (libstdc++ vs libc++).

Yes, good point.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably, sycl::detail::optional might be considered.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sycl::detail::optional might work, good idea

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But still we need to care about the stable layout of the sycl::detail::optional. I am not sure that we are doing it today.

I think having two versions (that return sycl::event and return void) might be a good alternative.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't generally do that and we rely on backward compatibility guarantees of the C++ library we use (GNU libstdc++ on Linux/MSVC on Windows). The only exception is pre-C++11 ABI of GNU libstdc++ that pyTorch used to use (see https://gcc.gnu.org/onlinedocs/libstdc++/manual/using_dual_abi.html). I don't see std::optional listed on that page, so we should be safe to use it.

Another possible caveat is if some STL's implementation of it isn't is_sycl_device_copyable. I think that might have been a reason why we added sycl::detail::optional (or maybe it was simply added when we used C++14, `std::optional' is C++17 and above).

Anyway, unless you have a known case when it doesn't work, the current approach in the rest of the project is to use std::optional, AFAIK.

Copy link
Contributor

@vinser52 vinser52 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this PR, I would like to see at least one public interface implementation that utilizes this approach, just to ensure it works.

@slawekptak
Copy link
Contributor Author

In this PR, I would like to see at least one public interface implementation that utilizes this approach, just to ensure it works.

In the latest update, there are two public interfaces: The enqueue functions extension, and queue.parallel_for. Both are enabled only if __DPCPP_ENABLE_UNFINISHED_NO_CGH_SUBMIT is defined.

expose the new APIs as public under a new define
template <typename KernelName = sycl::detail::auto_name, typename PropertiesT,
typename KernelType, int Dims>
void submit(const queue &Q, PropertiesT Props, nd_range<Dims> Range,
const KernelType &KernelFunc,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not rvalue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like the convention in this file is to pass the KernelFunc as lvalue reference. Maybe it would make sense to change it everywhere in a separate PR, for consistency.

Comment on lines +182 to +183
int &KernelNumArgs() { return MKernelNumArgs; }
const int &KernelNumArgs() const { return MKernelNumArgs; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer explicit getter/setter methods instead of this ugly approach that forces us to return int by reference. Also caller code that uses this class will look better and less error-prone with explicit getter/setter methods.

Copy link
Contributor

@aelovikov-intel aelovikov-intel Aug 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, if we need a setter, then design is wrong. Why would a kernel change its number of arguments?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, the ProcessKernelRuntimeInfo function should be redesigned to return the KernelRuntimeInfo object instead of accepting one by reference and initializing via setter methods. We can get rid of setters and initialize it using ctor.

But the main question is asked by @aelovikov-intel above if we need KernelRuntimeInfo at all.

}
}

Args = extractArgsAndReqsFromLambda(KRInfo.GetKernelFuncPtr(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case of the path with a handler, are we extracting args on every submission? Can we cache it somehow?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a given kernel type, the value of arguments can change between the invocations, right?

Copy link
Contributor

@aelovikov-intel aelovikov-intel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the proper approach should be to separate handler-based submit implementation to properly separate lifetime extension (via copy/move) and the actual enqueue that would be handler-less. Every new handler-less API should be immediately used by handler-based submission path by delegating to it.

Just writing a bunch of new code on the side without integrating it into existing submission path is a very bad choice.

Comment on lines +182 to +183
int &KernelNumArgs() { return MKernelNumArgs; }
const int &KernelNumArgs() const { return MKernelNumArgs; }
Copy link
Contributor

@aelovikov-intel aelovikov-intel Aug 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, if we need a setter, then design is wrong. Why would a kernel change its number of arguments?

Comment on lines +155 to +157
// This class is intended to store the kernel runtime information,
// extracted from the compile time kernel structures.
class __SYCL_EXPORT KernelRuntimeInfo {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why isn't this unified with @sergey-semenov 's kernel-name-based cache? They both serve the same purpose of type-erasing kernel information.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently this info is stored in the handler, and this is a new structure which wraps it for no-handler cases. Are you suggesting, that this should be moved from the handler to the kernel name based cache, and then used in both flows?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that this should be moved from the handler to the kernel name based cache, and then used in both flows

That sounds very reasonable.

#ifdef __DPCPP_ENABLE_UNFINISHED_NO_CGH_SUBMIT
template <typename KernelName, typename PropertiesT, typename KernelType,
int Dims>
void submit_direct_impl(const queue &Q, PropertiesT Props, nd_range<Dims> Range,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having lots of layers of tiny template helpers is bad for compile time, why can't it be inlined?

Ideally, most of interfaces accepting the kernel type as a template param must process compile-time properties immediately and only call interfaces that accept type-erased kernel.

Additionally, less tiny layers makes the code much easier to read.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code follows the convention which is there for handler based submissions. We should probably refactor the entire file in a separate PR (after this PR is merged).

@@ -312,6 +312,57 @@ event queue::submit_with_event_impl(
return impl->submit_with_event(CGH, SubmitInfo, CodeLoc, IsTopCodeLoc);
}

#ifdef __INTEL_PREVIEW_BREAKING_CHANGES
event queue::submit_direct_with_event_impl(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having overloads vs templates is important for public APIs (because of the implicit conversion), but for our implementation details we can just use template to reduce amount of boilerplate code: https://godbolt.org/z/rPW4jx8h7

Headers will only have template declaration, .cpp file will export necessary instantiations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, we can do this. Let's wait until the ABI is more stable, since we might be able to simplify and avoid the dimensions template here.

Comment on lines +195 to +198
bool &KernelHasSpecialCaptures() { return MKernelHasSpecialCaptures; }
const bool &KernelHasSpecialCaptures() const {
return MKernelHasSpecialCaptures;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How important is it to support that scenario (having special captures)? I'd hope that we will be changing the decomposition approach relatively soon and that code path will look very different in how we pass/set kernel arguments.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this PR it is only used to exclude an unsupported case, so I would leave it for now, until the approach is changed.

#ifdef __DPCPP_ENABLE_UNFINISHED_NO_CGH_SUBMIT
template <typename KernelName = sycl::detail::auto_name, typename PropertiesT,
typename KernelType, int Dims>
event submit_with_event(const queue &Q, PropertiesT Props, nd_range<Dims> Range,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't find where this function is used. Could you please clarify?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is part of the API to be called by the app, when an event is needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So that's for unimplemented yet part of code. Thanks!

typename TransformUserItemType<Dims, LambdaArgType>::type>;

KRInfo.HostKernel().reset(
new detail::HostKernel<KernelType, TransformedArgType, Dims>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to allocate it on heap?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently the scheduler takes a shared_ptr type argument for the HostKernel, and stores it until the kernel is actually submitted. Do you think we should rather pass the object by value to the scheduler?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants