Skip to content

Commit cbb6fea

Browse files
committed
[SYCL][Ext] Query kernel maximum active work-groups based on occupancy
The currently proposed and implemented query is `max_num_work_group_occupancy_per_cu` which retrieves the maximum actively executing workgroups based on compute unit occupancy granularity. This commit also fixes an issue in the `max_num_work_group_sync` query that could have previously lead to out of launch resources issue. Additionally, it also overloads the `max_num_num_work_group_sync` query to take extra parameters for local work-group size and local dynamic memory size (in bytes) in order to be allow users to pass those important resource usage factors to the query, so they are take in account in the final group count suggestion. This overload is currently only usable when targetting Cuda.
1 parent 7193c26 commit cbb6fea

File tree

14 files changed

+367
-19
lines changed

14 files changed

+367
-19
lines changed
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,154 @@
1+
= sycl_ext_oneapi_group_occupancy_queries
2+
3+
:source-highlighter: coderay
4+
:coderay-linenums-mode: table
5+
6+
// This section needs to be after the document title.
7+
:doctype: book
8+
:toc2:
9+
:toc: left
10+
:encoding: utf-8
11+
:lang: en
12+
:dpcpp: pass:[DPC++]
13+
14+
// Set the default source code type in this document to C++,
15+
// for syntax highlighting purposes. This is needed because
16+
// docbook uses c++ and html5 uses cpp.
17+
:language: {basebackend@docbook:c++:cpp}
18+
19+
20+
== Notice
21+
22+
[%hardbreaks]
23+
Copyright (C) 2024 Intel Corporation. All rights reserved.
24+
25+
Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are trademarks
26+
of The Khronos Group Inc. OpenCL(TM) is a trademark of Apple Inc. used by
27+
permission by Khronos.
28+
29+
30+
== Contact
31+
32+
To report problems with this extension, please open a new issue at:
33+
34+
https://github.com/intel/llvm/issues
35+
36+
37+
== Dependencies
38+
39+
This extension is written against the SYCL 2020 revision 5 specification. All
40+
references below to the "core SYCL specification" or to section numbers in the
41+
SYCL specification refer to that revision.
42+
43+
This extension also depends on the following other SYCL extensions:
44+
45+
* link:../proposed/sycl_ext_oneapi_launch_queries.asciidoc[
46+
sycl_ext_oneapi_launch_queries]
47+
48+
49+
== Status
50+
51+
This is an experimental extension specification, intended to provide early
52+
access to features and gather community feedback. Interfaces defined in this
53+
specification are implemented in {dpcpp}, but they are not finalized and may
54+
change incompatibly in future versions of {dpcpp} without prior notice.
55+
*Shipping software products should not rely on APIs defined in this
56+
specification.*
57+
58+
59+
== Overview
60+
61+
This extension is based on the kernel-queue-specific specific querying mechanism
62+
introduced by the sycl_ext_oneapi_launch_queries extension.
63+
64+
The purpose of queries the to be added is to aid occupancy based calculations
65+
for kernel launches based on hardware occupancy per compute unit granularity.
66+
The queries take in account the kernel resources and user-specified constraints,
67+
such as, but not limited to, local (work-group) size and dynamic work-group
68+
local memory (in bytes). The motivation behind is to aid the tuning of kernels,
69+
by being able to design the algorithm's implementation to maintain the highest
70+
possible occupancy in a portable way.
71+
72+
List of currently planned queries.
73+
* max_num_work_group_occupancy_per_cu
74+
75+
[source,c++]
76+
----
77+
sycl::queue q{};
78+
auto bundle = sycl::get_kernel_bundle(q.get_context());
79+
auto kernel = bundle.get_kernel<class KernelName>();
80+
81+
auto wgSizeRange = sycl::range{32, 1, 1};
82+
size_t localMemorySize = 32;
83+
84+
namespace syclex = sycl::ext::oneapi::experimental;
85+
uint32_t maxWGsPerCU = kernel.ext_oneapi_get_info<
86+
syclex::info::kernel_queue_specific::max_num_work_group_occupancy_per_cu>(
87+
q, wgSizeRange, localMemorySize);
88+
----
89+
90+
NOTE: SYCL 2020 requires lambdas to be named in order to locate the associated
91+
`sycl::kernel` object used to query information descriptors. Reducing the
92+
verbosity of the queries shown above is left to a future extension.
93+
94+
95+
== Specification
96+
97+
=== Feature test macro
98+
99+
This extension provides a feature-test macro as described in the core SYCL
100+
specification. An implementation supporting this extension must predefine the
101+
macro `SYCL_EXT_ONEAPI_GROUP_OCCUPANCY_QUERIES` to one of the values defined in
102+
the table below. Applications can test for the existence of this macro to
103+
determine if the implementation supports this feature, or applications can test
104+
the macro's value to determine which of the extension's features the
105+
implementation supports.
106+
107+
[%header,cols="1,5"]
108+
|===
109+
|Value
110+
|Description
111+
112+
|1
113+
|The APIs of this experimental extension are not versioned, so the
114+
feature-test macro always has this value.
115+
|===
116+
117+
118+
=== Occupancy queries
119+
120+
[source, c++]
121+
----
122+
namespace ext::oneapi::experimental::info::kernel {
123+
124+
struct max_num_work_group_occupancy_per_cu;
125+
126+
}
127+
----
128+
129+
[%header,cols="1,5,5,5"]
130+
|===
131+
|Kernel Descriptor
132+
|Argument Types
133+
|Return Type
134+
|Description
135+
136+
|`max_num_work_group_occupancy_per_cu`
137+
|`sycl::queue`, `sycl::range`, `size_t`
138+
|`uint32_t`
139+
|Returns the maximum number of actively executing work-groups per compute unit
140+
granularity, when the kernel is submitted to the specified queue with the
141+
specified work-group size and the specified amount of dynamic work-group local
142+
memory (in bytes). The actively executing work-groups are those that occupy
143+
the fundamental hardware unit responsible for the execution of work-groups in
144+
parallel.
145+
146+
|===
147+
148+
== Implementation notes
149+
150+
The implementation needs to define `sycl::kernel::ext_onapi_get_info` with the
151+
extra `sycl::range` and `size_t` parameters in addition to the `sycl::queue`.
152+
153+
The Cuda, Hip and Level Zero backend adapters have the required infrastructure
154+
required to implement the extension.

sycl/include/sycl/detail/info_desc_helpers.hpp

+12
Original file line numberDiff line numberDiff line change
@@ -30,6 +30,8 @@ template <typename T> struct is_queue_info_desc : std::false_type {};
3030
template <typename T> struct is_kernel_info_desc : std::false_type {};
3131
template <typename T>
3232
struct is_kernel_device_specific_info_desc : std::false_type {};
33+
template <typename T>
34+
struct is_kernel_queue_specific_info_desc : std::false_type {};
3335
template <typename T> struct is_event_info_desc : std::false_type {};
3436
template <typename T> struct is_event_profiling_info_desc : std::false_type {};
3537
// Normally we would just use std::enable_if to limit valid get_info template
@@ -128,6 +130,16 @@ struct IsSubGroupInfo<info::kernel_device_specific::compile_sub_group_size>
128130
#include <sycl/info/ext_intel_device_traits.def>
129131
#include <sycl/info/ext_oneapi_device_traits.def>
130132
#undef __SYCL_PARAM_TRAITS_SPEC
133+
134+
#define __SYCL_PARAM_TRAITS_SPEC(Namespace, DescType, Desc, ReturnT, PiCode) \
135+
template <> \
136+
struct is_##DescType##_info_desc<Namespace::info::DescType::Desc> \
137+
: std::true_type { \
138+
using return_type = Namespace::info::DescType::Desc::return_type; \
139+
};
140+
#include <sycl/info/ext_kernel_queue_specific_traits.def>
141+
#undef __SYCL_PARAM_TRAITS_SPEC
142+
131143
#define __SYCL_PARAM_TRAITS_SPEC(DescType, Desc, ReturnT, PiCode) \
132144
template <> \
133145
struct is_backend_info_desc<info::DescType::Desc> : std::true_type { \

sycl/include/sycl/detail/pi.h

+3-1
Original file line numberDiff line numberDiff line change
@@ -565,7 +565,9 @@ typedef enum {
565565
PI_KERNEL_GROUP_INFO_PREFERRED_WORK_GROUP_SIZE_MULTIPLE = 0x11B3,
566566
PI_KERNEL_GROUP_INFO_PRIVATE_MEM_SIZE = 0x11B4,
567567
// The number of registers used by the compiled kernel (device specific)
568-
PI_KERNEL_GROUP_INFO_NUM_REGS = 0x10112
568+
PI_KERNEL_GROUP_INFO_NUM_REGS = 0x10112,
569+
PI_EXT_CODEPLAY_KERNEL_GROUP_INFO_MAX_NUM_ACTIVE_WORK_GROUPS = 0x10113,
570+
PI_EXT_CODEPLAY_KERNEL_GROUP_INFO_MAX_NUM_ACTIVE_WORK_GROUPS_NO_CACHE = 0x10114
569571
} _pi_kernel_group_info;
570572

571573
typedef enum {

sycl/include/sycl/ext/oneapi/experimental/device_architecture.hpp

+7
Original file line numberDiff line numberDiff line change
@@ -194,6 +194,9 @@ static constexpr ext::oneapi::experimental::architecture
194194
#ifndef __SYCL_TARGET_NVIDIA_GPU_SM90__
195195
#define __SYCL_TARGET_NVIDIA_GPU_SM90__ 0
196196
#endif
197+
#ifndef __SYCL_TARGET_NVIDIA_GPU_SM90A__
198+
#define __SYCL_TARGET_NVIDIA_GPU_SM90A__ 0
199+
#endif
197200
#ifndef __SYCL_TARGET_AMD_GPU_GFX700__
198201
#define __SYCL_TARGET_AMD_GPU_GFX700__ 0
199202
#endif
@@ -357,6 +360,7 @@ static constexpr bool is_allowable_aot_mode =
357360
(__SYCL_TARGET_NVIDIA_GPU_SM87__ == 1) ||
358361
(__SYCL_TARGET_NVIDIA_GPU_SM89__ == 1) ||
359362
(__SYCL_TARGET_NVIDIA_GPU_SM90__ == 1) ||
363+
(__SYCL_TARGET_NVIDIA_GPU_SM90A__ == 1) ||
360364
(__SYCL_TARGET_AMD_GPU_GFX700__ == 1) ||
361365
(__SYCL_TARGET_AMD_GPU_GFX701__ == 1) ||
362366
(__SYCL_TARGET_AMD_GPU_GFX702__ == 1) ||
@@ -529,6 +533,9 @@ get_current_architecture_aot() {
529533
#if __SYCL_TARGET_NVIDIA_GPU_SM90__
530534
return ext::oneapi::experimental::architecture::nvidia_gpu_sm_90;
531535
#endif
536+
#if __SYCL_TARGET_NVIDIA_GPU_SM90A__
537+
return ext::oneapi::experimental::architecture::nvidia_gpu_sm_90a;
538+
#endif
532539
#if __SYCL_TARGET_AMD_GPU_GFX700__
533540
return ext::oneapi::experimental::architecture::amd_gpu_gfx700;
534541
#endif

sycl/include/sycl/ext/oneapi/experimental/root_group.hpp

+1-3
Original file line numberDiff line numberDiff line change
@@ -22,9 +22,7 @@ namespace ext::oneapi::experimental {
2222
namespace info::kernel_queue_specific {
2323
// TODO: Revisit and align with sycl_ext_oneapi_forward_progress extension once
2424
// #7598 is merged.
25-
struct max_num_work_group_sync {
26-
using return_type = size_t;
27-
};
25+
// Defined in 'sycl/info/kernel_device_specific_traits.def'
2826
} // namespace info::kernel_queue_specific
2927

3028
template <int Dimensions> class root_group {

sycl/include/sycl/info/info_desc.hpp

+2
Original file line numberDiff line numberDiff line change
@@ -207,6 +207,8 @@ struct work_item_progress_capabilities;
207207
#include <sycl/info/ext_codeplay_device_traits.def>
208208
#include <sycl/info/ext_intel_device_traits.def>
209209
#include <sycl/info/ext_oneapi_device_traits.def>
210+
#include <sycl/info/ext_kernel_queue_specific_traits.def>
211+
210212
#undef __SYCL_PARAM_TRAITS_SPEC
211213
#undef __SYCL_PARAM_TRAITS_TEMPLATE_SPEC
212214
} // namespace _V1

sycl/include/sycl/kernel.hpp

+7-1
Original file line numberDiff line numberDiff line change
@@ -174,7 +174,13 @@ class __SYCL_EXPORT kernel : public detail::OwnerLessBase<kernel> {
174174
// TODO: Revisit and align with sycl_ext_oneapi_forward_progress extension
175175
// once #7598 is merged.
176176
template <typename Param>
177-
typename Param::return_type ext_oneapi_get_info(const queue &q) const;
177+
typename detail::is_kernel_queue_specific_info_desc<Param>::return_type
178+
ext_oneapi_get_info(const queue &q) const;
179+
180+
template <typename Param>
181+
typename detail::is_kernel_queue_specific_info_desc<Param>::return_type
182+
ext_oneapi_get_info(const queue &Queue, const range<3> &WorkGroupSize,
183+
size_t DynamicLocalMemorySize) const;
178184

179185
private:
180186
/// Constructs a SYCL kernel object from a valid kernel_impl instance.

sycl/plugins/unified_runtime/CMakeLists.txt

+2-2
Original file line numberDiff line numberDiff line change
@@ -119,8 +119,8 @@ if(SYCL_PI_UR_USE_FETCH_CONTENT)
119119
)
120120

121121
fetch_adapter_source(cuda
122-
${UNIFIED_RUNTIME_REPO}
123-
${UNIFIED_RUNTIME_TAG}
122+
https://github.com/GeorgeWeb/unified-runtime.git
123+
f0cb1c8bea3347078cc08909d10b3b78f58fdebc
124124
)
125125

126126
fetch_adapter_source(hip

sycl/source/detail/kernel_impl.hpp

+65-7
Original file line numberDiff line numberDiff line change
@@ -153,6 +153,11 @@ class kernel_impl {
153153
template <typename Param>
154154
typename Param::return_type ext_oneapi_get_info(const queue &q) const;
155155

156+
template <typename Param>
157+
typename Param::return_type
158+
ext_oneapi_get_info(const queue &Queue, const range<3> &MaxWorkGroupSize,
159+
size_t DynamicLocalMemorySize) const;
160+
156161
/// Get a reference to a raw kernel object.
157162
///
158163
/// \return a reference to a valid PiKernel instance with raw kernel object.
@@ -269,22 +274,75 @@ kernel_impl::get_info(const device &Device,
269274
getPlugin());
270275
}
271276

277+
namespace syclex = ext::oneapi::experimental;
278+
279+
template <>
280+
inline typename syclex::info::kernel_queue_specific::
281+
max_num_work_group_occupancy_per_cu::return_type
282+
kernel_impl::ext_oneapi_get_info<syclex::info::kernel_queue_specific::
283+
max_num_work_group_occupancy_per_cu>(
284+
const queue &Queue, const range<3> &WorkGroupSize,
285+
size_t DynamicLocalMemorySize) const {
286+
if (WorkGroupSize.size() == 0) {
287+
throw runtime_error("The launch work-group size cannot be zero.",
288+
PI_ERROR_INVALID_WORK_GROUP_SIZE);
289+
}
290+
291+
const auto &Plugin = getPlugin();
292+
const auto &Handle = getHandleRef();
293+
const auto &Device = Queue.get_device();
294+
295+
// Calculate max number of work-groups per compute unit
296+
const auto NumCUs = Device.get_info<info::device::max_compute_units>();
297+
298+
pi_uint32 GroupCount{0};
299+
Plugin->call<PiApiKind::piextKernelSuggestMaxCooperativeGroupCount>(
300+
Handle, WorkGroupSize.size(), DynamicLocalMemorySize, &GroupCount);
301+
return GroupCount / NumCUs;
302+
}
303+
272304
template <>
273-
inline typename ext::oneapi::experimental::info::kernel_queue_specific::
274-
max_num_work_group_sync::return_type
305+
inline typename syclex::info::kernel_queue_specific::max_num_work_group_sync::
306+
return_type
275307
kernel_impl::ext_oneapi_get_info<
276-
ext::oneapi::experimental::info::kernel_queue_specific::
277-
max_num_work_group_sync>(const queue &Queue) const {
308+
syclex::info::kernel_queue_specific::max_num_work_group_sync>(
309+
[[maybe_unused]] const queue &Queue, const range<3> &WorkGroupSize,
310+
size_t DynamicLocalMemorySize) const {
311+
if (WorkGroupSize.size() == 0) {
312+
throw runtime_error("The launch work-group size cannot be zero.",
313+
PI_ERROR_INVALID_WORK_GROUP_SIZE);
314+
}
315+
278316
const auto &Plugin = getPlugin();
279317
const auto &Handle = getHandleRef();
280-
const auto MaxWorkGroupSize =
281-
Queue.get_device().get_info<info::device::max_work_group_size>();
318+
282319
pi_uint32 GroupCount = 0;
283320
Plugin->call<PiApiKind::piextKernelSuggestMaxCooperativeGroupCount>(
284-
Handle, MaxWorkGroupSize, /* DynamicSharedMemorySize */ 0, &GroupCount);
321+
Handle, WorkGroupSize.size(), DynamicLocalMemorySize, &GroupCount);
285322
return GroupCount;
286323
}
287324

325+
template <>
326+
inline typename syclex::info::kernel_queue_specific::max_num_work_group_sync::
327+
return_type
328+
kernel_impl::ext_oneapi_get_info<
329+
syclex::info::kernel_queue_specific::max_num_work_group_sync>(
330+
const queue &Queue) const {
331+
const auto &Device = Queue.get_device();
332+
// Prevent out of launch resources for Cuda if this is used for calculating
333+
// the total work group size for kernel launches, by restricting the max size
334+
// to the kernel_device_specific maximum.
335+
const auto MaxWorkGroupSize =
336+
(Device.get_backend() == backend::ext_oneapi_cuda)
337+
? get_info<info::kernel_device_specific::work_group_size>(Device)
338+
: Device.get_info<info::device::max_work_group_size>();
339+
340+
return ext_oneapi_get_info<
341+
syclex::info::kernel_queue_specific::max_num_work_group_sync>(
342+
Queue, sycl::range{MaxWorkGroupSize, 1, 1},
343+
/* DynamicLocalMemorySize */ 0);
344+
}
345+
288346
} // namespace detail
289347
} // namespace _V1
290348
} // namespace sycl

sycl/source/feature_test.hpp.in

+1
Original file line numberDiff line numberDiff line change
@@ -108,6 +108,7 @@ inline namespace _V1 {
108108
#define SYCL_EXT_ONEAPI_FREE_FUNCTION_KERNELS 1
109109
#define SYCL_EXT_ONEAPI_PROD 1
110110
#define SYCL_EXT_ONEAPI_ENQUEUE_FUNCTIONS 1
111+
#define SYCL_EXT_ONEAPI_GROUP_OCCUPANCY_QUERIES 1
111112

112113
#ifndef __has_include
113114
#define __has_include(x) 0

0 commit comments

Comments
 (0)