Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SYCL] Implement max_num_work_groups from the launch queries extension #14333

Merged
merged 44 commits into from
Sep 11, 2024

Conversation

GeorgeWeb
Copy link
Contributor

@GeorgeWeb GeorgeWeb commented Jun 27, 2024

This PR implements the max_num_work_groups query from the sycl_ext_oneapi_launch_queries extension.

Additionally, this PR introduces changes that overload ext_oneapi_get_info for another kernel-queue-specific query - max_num_work_group_sync to take extra parameters for local work-group size and dynamic local memory size (in bytes) in order to allow users to pass those runtime resource limiting factors to the query, so they are taken into account in the final group count suggestion.

@GeorgeWeb
Copy link
Contributor Author

GeorgeWeb commented Jun 27, 2024

This is required for CUTLASS.
I am converting to DRAFT for now as there is a unified-runtime dependency: oneapi-src/unified-runtime#1796.

Update: Ready for review.

The currently proposed and implemented query is `max_num_work_group_occupancy_per_cu`
which retrieves the maximum actively executing workgroups based on compute unit occupancy
granularity.

This commit also fixes an issue in the `max_num_work_group_sync` query that could
have previously lead to out of launch resources issue.

Additionally, it also overloads the `max_num_num_work_group_sync` query to take
extra parameters for local work-group size and local dynamic memory size (in
bytes) in order to be allow users to pass those important resource usage factors
to the query, so they are take in account in the final group count suggestion.
This overload is currently only usable when targetting Cuda.
Copy link
Contributor

@gmlueck gmlueck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you update the wording in the PR description? It talks about max_num_num_work_group_sync, which I guess is an old name for the query?

This overload is currently only usable when targetting Cuda.

What is preventing us from implementing the API on Level Zero? If we cannot implement it on Level Zero, we should add a section "Backend support status" to the spec indicating what is supported. However, it would be better to implement it universally from the start.

Additional comments below.

possible occupancy in a portable way.

List of currently planned queries.
* max_num_work_group_occupancy_per_cu
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't render well in HTML. Asciidoc requires a blank line before the bullet.

Are you planning to add more queries to this extension soon? This list of currently planned queries seems odd. I'd suggest removing it unless you have some specific plan to add more things.

Copy link
Contributor Author

@GeorgeWeb GeorgeWeb Jul 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was planning on at least one more, being recommended_work_group_size. This would be useful in combination with the currently added one, to let the runtime assist in selecting a configuration for max HW occupancy. This is not super useful in most kernel launch configurations, but for small ones that are not the hot-path where sycl::nd_range is specified explicitly and manual fine-tuning is not required, it is a useful feature. However, I am not adding this yet.

Also, this is not much of a list with one addition at this point, so I am removing it.
Thank you for questioning this!

|Returns the maximum number of actively executing work-groups per compute unit
granularity, when the kernel is submitted to the specified queue with the
specified work-group size and the specified amount of dynamic work-group local
memory (in bytes). The actively executing work-groups are those that occupy
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be good to be a little more detailed about what counts as dynamic work-group local memory. I assume this is the sum of the sizes of all local accessors, right?

Copy link
Contributor Author

@GeorgeWeb GeorgeWeb Jul 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dynamically allocated (SYCL) local memory, for which the size is know only at runtime and can change between kernel submissions, so yes, you are right. I will elaborate in a little more detail to make it clear.
I can see how it is not clear the way I phrased it.

specified work-group size and the specified amount of dynamic work-group local
memory (in bytes). The actively executing work-groups are those that occupy
the fundamental hardware unit responsible for the execution of work-groups in
parallel.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the idea that max_num_work_group_occupancy_per_cu returns a recommended work-group size? If that is the case, can we rename the query to something like recommended_num_work_groups?

Copy link
Contributor Author

@GeorgeWeb GeorgeWeb Jul 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a recommendation for the maximum number of work-groups (or Cuda blocks) of specified block size etc. that will theoretically execute concurrently on the compute unit (Cuda SM) to achieve maximum occupancy.

I think I like the naming you suggest and it does sound to me like a recommendation. What do you think would be a good name based on my description? (I am sold based on the fact the original I came up with sounds a little weird.)

Thank you, @gmlueck !

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I renamed it to recommended_num_work_groups. Initially, I wanted to indicate that this is not per-device semantics but per compute unit (or whatever this maps to in the HW, i.e. SM for Cuda, EU for Intel Level-Zero or CU for AMD HIP), hence why I had the _per_cu in the name. However, the extension docs describe the semantics, so I think that's okay now.

@@ -0,0 +1,4 @@
// TODO: Revisit 'max_num_work_group_sync' and align it with the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#7598 has been merged a while ago, this comment should be removed. I also suggest that we deprecate max_num_work_group_sync info trait right away in favor of max_num_work_groups. The former is not documented so we should aim to remove it as soon as possible to avoid wide adoption of it.

Copy link
Contributor Author

@GeorgeWeb GeorgeWeb Sep 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am aware and was thinking the same. However, I saw it as doing two separate things in one PR that eventually ultimately lands as a single squashed commit and have opted to follow-up with a separate PR solely deprecating max_num_work_group_sync. Of course, I am not arguing against your suggestion and preferences. Having said that, let me know how you'd prefer it done. I am fine either way. Thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with doing so in a separate PR to have it a separate commit in our history

@@ -159,9 +159,29 @@ class __SYCL_EXPORT kernel : public detail::OwnerLessBase<kernel> {
get_info(const device &Device, const range<3> &WGSize) const;

// TODO: Revisit and align with sycl_ext_oneapi_forward_progress extension
// once #7598 is merged.
// once #7598 is merged. (regarding the 'max_num_work_group_sync' query)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, #7598 has been already merged. Any APIs which do not correspond to root group or launch queries extension should be marked as deprecated together with introduction of a proper API that is documented.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as the other response wrt this. I see you suggested the introduction of a replacement and deprecating the replaced one go hand-in-hand together. So would you prefer that to happen in separate PR in order for the changes to land in separate squashed commits or you want them in one and just expand on the description?

sycl/include/sycl/kernel.hpp Outdated Show resolved Hide resolved
sycl/source/detail/kernel_impl.cpp Show resolved Hide resolved
sycl/source/detail/kernel_impl.hpp Show resolved Hide resolved
sycl/source/detail/kernel_impl.hpp Outdated Show resolved Hide resolved
sycl/test-e2e/Basic/launch_queries/max_num_work_groups.cpp Outdated Show resolved Hide resolved
@@ -3912,7 +3912,9 @@ _ZNK4sycl3_V16kernel16get_backend_infoINS0_4info6device15backend_versionEEENS0_6
_ZNK4sycl3_V16kernel16get_backend_infoINS0_4info6device7versionEEENS0_6detail20is_backend_info_descIT_E11return_typeEv
_ZNK4sycl3_V16kernel16get_backend_infoINS0_4info8platform7versionEEENS0_6detail20is_backend_info_descIT_E11return_typeEv
_ZNK4sycl3_V16kernel17get_kernel_bundleEv
_ZNK4sycl3_V16kernel19ext_oneapi_get_infoINS0_3ext6oneapi12experimental4info21kernel_queue_specific23max_num_work_group_syncEEENT_11return_typeERKNS0_5queueE
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note for reviewers: this is an ABI break. However, that symbol is from an experimental extension and as I understand it we can do such change even outside of ABI-breaking window. Still, would be good to hear feedback on that from someone else.

@omarahmed1111
Copy link
Contributor

omarahmed1111 commented Sep 10, 2024

@GeorgeWeb UR PR merged, please update UR tag and fix the conflict to merge this, Thanks!

@GeorgeWeb
Copy link
Contributor Author

GeorgeWeb commented Sep 11, 2024

Unrelated passing XFAIL: https://github.com/intel/llvm/actions/runs/10809701571/job/29985720662?pr=14333#step:22:2286

********************
Unexpectedly Passed Tests (1):
  SYCL :: OneapiDeviceSelector/no_duplicate_devices.cpp

see #15341 and #15288

@GeorgeWeb
Copy link
Contributor Author

GeorgeWeb commented Sep 11, 2024

Wrt the Doxygen docs build failure:
see issue #15355 for the failing Generate Doxygen documentation build action. The link pointing to the now removed kernel-fusion extension is wrong.

@omarahmed1111
Copy link
Contributor

@intel/llvm-gatekeepers Please merge, the failures there are unrelated.

@AlexeySachkov AlexeySachkov merged commit 81aacfa into intel:sycl Sep 11, 2024
12 of 14 checks passed
@AlexeySachkov
Copy link
Contributor

AlexeySachkov commented Sep 11, 2024

Note for other gatekeepers who may come here from failed post-commit. Those failures were noticed, we had already synced with @GeorgeWeb about them and there is a follow-up PR expected with slight tweaks to tests to fix that failure

@GeorgeWeb
Copy link
Contributor Author

As @AlexeySachkov already noted (thanks). The follow-up PR that should fix the failure is this one #15359.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants