[CUDA] Implement urKernelSuggestMaxCooperativeGroupCountExp for Cuda #1796

Merged


GeorgeWeb
Contributor

@GeorgeWeb GeorgeWeb commented Jun 27, 2024

This commit implements the experimental urKernelSuggestMaxCooperativeGroupCountExp entry point for the CUDA adapter, which retrieves the maximum number of cooperative groups that can be launched on the device.

Additionally, the changes cache the result of the CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT CUDA driver query. The cached value is used to calculate the device-wide maximum number of cooperative groups, because the CUDA occupancy query used has per-SM (multiprocessor) semantics.
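
The per-SM to device-wide scaling described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the adapter's actual code: `DeviceSketch`, `getMultiprocessorCount`, and `suggestMaxGroupCount` are hypothetical names, and a callback stands in for the CUDA driver query so the sketch runs without a GPU.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>

// Hypothetical stand-in for a device handle (not the adapter's real type).
// A real adapter would read CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT from the
// CUDA driver; here a callback simulates that query.
class DeviceSketch {
public:
  explicit DeviceSketch(std::function<int()> Query)
      : QuerySmCount(std::move(Query)) {
    assert(QuerySmCount && "a query callback is required");
  }

  // Returns the multiprocessor count, running the query only on first use
  // and serving the cached value afterwards.
  int getMultiprocessorCount() {
    if (!HasCachedSmCount) {
      CachedSmCount = QuerySmCount();
      HasCachedSmCount = true;
    }
    return CachedSmCount;
  }

private:
  std::function<int()> QuerySmCount;
  int CachedSmCount = 0;
  bool HasCachedSmCount = false;
};

// Scales a per-SM occupancy result (assumed to come from an occupancy query
// with per-multiprocessor semantics) to a device-wide group count.
inline std::uint32_t suggestMaxGroupCount(DeviceSketch &Device,
                                          int MaxActiveBlocksPerSm) {
  return static_cast<std::uint32_t>(MaxActiveBlocksPerSm) *
         static_cast<std::uint32_t>(Device.getMultiprocessorCount());
}
```

For example, with a simulated 80 multiprocessors and 2 resident groups per SM, the sketch yields 160 groups, and repeated calls reuse the cached count instead of repeating the query.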

Testing and related changes enabling querying this from SYCL: intel/llvm#14333

@GeorgeWeb GeorgeWeb requested a review from a team as a code owner June 27, 2024 13:16
@github-actions github-actions bot added the cuda CUDA adapter specific issues label Jun 27, 2024
Contributor

@konradkusiak97 konradkusiak97 left a comment

Nice, LGTM.

@pbalcer
Contributor

pbalcer commented Jun 27, 2024

2024-06-27T14:32:19.4840797Z Failed Tests (1):
2024-06-27T14:32:19.4849324Z   SYCL :: GroupAlgorithm/root_group.cpp

@GeorgeWeb
Contributor Author

GeorgeWeb commented Jun 27, 2024

> 2024-06-27T14:32:19.4840797Z Failed Tests (1):
> 2024-06-27T14:32:19.4849324Z   SYCL :: GroupAlgorithm/root_group.cpp

@pbalcer Yes, I'm aware, thanks! The root group barrier is currently not supported correctly for cooperative-group kernels in the CUDA backend, so the corresponding intel/llvm PR will mark this test as XFAIL until it is implemented.

The test previously passed because the query returned a single group, so the kernel was calling a work-group-level barrier rather than a device-wide (cross-work-group) one.

…ter backend from the sycl runtime

This change is required in order to implement per-device semantics for the
urKernelSuggestMaxCooperativeGroupCountExp query.
@GeorgeWeb GeorgeWeb force-pushed the georgi/ur_kernel_max_active_wgs branch from 9dcdc62 to 45a781f Compare September 6, 2024 10:11
@GeorgeWeb
Contributor Author

GeorgeWeb commented Sep 6, 2024

After the last rebase, there is one e2e failure that seems unrelated:

SYCL :: Regression/device_num.cpp

@GeorgeWeb GeorgeWeb added the ready to merge Added to PR's which are ready to merge label Sep 6, 2024
@omarahmed1111 omarahmed1111 merged commit eb63d1a into oneapi-src:main Sep 10, 2024
71 of 72 checks passed
Labels
cuda CUDA adapter specific issues experimental Experimental feature additions/changes/specification ready to merge Added to PR's which are ready to merge
4 participants