Skip to content

[SYCL] Optimize checkValueRange #18296

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: sycl
Choose a base branch
from

Conversation

Pennycook
Copy link
Contributor

checkValueRange is used to determine if an nd_range is compatible with -fsycl-queries-fit-in-int, and is run as part of every kernel launch.

The previous implementation checked the size of each component of the global range, local range, offset, and global range + offset, and also checked the linearized version of each of these values.

The new implementation simplifies these checks, based on the following logic:

  • The linear global range size must be >= every component of the global range.
    If the linear global range fits in int, we don't need to check anything else.

  • Each value in the global range must be >= the value in the local range.
    If the global range fits in int, we don't need to check the local range.

  • There is no need to check offset-related values if the offset is zero.

The new implementation also makes use of __builtin_mul_overflow where available. This shifts the burden of maintaining fast code for these checks to the compiler, and allows us to benefit from aggressive optimizations.

The new implementation could be optimized further if there was a quick way to check whether an nd_range has an offset.

checkValueRange is used to determine if an nd_range is compatible with
-fsycl-queries-fit-in-int, and is run as part of every kernel launch.

The previous implementation checked the size of each component of
the global range, local range, offset, and global range + offset, and also
checked the linearized version of each of these values.

The new implementation simplifies these checks, based on the following logic:

- The linear global range size must be >= every component of the global range.
  If the linear global range fits in int, we don't need to check anything else.

- Each value in the global range must be >= the value in the local range.
  If the global range fits in int, we don't need to check the local range.

- There is no need to check offset-related values if the offset is zero.

The new implementation also makes use of __builtin_mul_overflow where
available. This shifts the burden of maintaining fast code for these checks to
the compiler, and allows us to benefit from aggressive optimizations.

The new implementation could be optimized further if there was a quick way to
check whether an nd_range has an offset.

Signed-off-by: John Pennycook <[email protected]>
@Pennycook Pennycook added the performance Performance related issues label May 2, 2025
@Pennycook Pennycook requested a review from a team as a code owner May 2, 2025 12:53
@Pennycook Pennycook requested a review from uditagarwal97 May 2, 2025 12:53
@Pennycook Pennycook temporarily deployed to WindowsCILock May 2, 2025 12:54 — with GitHub Actions Inactive
@Pennycook Pennycook temporarily deployed to WindowsCILock May 2, 2025 13:26 — with GitHub Actions Inactive
@Pennycook Pennycook marked this pull request as draft May 2, 2025 13:34
@Pennycook Pennycook temporarily deployed to WindowsCILock May 2, 2025 13:36 — with GitHub Actions Inactive
@Pennycook Pennycook temporarily deployed to WindowsCILock May 2, 2025 13:36 — with GitHub Actions Inactive
The new implementation generates the same error message regardless of what
caused the overflow, so the test had to be updated with the new message.

I removed one of the tests because it is invalid: SYCL forbids launching
a global size of {1, 1} with a local size of anything other than {1, 1}.

Signed-off-by: John Pennycook <[email protected]>
@Pennycook Pennycook temporarily deployed to WindowsCILock May 2, 2025 13:47 — with GitHub Actions Inactive
@Pennycook Pennycook marked this pull request as ready for review May 2, 2025 14:00
@Pennycook Pennycook temporarily deployed to WindowsCILock May 2, 2025 14:00 — with GitHub Actions Inactive
@Pennycook
Copy link
Contributor Author

A note to reviewers: I had to remove a test, but the test was invalid. In SYCL, the global range must be divisible by the local range, and it makes no sense to enqueue a global range of {1, 1} with a local range that is outside the range of an int.

With my proposed changes, the test as written still throws an exception, it's just a different exception. Depending on the device, it will fail either because: the local range is too large for the device (e.g., PVC is limited to {1024, 1024, 1024}); or because the global range isn't divisible by the local range.

@Pennycook
Copy link
Contributor Author

The failing test looks unrelated to me.

Copy link
Contributor

@uditagarwal97 uditagarwal97 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Minor NITs

Co-authored-by: Udit Kumar Agarwal <[email protected]>
@Pennycook Pennycook temporarily deployed to WindowsCILock May 2, 2025 15:57 — with GitHub Actions Inactive
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Performance related issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants