[Backend] Emit bar.warp.sync for barriers of 1 warp #7336

Merged
merged 2 commits into from
Jun 27, 2025

Mogball
Collaborator

@Mogball Mogball commented Jun 27, 2025

In warp specialized regions with only 1 warp, we can emit bar.warp.sync instead of barriers with a threadcount. This is slightly more efficient.
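The selection rule described above can be sketched as a small standalone helper. This is only an illustration, not the PR's actual lowering code: the function name `barrier_ptx`, the `barrier_id` parameter, and the choice of `barrier.sync` with a thread count for the multi-warp case are assumptions made for the example.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def barrier_ptx(num_warps: int, barrier_id: int = 0) -> str:
    """Return a PTX barrier instruction for a region of `num_warps` warps.

    Hypothetical helper for illustration; not part of Triton.
    """
    if num_warps < 1:
        raise ValueError("num_warps must be at least 1")
    if num_warps == 1:
        # A single warp only needs a warp-level sync across all 32 lanes.
        return "bar.warp.sync 0xffffffff;"
    # Assumed multi-warp form: a named barrier with an explicit thread count.
    return f"barrier.sync {barrier_id}, {num_warps * WARP_SIZE};"
```

Calling `barrier_ptx(1)` selects the warp-level sync, while `barrier_ptx(4, 1)` falls back to a named barrier counting 128 threads.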

@Mogball Mogball requested a review from ThomasRaoux June 27, 2025 00:40
@Mogball Mogball requested a review from ptillet as a code owner June 27, 2025 00:40
Collaborator

@ThomasRaoux ThomasRaoux left a comment

LGTM

@Mogball
Collaborator Author

Mogball commented Jun 27, 2025

Clearly something is wrong with this PR. It's not that important to dig into at the moment.

@Mogball Mogball closed this Jun 27, 2025
@peterbell10
Contributor

E RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable

I have seen the same issue on other PRs. I think the worker is flaky :/

@ThomasRaoux
Collaborator

> E RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
>
> I have seen the same issue on other PRs. I think the worker is flaky :/

GB200 has a lot of instability, sorry about that

@Mogball
Collaborator Author

Mogball commented Jun 27, 2025

Oh, well I guess this PR got really unlucky? It didn't happen on other PRs I opened.

@Mogball Mogball reopened this Jun 27, 2025
@Mogball
Collaborator Author

Mogball commented Jun 27, 2025

Let's try again...

@Jokeren
Contributor

Jokeren commented Jun 27, 2025

I wonder why using bar.warp.sync is more efficient? Is it just because barrier instructions are slower?

@Jokeren
Contributor

Jokeren commented Jun 27, 2025

Or is it just measurement noise?

@peterbell10
Contributor

bar.warp.sync becomes a no-op when ptxas knows statically that the warp is convergent, which is probably the case a lot of the time in Triton. In this example, the second bar.warp.sync becomes a BSYNC because it comes after a branch, but no BSYNC is emitted before the first branch:
https://gcc.godbolt.org/z/3rMPaenG9

@Mogball Mogball merged commit 677a30c into main Jun 27, 2025
16 of 18 checks passed
@Mogball Mogball deleted the mogball/ws branch June 27, 2025 20:40