Merge OpenAI Triton commit 0a8e3cc #3565

Merged
merged 11 commits into from
Feb 27, 2025

@whitneywhtsang whitneywhtsang commented Feb 27, 2025

This PR changes the Triton base from 63cecbd to 0a8e3cc (Feb 24).
Pass rate: 97.65%->89.74% (#3307)

Please do not squash and merge this PR.

lezcano and others added 9 commits February 24, 2025 18:23
…002)

There were a couple things left to clean up after
triton-lang/triton#5840.
Now we provide a common API in terms of RankedTensorType.
For larger tiles, the Ping Pong Scheduler reorders memory and compute
operations into slices. However, the current implementation incorrectly
assumes that it is always legal/safe to move the second dot product
after the current local load prefetch. This breaks for persistent matmul
kernels, which may contain an epilogue that uses the result multiple
times in the loop: the reordering then produces invalid code.

A more robust way to handle this situation is to move the prefetch
before the epilogue for these kernels, so that the result of the dot
product is always available at the same point in the code.

---------

Co-authored-by: Nick Riasanovsky <[email protected]>
This PR improves the warp distribution logic for FA kernels:
1. Always choose warpsPerCTA=[numWarps, 1] for the 1st dot.
2. For the 2nd dot, distribute warps along dim0 first, then dim1.

This reduces register pressure for FA kernels with a large output head
size.
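
The two warp distribution rules can be sketched in plain Python. This is only an illustration, not the compiler's code: the function names, the `inst_tile` parameter, and the rule that dim0 stops receiving warps once every instruction tile along dim0 has its own warp are assumptions made for this example. The commit message states only the ordering (dim0 first, then dim1). `num_warps` is assumed to be a power of two.

```python
def warps_per_cta_first_dot(num_warps):
    """Rule 1: the 1st dot always places every warp along dim0."""
    return [num_warps, 1]


def warps_per_cta_second_dot(num_warps, shape, inst_tile):
    """Rule 2 (sketch): fill dim0 first, then give the rest to dim1.

    shape     -- (M, N) of the dot result (hypothetical parameter)
    inst_tile -- (m, n) covered by one warp-level instruction (hypothetical)
    """
    tiles_dim0 = max(1, shape[0] // inst_tile[0])
    warps_dim0 = 1
    # Assumed saturation rule: keep doubling along dim0 while warps remain
    # and there are still uncovered tiles along dim0.
    while warps_dim0 * 2 <= num_warps and warps_dim0 * 2 <= tiles_dim0:
        warps_dim0 *= 2
    # Whatever is left goes to dim1.
    return [warps_dim0, num_warps // warps_dim0]
```

For example, under these assumptions `warps_per_cta_second_dot(8, (128, 256), (32, 32))` places four warps along dim0 and the remaining two along dim1.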
Ops in the prologue/epilogue can't be hoisted by LICM after the loop is
flattened, so run LICM on the outer loop before flattening. We still
don't want to run LICM on the inner loop because it can significantly
increase live ranges.
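
A small Python analogy may make the ordering constraint concrete. It is not Triton IR: `nested_loop`, `hoisted_then_flattened`, and `load_scale` are hypothetical names. Once the nest is flattened, the prologue and epilogue sit behind conditions on the inner index, so an outer-loop-invariant prologue op has to be hoisted beforehand.

```python
def nested_loop(num_tiles, k_steps, load_scale):
    """Original nest: the prologue op is re-executed for every tile."""
    results = []
    for tile in range(num_tiles):
        scale = load_scale()         # prologue, invariant across tiles
        acc = 0
        for k in range(k_steps):     # inner loop
            acc += tile * k
        results.append(acc * scale)  # epilogue
    return results


def hoisted_then_flattened(num_tiles, k_steps, load_scale):
    """Hoist the invariant op out of the outer loop, then flatten the nest."""
    scale = load_scale()             # hoisted once, before flattening
    results = []
    acc = 0
    for i in range(num_tiles * k_steps):
        tile, k = divmod(i, k_steps)
        if k == 0:                   # former prologue, now conditional
            acc = 0
        acc += tile * k
        if k == k_steps - 1:         # former epilogue, now conditional
            results.append(acc * scale)
    return results
```

Both functions return the same values, but the second calls `load_scale` once instead of once per tile.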
We need to make sure threads in a pair are both active
and the address is aligned to 4 bytes.

---------

Signed-off-by: Ilya Veselov <[email protected]>
Co-authored-by: Lei Zhang <[email protected]>
Co-authored-by: Shucai Xiao <[email protected]>
Adds LLVM debug messages when the Ping Pong Scheduler
fails to execute for a reason that is not easy to determine
statically.

---------

Co-authored-by: Nick Riasanovsky <[email protected]>
This PR adds support for the following cases in dot_scaled:
- mxfp8 (both fp8 and bf8) x mxfp8
- mxfp8 x mxfp4 in any order
- scale of either or both operands can be None
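
As a rough way to see how many combinations this list covers, the cases can be enumerated in plain Python. The helper names are hypothetical. The mapping of fp8 to `e4m3`, bf8 to `e5m2`, and mxfp4 to `e2m1` is an assumption based on the usual format naming. The enumeration covers only the cases listed above and says nothing about combinations supported elsewhere.

```python
from itertools import product

# Assumed format names: fp8 -> "e4m3", bf8 -> "e5m2", mxfp4 -> "e2m1".
MXFP8 = ("e4m3", "e5m2")
MXFP4 = ("e2m1",)


def listed_format_pairs():
    """(lhs_format, rhs_format) pairs named in the list above."""
    pairs = set(product(MXFP8, MXFP8))   # mxfp8 x mxfp8
    pairs |= set(product(MXFP8, MXFP4))  # mxfp8 x mxfp4
    pairs |= set(product(MXFP4, MXFP8))  # mxfp4 x mxfp8 (any order)
    return sorted(pairs)


def listed_cases():
    """Each format pair combined with scale present/absent per operand."""
    return [
        (lhs, lhs_has_scale, rhs, rhs_has_scale)
        for lhs, rhs in listed_format_pairs()
        for lhs_has_scale, rhs_has_scale in product((True, False), repeat=2)
    ]
```

An enumeration like this could drive a parametrized test that sweeps every listed format and scale combination.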
This PR factors `TritonIntegerRangeAnalysis` out of `ConvertBufferOps`
into a standalone analysis that can be reused in other passes.
@whitneywhtsang whitneywhtsang marked this pull request as ready for review February 27, 2025 19:49
@whitneywhtsang whitneywhtsang merged commit 1ddca06 into main Feb 27, 2025
8 of 10 checks passed
@whitneywhtsang whitneywhtsang deleted the whitneywhtsang/merge branch February 27, 2025 21:31