Merge OpenAI Triton commit 0a8e3cc
#3565
Merged
…002) There were a couple things left to clean up after triton-lang/triton#5840. Now we provide a common API in terms of RankedTensorType.
For larger tiles, the Ping Pong Scheduler reorders memory and compute operations into slices. However, the current implementation incorrectly assumes that it is always legal to move the second dot product after the current local load prefetch. This breaks persistent matmul kernels whose epilogue uses the dot result multiple times inside the loop, and the result is invalid code. A more robust way to handle this is to move the prefetch before the epilogue for these kernels, so that the result of the dot product is always available at the same point in the code. --------- Co-authored-by: Nick Riasanovsky <[email protected]>
This PR improves the warp distribution logic for FA kernels:
1. Always choose warpsPerCTA=[numWarps, 1] for the 1st dot.
2. For the 2nd dot, distribute warps along dim0 first, then dim1.
This reduces register pressure for FA kernels with a large output head size.
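The two rules above can be sketched as a small helper. This is a hypothetical illustration, not the code from the PR: the function name `warps_per_cta`, the `is_first_dot` flag, and the per-warp tile size `tile` are assumptions, and the sketch assumes `num_warps` is a power of two.

```python
def warps_per_cta(shape, num_warps, is_first_dot, tile=(16, 16)):
    """Hypothetical sketch of the warp distribution rules.

    shape: (M, N) of the dot result.
    tile:  assumed (M, N) extent covered by a single warp.
    """
    if is_first_dot:
        # Rule 1: the 1st dot always places every warp along dim0.
        return [num_warps, 1]

    # Rule 2: for the 2nd dot, fill dim0 first, then spill into dim1.
    warps = [1, 1]
    while warps[0] * warps[1] < num_warps:
        if warps[0] * tile[0] < shape[0]:
            warps[0] *= 2  # dim0 still has rows not covered by a warp
        else:
            warps[1] *= 2  # dim0 is covered, so use dim1
    return warps
```

With these assumptions, a 32x256 result and 8 warps gives two warps along dim0 and four along dim1, while the 1st dot keeps all 8 warps on dim0.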
Ops in the prologue/epilogue can't be hoisted by LICM after the loop is flattened, so run LICM on the outer loop before flattening. We still don't want to run LICM on the inner loop because it can significantly increase live ranges.
We need to make sure threads in a pair are both active and the address is aligned to 4 bytes. --------- Signed-off-by: Ilya Veselov <[email protected]> Co-authored-by: Lei Zhang <[email protected]> Co-authored-by: Shucai Xiao <[email protected]>
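The condition described above can be sketched as a per-lane predicate. This is a minimal illustration under stated assumptions, not the check from the commit: the name `pair_is_eligible` is hypothetical, a pair is assumed to be two adjacent lanes (even, odd), and the alignment is assumed to be checked on the byte address of the even lane.

```python
def pair_is_eligible(addresses, active, lane):
    """Hypothetical check: may this lane's pair take the paired path?

    addresses: byte address per lane.
    active:    True for each lane that is not masked off.
    """
    first = lane - (lane % 2)  # even lane of the pair (assumed pairing)
    second = first + 1
    both_active = active[first] and active[second]
    aligned = addresses[first] % 4 == 0  # 4-byte alignment of the pair
    return both_active and aligned
```

If either condition fails, the pair would presumably fall back to handling each thread separately; that fallback is not shown here.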
Adds LLVM Debug messages when the Ping Pong Scheduler fails to execute for a reason that is not easy to determine statically. --------- Co-authored-by: Nick Riasanovsky <[email protected]>
This PR adds support for the following cases in dot_scaled:
- mxfp8 (both fp8 and bf8) x mxfp8
- mxfp8 x mxfp4, in either order
- the scale of either or both operands can be None
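The listed operand combinations can be written down as a small predicate. This is only a sketch of the list above: the helper `is_listed_case` is hypothetical, and the mapping of fp8 to `"e4m3"`, bf8 to `"e5m2"`, and mxfp4 to `"e2m1"` is an assumption. Combinations not named in the list are simply reported as not listed, which says nothing about whether they are supported elsewhere.

```python
MXFP8 = {"e4m3", "e5m2"}  # assumed format names for fp8 and bf8
MXFP4 = {"e2m1"}          # assumed format name for mxfp4


def is_listed_case(lhs_format, rhs_format):
    """Hypothetical check against the operand cases listed above."""
    both_fp8 = lhs_format in MXFP8 and rhs_format in MXFP8
    mixed = (lhs_format in MXFP8 and rhs_format in MXFP4) or (
        lhs_format in MXFP4 and rhs_format in MXFP8
    )
    return both_fp8 or mixed
```

The scales are left out of the predicate because, per the list, either or both may be None in these cases.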
This PR factors `TritonIntegerRangeAnalysis` out of `ConvertBufferOps` into a standalone analysis that can be reused in other passes.
pbchekin approved these changes Feb 27, 2025
anmyachev approved these changes Feb 27, 2025
0d80173 to d2d8be2
Signed-off-by: Whitney Tsang <[email protected]>
4f7f374 to 61625fc
This PR changes the Triton base from 63cecbd to 0a8e3cc (Feb 24).
Pass rate: 97.65%->89.74% (#3307)
Please do not squash and merge this PR.