[WIP][AutoWS] Improved partition scheduling pass #7312


Draft · wants to merge 1 commit into main

Conversation

@acollins3 (Contributor) commented Jun 25, 2025

Add a new automatic warp specialization partition analysis pass based on a dataflow graph and incremental, heuristic-driven partition merging.

The aim is to provide a more general approach to partition scheduling.

Note: this is not ready for review; just posting it for visibility, for those interested.
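To make the idea concrete, here is a rough sketch of what "incremental, heuristic-driven partition merging" over a dataflow graph could look like. This is not the PR's actual implementation; the names (`partition`, `score`, the threshold) are invented, and a real pass would operate on MLIR ops rather than plain Python objects:

```python
# Hypothetical sketch of heuristic-driven partition merging.
# Each node starts in its own partition; adjacent partitions are
# greedily merged while a heuristic score stays above a threshold.

def partition(nodes, edges, score, threshold=0.0):
    """nodes: hashable op labels; edges: (producer, consumer) pairs;
    score(a_members, b_members) -> float rates merging two partitions."""
    part = {n: i for i, n in enumerate(nodes)}  # node -> partition id

    def members(pid):
        return [n for n, p in part.items() if p == pid]

    while True:
        best = None
        for a, b in edges:  # only dataflow-adjacent partitions may merge
            pa, pb = part[a], part[b]
            if pa == pb:
                continue
            s = score(members(pa), members(pb))
            if best is None or s > best[0]:
                best = (s, pa, pb)
        if best is None or best[0] <= threshold:
            return part  # no profitable merge remains
        _, pa, pb = best
        for n, p in part.items():
            if p == pb:
                part[n] = pa  # merge partition pb into pa
```

For example, with a heuristic that refuses to merge anything with a load (so loads stay in their own warp group), `load -> mma -> store` ends up as two partitions: `{load}` and `{mma, store}`.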

New contributor declaration

  • I am not making a trivial change, such as fixing a typo in a comment.

  • I have written a PR description following these
    rules.

  • I have run pre-commit run --from-ref origin/main --to-ref HEAD.

  • Select one of the following.

    • I have added tests.
      • /test for lit tests
      • /unittest for C++ tests
      • /python/test for end-to-end tests
    • This PR does not need a test because FILL THIS IN.
  • Select one of the following.

    • I have not added any lit tests.
    • The lit tests I have added follow these best practices,
      including the "tests should be minimal" section. (Usually running Python code
      and using the instructions it generates is not minimal.)

@acollins3 acollins3 requested a review from ptillet as a code owner June 25, 2025 16:32
@acollins3 acollins3 marked this pull request as draft June 25, 2025 16:58
@Mogball (Collaborator) commented Jun 26, 2025

Hey, thanks so much for sharing this. I haven't looked at the code in detail, but the general approach of building a dataflow structure onto which "heuristics" can be applied as patterns conceptually makes sense to me. The ad-hoc partitioner I built in main applies a similar technique: it makes a first-pass assignment of ops to partitions based on some simple rules, then assigns each remaining op to its own cluster and merges and splits those clusters based on some simple rules.

I have two high-level comments on this:

  • Warp specialization should be a local transformation, not a whole-program transformation. Would it be possible to bound the partitioner to a single loop nest, or N adjacent loop nests?
  • In terms of planning, what is the timeline for altering main to use this new partition representation?

@acollins3 (Contributor, Author) commented:
  • Warp specialization should be a local transformation, not a whole-program transformation. Would it be possible to bound the partitioner to a single loop nest, or N adjacent loop nests?

I think it could very easily be limited in scope, e.g. to apply to each for op, by just constructing the data flow graph for that subset of the program.

However, why do we not want to perform it as a whole-program analysis? For example, for a non-persistent GEMM with an epilogue outside of the loop, limiting the analysis to just the loop would not pick up the epilogue.

  • In terms of planning, what is the timeline for altering main to use this new partition representation?

I think the main utility of this new pass is that it can assign individual for op iter args and if op results to warp groups, which I expect will mainly be useful to the aref-based passes that have not yet been merged into main. I think it is best to wait until merging those parts has progressed somewhat before enabling this pass.

@Mogball (Collaborator) commented Jun 27, 2025

However, why do we not want to perform it as a whole-program analysis? For example, for a non-persistent GEMM with an epilogue outside of the loop, limiting the analysis to just the loop would not pick up the epilogue.

It's not really an issue if the epilogue isn't picked up. It will still be executed after the warp specialized loop.

In general, limiting this to a local analysis is critical for composability with the rest of the compiler. Also, if you have multiple loops in the program with wildly different loop bodies, it will be very difficult to come up with a single partitioning scheme for the whole program. It is much easier to consider them on a case-by-case basis since it breaks up the problem.

@ThomasRaoux (Collaborator) commented Jun 27, 2025

However, why do we not want to perform it as a whole-program analysis? For example, for a non-persistent GEMM with an epilogue outside of the loop, limiting the analysis to just the loop would not pick up the epilogue.

It's not really an issue if the epilogue isn't picked up. It will still be executed after the warp specialized loop.

In general, limiting this to a local analysis is critical for composability with the rest of the compiler. Also, if you have multiple loops in the program with wildly different loop bodies, it will be very difficult to come up with a single partitioning scheme for the whole program. It is much easier to consider them on a case-by-case basis since it breaks up the problem.

Big +1 to what Jeff said. I don't understand what it would mean to include an epilogue that is not in a loop for WS. The point of WS is to overlap work; if the epilogue is outside the loop, I don't think there is anything to overlap it with.
I think it is important that we are on the same page and see warp specialization as a loop pipelining optimization.
