[WIP][AutoWS] Improved partition scheduling pass #7312


Draft · wants to merge 1 commit into main

Conversation

@acollins3 (Contributor) commented Jun 25, 2025

Add a new automatic warp specialization partition analysis pass based on a dataflow graph and incremental, heuristic-driven partition merging.

The aim is to provide a more general approach to partition scheduling.

Note: this is not ready for review; just posting it for visibility, for those interested.
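To make the idea concrete, here is a rough sketch of what "incremental, heuristic-driven partition merging" over a dataflow graph could look like. This is not the PR's actual implementation; the names (`partition`, `score`, the threshold) are invented, and a real pass would operate on MLIR ops rather than plain Python objects:

```python
# Hypothetical sketch of heuristic-driven partition merging.
# Each node starts in its own partition; adjacent partitions are
# greedily merged while a heuristic score stays above a threshold.

def partition(nodes, edges, score, threshold=0.0):
    """nodes: hashable op labels; edges: (producer, consumer) pairs;
    score(a_members, b_members) -> float rates merging two partitions."""
    part = {n: i for i, n in enumerate(nodes)}  # node -> partition id

    def members(pid):
        return [n for n, p in part.items() if p == pid]

    while True:
        best = None
        for a, b in edges:  # only dataflow-adjacent partitions may merge
            pa, pb = part[a], part[b]
            if pa == pb:
                continue
            s = score(members(pa), members(pb))
            if best is None or s > best[0]:
                best = (s, pa, pb)
        if best is None or best[0] <= threshold:
            return part  # no profitable merge remains
        _, pa, pb = best
        for n, p in part.items():
            if p == pb:
                part[n] = pa  # merge partition pb into pa
```

For example, with a heuristic that refuses to merge anything with a load (so loads stay in their own warp group), `load -> mma -> store` ends up as two partitions: `{load}` and `{mma, store}`.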

New contributor declaration

  • I am not making a trivial change, such as fixing a typo in a comment.

  • I have written a PR description following these
    rules.

  • I have run pre-commit run --from-ref origin/main --to-ref HEAD.

  • Select one of the following.

    • I have added tests.
      • /test for lit tests
      • /unittest for C++ tests
      • /python/test for end-to-end tests
    • This PR does not need a test because FILL THIS IN.
  • Select one of the following.

    • I have not added any lit tests.
    • The lit tests I have added follow these best practices,
      including the "tests should be minimal" section. (Usually running Python code
      and using the instructions it generates is not minimal.)

@acollins3 acollins3 requested a review from ptillet as a code owner June 25, 2025 16:32
@acollins3 acollins3 marked this pull request as draft June 25, 2025 16:58
@Mogball (Collaborator) commented Jun 26, 2025

Hey, thanks so much for sharing this. I haven't looked at the code in detail, but the general approach of building a dataflow structure onto which "heuristics" can be applied as patterns conceptually makes sense to me. The ad-hoc partitioner I built in main applies a similar technique: it makes a first-pass assignment of ops to partitions based on some simple rules, then assigns each remaining op to its own cluster and merges and splits those clusters based on some simple rules.

I have two high-level comments on this:

  • Warp specialization should be a local transformation, not a whole-program transformation. Would it be possible to bound the partitioner to a single loop nest, or N adjacent loop nests?
  • In terms of planning, what is the timeline for altering main to use this new partition representation?

@acollins3 (Contributor, Author) commented:
  • Warp specialization should be a local transformation, not a whole-program transformation. Would it be possible to bound the partitioner to a single loop nest, or N adjacent loop nests?

I think it could very easily be limited in scope, e.g. to apply to each for op, by just constructing the data flow graph for that subset of the program.

However, why do we not want to perform it as a whole-program analysis? For example, for a non-persistent GEMM with an epilogue outside of the loop, limiting the analysis to just the loop would not pick up the epilogue.

  • In terms of planning, what is the timeline for altering main to use this new partition representation?

I think the main utility of this new pass is that it can assign individual for op iter args and if op results to warp groups, which I expect will mainly be useful to the aref-based passes that have not yet been merged into main. I think it is best to wait until merging those parts has progressed somewhat before enabling this pass.

@Mogball (Collaborator) commented Jun 27, 2025

However, why do we not want to perform it as a whole-program analysis? For example, for a non-persistent GEMM with an epilogue outside of the loop, limiting the analysis to just the loop would not pick up the epilogue.

It's not really an issue if the epilogue isn't picked up. It will still be executed after the warp specialized loop.

In general, limiting this to a local analysis is critical for composability with the rest of the compiler. Also, if you have multiple loops in the program with wildly different loop bodies, it will be very difficult to come up with a single partitioning scheme for the whole program. It is much easier to consider them on a case-by-case basis since it breaks up the problem.

@ThomasRaoux (Collaborator) commented Jun 27, 2025

However, why do we not want to perform it as a whole-program analysis? For example, for a non-persistent GEMM with an epilogue outside of the loop, limiting the analysis to just the loop would not pick up the epilogue.

It's not really an issue if the epilogue isn't picked up. It will still be executed after the warp specialized loop.

In general, limiting this to a local analysis is critical for composability with the rest of the compiler. Also, if you have multiple loops in the program with wildly different loop bodies, it will be very difficult to come up with a single partitioning scheme for the whole program. It is much easier to consider them on a case-by-case basis since it breaks up the problem.

Big +1 to what Jeff said. I don't understand what it would mean to include an epilogue that is not in a loop for WS. The point of WS is to overlap work; if the epilogue is outside the loop, I don't think there is anything to overlap it with.
I think it is important that we are on the same page and see warp specialization as a loop pipelining optimization.
