[WIP][AutoWS] Improved partition scheduling pass #7312
Hey, thanks so much for sharing this. I haven't looked at the code in detail, but the general approach of building a dataflow structure onto which "heuristics" can be applied as patterns conceptually makes sense to me. The ad-hoc partitioner I built in main applies a similar technique: it makes a first-pass assignment of ops to partitions based on some simple rules, then assigns each remaining op to its own cluster and merges and splits those clusters based on some simple rules. I have two high-level comments on this:
I think it could very easily be limited in scope, e.g. to apply to each for op, by just constructing the data flow graph for that subset of the program. However, why do we not want to perform it as a whole-program analysis? For example, for a non-persistent GEMM with an epilogue outside of the loop, limiting the analysis to just the loop would not pick up the epilogue.
I think the main utility of this new pass is that it can assign individual for-op iter args and if-op results to warp groups, which I think will mainly be useful to the aref-based passes that have not yet been merged into main. I think it is best to wait until merging those parts has progressed somewhat before enabling this pass.
It's not really an issue if the epilogue isn't picked up. It will still be executed after the warp-specialized loop. In general, limiting this to a local analysis is critical for composability with the rest of the compiler. Also, if you have multiple loops in the program with wildly different loop bodies, it will be very difficult to come up with a single partitioning scheme for the whole program. It is much easier to consider them on a case-by-case basis since it breaks up the problem.
Big +1 to what Jeff said. I don't understand what it would mean to include an epilogue that is not in a loop for WS. The point of WS is to overlap work; if it is outside the loop, I don't think there is anything to overlap.
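As a rough illustration of the seed-then-cluster approach described earlier in the thread for the ad-hoc partitioner in main, here is a minimal Python sketch. It is not that partitioner's code: the op encoding, the `SEED_RULES` table, the `partition` function, and the "join the most-connected seeded partition" rule are all hypothetical, and the sketch only merges clusters (it never splits them). It builds the graph from the ops of a single loop body only, in line with the per-loop scope argued for above.

```python
from collections import Counter, defaultdict

# Hypothetical seeding rules: op kind -> partition it is pinned to up front.
SEED_RULES = {"load": "load", "mma": "mma"}


def partition(ops, edges):
    """Assign every op of one loop body to a partition (toy sketch).

    ops:   dict mapping op name -> op kind, e.g. {"x": "load"}.
    edges: list of (producer, consumer) data-flow edges between ops in `ops`.
    """
    # Pass 1: first-pass assignment of ops using simple per-op rules.
    assignment = {op: SEED_RULES[k] for op, k in ops.items() if k in SEED_RULES}

    # Pass 2: each remaining op starts in its own cluster (union-find forest).
    parent = {op: op for op in ops if op not in assignment}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Merge clusters of unassigned ops that are directly connected.
    for u, v in edges:
        if u in parent and v in parent:
            parent[find(u)] = find(v)

    # Count data-flow edges between each cluster and each seeded partition.
    votes = defaultdict(Counter)
    for u, v in edges:
        for a, b in ((u, v), (v, u)):
            if a in parent and b in assignment:
                votes[find(a)][assignment[b]] += 1

    # Each cluster joins its most-connected seeded partition, else "default".
    for op in parent:
        counts = votes[find(op)]
        assignment[op] = counts.most_common(1)[0][0] if counts else "default"
    return assignment
```

For a toy GEMM-like loop body where an index op `idx` feeds two loads and the MMA result flows through `scale` and `trunc`, `idx` joins the load partition, `scale` and `trunc` form one cluster that follows the MMA, and an op with no data-flow edges stays in `"default"`.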
Adds a new automatic warp specialization partition analysis pass based on a data flow graph and incremental, heuristic-driven partition merging.
The aim is to provide a more general approach to partition scheduling.
Note this is not ready for review. Just posting for exposure for those interested.
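A minimal Python sketch of incremental, heuristic-driven partition merging over a data flow graph. This is a toy under stated assumptions, not the pass in this PR: the `RESOURCE` table, the two heuristics (`same_resource`, `absorb_alu`), their scores, and `merge_partitions` are all hypothetical. Every op starts in its own partition; on each step every heuristic scores every pair of partitions, the highest-scoring pair is merged, and merging stops once no heuristic applies.

```python
from itertools import combinations

# Hypothetical mapping from op kind to the hardware resource it mainly uses.
RESOURCE = {"load": "mem", "store": "mem", "mma": "tc",
            "add": "alu", "mul": "alu", "convert": "alu"}


def resources(ops, part):
    return {RESOURCE[ops[op]] for op in part}


def adjacent(edges, p, q):
    # True if some data-flow edge connects an op in p with an op in q.
    return any((u in p and v in q) or (u in q and v in p) for u, v in edges)


def same_resource(ops, edges, p, q):
    """Score 2: merge two partitions using the same single non-ALU resource."""
    rp, rq = resources(ops, p), resources(ops, q)
    return 2 if len(rp) == 1 and rp == rq and rp != {"alu"} else None


def absorb_alu(ops, edges, p, q):
    """Score 1: fold a pure-ALU partition into an adjacent non-pure-ALU one."""
    rp, rq = resources(ops, p), resources(ops, q)
    if (rp == {"alu"}) == (rq == {"alu"}):
        return None
    return 1 if adjacent(edges, p, q) else None


def merge_partitions(ops, edges, heuristics):
    # One partition per op; a list keeps iteration order deterministic.
    parts = [frozenset([op]) for op in ops]
    while True:
        best = None  # (score, i, j) of the best-scoring pair found so far
        for i, j in combinations(range(len(parts)), 2):
            for h in heuristics:
                score = h(ops, edges, parts[i], parts[j])
                if score is not None and (best is None or score > best[0]):
                    best = (score, i, j)
        if best is None:  # no heuristic applies to any pair: fixed point
            return parts
        _, i, j = best
        merged = parts[i] | parts[j]
        del parts[j]  # j > i, so index i is unaffected by this deletion
        parts[i] = merged
```

On the same toy loop body (an index op feeding two loads, both loads feeding an MMA, and the MMA result flowing through two elementwise ops), this converges to two partitions: the loads plus their index computation, and the MMA plus its downstream elementwise ops. Because each heuristic is a plain function returning a score or `None`, new patterns can be added without touching the merge loop; the quadratic pair scan is only meant for illustration.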
New contributor declaration
I am not making a trivial change, such as fixing a typo in a comment.
I have written a PR description following these rules.
I have run pre-commit run --from-ref origin/main --to-ref HEAD.
Select one of the following.
/test for lit tests
/unittest for C++ tests
/python/test for end-to-end tests
FILL THIS IN
Select one of the following.
lit tests.
lit tests I have added follow these best practices, including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)