Rearchitecture: Xe epilogue #621
Conversation
{
  static_assert(is_static_v<SubgroupTVLayout>, "Subgroup TV layout must be static");
  static_assert(is_rmem_v<Engine>, "Expected an rmem tensor");
  return make_subgroup_tensor(make_tensor(tensor.data(), tensor.layout()), tv_layout);
Isn't this the same as the static_cast? Why do you use static_cast in one case and this in the other?
Using a static_cast here induces a copy of the tensor data, rather than reusing the existing data, which is the intention here.
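To see the difference concretely, here is a minimal host-side CuTe sketch (illustrative only, not the PR's code; `view_vs_copy_sketch` is a hypothetical function name):

```cpp
#include <cute/tensor.hpp>
using namespace cute;

void view_vs_copy_sketch() {
  // An owning register-memory tensor.
  auto owned = make_tensor<float>(Layout<Shape<_4,_2>>{});

  // View: wraps the existing data pointer; no copy is made, and writes
  // through the view are visible in `owned`. This is what the code above does.
  auto view = make_tensor(owned.data(), owned.layout());
  view(0) = 1.0f;

  // Copy: materializes a separate owning tensor with its own storage,
  // which is the cost a static_cast-style conversion would incur.
  auto copied = make_tensor_like(owned);
  copy(owned, copied);
}
```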
__CUTE_REQUIRES(is_layout<SubgroupTVLayout>::value)>
CUTE_HOST_DEVICE
constexpr decltype(auto)
make_subgroup_tensor(Tensor<Engine,Layout>&& tensor, SubgroupTVLayout const&)
Why does this take an rvalue reference?
The idea is to introduce two flavors of make_subgroup_tensor. Given an lvalue reference, it makes a view of an existing rmem Tensor. Given an rvalue reference, it assumes ownership of the incoming Tensor's data.
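A sketch of the overload pair being described (signatures modeled on the diff above; the rvalue overload's body is omitted since it isn't shown in this hunk):

```cpp
// lvalue overload: make a non-owning view of the caller's rmem tensor and
// delegate to the rvalue overload below. No data is copied.
template <class Engine, class Layout, class SubgroupTVLayout,
          __CUTE_REQUIRES(is_layout<SubgroupTVLayout>::value)>
CUTE_HOST_DEVICE constexpr decltype(auto)
make_subgroup_tensor(Tensor<Engine,Layout>& tensor, SubgroupTVLayout const& tv_layout)
{
  static_assert(is_static_v<SubgroupTVLayout>, "Subgroup TV layout must be static");
  static_assert(is_rmem_v<Engine>, "Expected an rmem tensor");
  return make_subgroup_tensor(make_tensor(tensor.data(), tensor.layout()), tv_layout);
}

// rvalue overload: assumes ownership of the incoming tensor's data.
template <class Engine, class Layout, class SubgroupTVLayout,
          __CUTE_REQUIRES(is_layout<SubgroupTVLayout>::value)>
CUTE_HOST_DEVICE constexpr decltype(auto)
make_subgroup_tensor(Tensor<Engine,Layout>&& tensor, SubgroupTVLayout const& tv_layout);
```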
constexpr static bool is_m_major_C = detail::is_m_major<StrideC>();
constexpr static bool is_m_major_D = detail::is_m_major<StrideD>();
constexpr static bool is_source_supported = !is_void_v<ElementC>;
Nit: Nvidia uses "_needed" instead of "_supported". I think that's a better name.
I agree for destination. For source, it also depends on the fusion -- that check happens in operator().
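In other words (a sketch, assuming a CUTLASS-3.x-style fusion callback exposing is_C_load_needed(); the wrapper struct below is hypothetical):

```cpp
#include <cute/util/type_traits.hpp>

template <class ElementC, class FusionCallbacks>
struct source_policy_sketch {  // hypothetical, for illustration
  // "Supported" is a compile-time property of the epilogue's types...
  static constexpr bool is_source_supported = !cute::is_void_v<ElementC>;

  // ...while "needed" also depends on the fusion, so it is resolved at
  // runtime inside operator().
  static bool is_source_needed(FusionCallbacks const& callbacks) {
    if constexpr (is_source_supported) {
      return callbacks.is_C_load_needed();
    } else {
      return false;
    }
  }
};
```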
decltype(tile_shape(TiledMma()))>;
// GEMM Epilogue - loads & stores C/D matrices, performs epilogue operations & load/stores any
// auxiliary data required
using CollectiveEpilogue = cutlass::epilogue::collective::CollectiveEpilogue<
  EpilogueDispatchPolicy,
  TileShape,
  TiledMma,
This changes the epilogue API and binds the MMA to the epilogue. I think the epilogue is an independent component, not only for MMA, right? And what happens if "TiledMma == void" here?
The epilogue already requires the TiledMMA. Before it was an argument to operator() -- in this PR I made it a template parameter to the epilogue itself. Either way we must have the TiledMMA to understand what data the accumulator contains so we know where to write it to global memory.
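For reference, a sketch of that dependency using standard CuTe partitioning (illustrative; TileShape, gD, and thread_idx are assumed to be in scope, as in a typical epilogue):

```cpp
TiledMma tiled_mma;
auto thr_mma = tiled_mma.get_thread_slice(thread_idx);

// The per-thread accumulator fragment is shaped by the MMA's C layout...
auto tCrAcc = partition_fragment_C(tiled_mma, take<0,2>(TileShape{}));

// ...and the matching per-thread slice of the output tile can only be
// computed with the TiledMMA in hand.
auto tCgD = thr_mma.partition_C(gD);
```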
This PR builds on #573, adding a `CollectiveEpilogue` with support for the new block 2D copy atoms. The existing epilogue implementation was mostly rewritten, as it had many hardcoded assumptions and limitations; the new implementation removes all these restrictions.
Its API is also somewhat different, mostly in ways that more closely match the SM90 epilogues:

- Epilogue compute is performed in chunks whose size is given by a `ComputeVectorLen` constexpr variable. Currently this is set to operate on one MMA atom's worth of accumulator data at a time, but if we want to make it user-configurable like the NV epilogues (where it's a template parameter for the dispatch policy), that's possible.
- The TiledMMA is now a template parameter of the epilogue rather than an argument to `operator()` (see the compute sketch after this list).

The new implementation glues together C/D loads and compute with reorders, so it can support efficient data type and layout conversions outside of the epilogue computation.
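For example, the chunked compute might look like the sketch below (illustrative; tAcc, tC, tOut, and epilogue_op are assumed names, not the PR's):

```cpp
// Walk the accumulator fragment in ComputeVectorLen-sized chunks
// (currently one MMA atom's worth of values per chunk).
CUTE_UNROLL
for (int i = 0; i < size(tAcc); i += ComputeVectorLen) {
  CUTE_UNROLL
  for (int v = 0; v < ComputeVectorLen; ++v) {
    // Elementwise epilogue op within the chunk; data type and layout
    // conversions are handled by reorders outside this loop.
    tOut(i + v) = epilogue_op(tAcc(i + v), tC(i + v));
  }
}
```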