Conversation

@petercad petercad commented Nov 10, 2025

This PR builds on #573, adding a CollectiveEpilogue with support for the new block 2D copy atoms.

The existing epilogue implementation was mostly rewritten, as it had many hardcoded assumptions and limitations:

  • Subgroups own a contiguous tile within the workgroup tile
  • Subgroup tiles are laid out n-major within the workgroup tile
  • C/D atoms have the same block size
  • One copy atom of data is processed at a time
  • C/D atoms must bring data in the exact same layout as the accumulator

The new implementation removes all these restrictions.

Its API is also somewhat different, mostly in ways that more closely match the SM90 epilogues:

  • Configurable EpilogueTile template parameter controls the block size for epilogue computation.
  • Fusion callbacks receive workgroup-scope tiling information, not subgroup-scope tiling information (because CuTe's TiledMMA is very flexible -- the subgroup "tile" may not be contiguous).
  • Vectorization for the epilogue compute operations is configurable via the ComputeVectorLen constexpr variable. Currently this is set to operate on one MMA atom's worth of accumulator data at a time, but if we want to make it user-configurable like the NV epilogues (where it's a template parameter for the dispatch policy), that's possible.
  • It receives the TiledMMA as a template parameter rather than an argument to operator().
  • The S2R/R2S copy operation parameters are omitted (a difference vs. SM90) as they are irrelevant to both the old and new epilogue implementation.

The new implementation glues together C/D loads and compute with reorders, so it can support efficient data type and layout conversions outside of the epilogue computation.

{
  static_assert(is_static_v<SubgroupTVLayout>, "Subgroup TV layout must be static");
  static_assert(is_rmem_v<Engine>, "Expected an rmem tensor");
  return make_subgroup_tensor(make_tensor(tensor.data(), tensor.layout()), tv_layout);
}

Isn't this the same as the static_cast? Why do you use static_cast in one case and this in the other?

Author

Using a static_cast here induces a copy of the tensor data, rather than reusing the existing data, which is the intention here.

__CUTE_REQUIRES(is_layout<SubgroupTVLayout>::value)>
CUTE_HOST_DEVICE
constexpr decltype(auto)
make_subgroup_tensor(Tensor<Engine,Layout>&& tensor, SubgroupTVLayout const&)

Why does this take an rvalue-ref?

Author

The idea is to introduce two flavors of make_subgroup_tensor. Given an lvalue reference, it makes a view of an existing rmem Tensor. Given an rvalue reference, it assumes ownership of the incoming Tensor's data.


constexpr static bool is_m_major_C = detail::is_m_major<StrideC>();
constexpr static bool is_m_major_D = detail::is_m_major<StrideD>();
constexpr static bool is_source_supported = !is_void_v<ElementC>;

Nit: Nvidia uses "_needed" instead of "_supported". I think that's a better name.

Author

I agree for destination. For source, it also depends on the fusion -- that check happens in operator().

decltype(tile_shape(TiledMma()))>;
// GEMM Epilogue - loads & stores C/D matrices, performs epilogue operations & load/stores any
// auxiliary data required
using CollectiveEpilogue = cutlass::epilogue::collective::CollectiveEpilogue<
EpilogueDispatchPolicy,
TileShape,
TiledMma,

This changes the epilogue API and binds the MMA to the epilogue. I think the epilogue is an independent component, not only for MMA, right? And what happens if "TiledMma == void" here?

Author

@petercad petercad Nov 12, 2025

The epilogue already requires the TiledMMA. Before it was an argument to operator() -- in this PR I made it a template parameter to the epilogue itself. Either way we must have the TiledMMA to understand what data the accumulator contains so we know where to write it to global memory.

@tdeng5 tdeng5 added the release label Nov 11, 2025
@petercad petercad force-pushed the petercad/new_epilogue branch from f6f793e to f43ee5f on November 12, 2025 at 17:08
@petercad petercad force-pushed the petercad/new_epilogue branch from f43ee5f to 9cf2998 on November 12, 2025 at 18:38
@tdeng5 tdeng5 removed the release label Nov 13, 2025
5 participants