Conversation

@petercad petercad commented Nov 10, 2025

This PR builds on #573, adding a CollectiveEpilogue with support for the new block 2D copy atoms.

The existing epilogue implementation was mostly rewritten, as it had many hardcoded assumptions and limitations:

  • Subgroups own a contiguous tile within the workgroup tile
  • Subgroup tiles are laid out n-major within the workgroup tile
  • C/D atoms have the same block size
  • One copy atom of data is processed at a time
  • C/D atoms must bring data in the exact same layout as the accumulator

The new implementation removes all these restrictions.

Its API is also somewhat different, mostly in ways that more closely match the SM90 epilogues:

  • Configurable EpilogueTile template parameter controls the block size for epilogue computation.
  • Fusion callbacks receive workgroup-scope tiling information, not subgroup-scope tiling information (because CuTe's TiledMMA is very flexible -- the subgroup "tile" may not be contiguous).
  • Vectorization for the epilogue compute operations is configurable via the ComputeVectorLen constexpr variable. Currently this is set to operate on one MMA atom's worth of accumulator data at a time, but if we want to make it user-configurable like the NV epilogues (where it's a template parameter for the dispatch policy), that's possible.
  • It receives the TiledMMA as a template parameter rather than an argument to operator().
  • The S2R/R2S copy operation parameters are omitted (a difference vs. SM90) as they are irrelevant to both the old and new epilogue implementation.

The new implementation glues together C/D loads and compute with reorders, so it can support efficient data type and layout conversions outside of the epilogue computation.

{
  static_assert(is_static_v<SubgroupTVLayout>, "Subgroup TV layout must be static");
  static_assert(is_rmem_v<Engine>, "Expected an rmem tensor");
  return make_subgroup_tensor(make_tensor(tensor.data(), tensor.layout()), tv_layout);
}

Isn't this the same as the static_cast? Why do you use static_cast in one case and this in the other?

Author

Using a static_cast here induces a copy of the tensor data, rather than reusing the existing data, which is the intention here.

__CUTE_REQUIRES(is_layout<SubgroupTVLayout>::value)>
CUTE_HOST_DEVICE
constexpr decltype(auto)
make_subgroup_tensor(Tensor<Engine,Layout>&& tensor, SubgroupTVLayout const&)

Why does this take an rvalue-ref?

Author

The idea is to introduce two flavors of make_subgroup_tensor. Given an lvalue reference, it makes a view of an existing rmem Tensor. Given an rvalue reference, it assumes ownership of the incoming Tensor's data.


constexpr static bool is_m_major_C = detail::is_m_major<StrideC>();
constexpr static bool is_m_major_D = detail::is_m_major<StrideD>();
constexpr static bool is_source_supported = !is_void_v<ElementC>;

Nit: Nvidia uses "_needed" instead of "_supported". I think that's a better name.

Author

I agree for destination. For source, it also depends on the fusion -- that check happens in operator().

decltype(tile_shape(TiledMma()))>;
// GEMM Epilogue - loads & stores C/D matrices, performs epilogue operations & load/stores any
// auxiliary data required
using CollectiveEpilogue = cutlass::epilogue::collective::CollectiveEpilogue<
EpilogueDispatchPolicy,
TileShape,
TiledMma,

This changes the epilogue API and binds the MMA to the epilogue. I think the epilogue is an independent component, not only for MMA, right? And what happens if "TiledMma == void" here?

Author

@petercad petercad Nov 12, 2025

The epilogue already requires the TiledMMA. Before it was an argument to operator() -- in this PR I made it a template parameter to the epilogue itself. Either way we must have the TiledMMA to understand what data the accumulator contains so we know where to write it to global memory.

@tdeng5 tdeng5 added the release label Nov 11, 2025
@petercad petercad force-pushed the petercad/new_epilogue branch from f6f793e to f43ee5f on November 12, 2025 at 17:08
@petercad petercad force-pushed the petercad/new_epilogue branch from f43ee5f to 9cf2998 on November 12, 2025 at 18:38
@tdeng5 tdeng5 removed the release label Nov 13, 2025
5 participants