[AMD] Introduce specialized Allocation pass #7328

alefimov-amd · 2025-06-26T16:24:11Z

This PR introduces an AMD-specific allocation pass and a new attribute that controls the scratch-pad conversion method: padded or swizzled.
For now, the OptimizeLDSUsage pass sets all convert_layout operations to padded mode.

alefimov-amd · 2025-06-26T17:01:19Z

The general swizzling conversion consumes a lot more shared memory, which is a problem on MI30x and older architectures.

The idea is to support both variants in the AMD backend:

- Use the swizzling pattern by default.
- The OptimizeLDSUsage pass analyzes LDS consumption and can add a special operation attribute marking operations that should be padded instead of swizzled.
- A dedicated AMD allocation pass and conversion patterns are applied to operations that carry the attribute; all other operations fall back to the common implementation.

This PR adds only the part related to allocation analysis. The conversion pattern implementation is in progress.
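
The scheme described above can be sketched in plain C++ without any MLIR dependency. This is only an illustration under stated assumptions: ToyOp, ScratchMode, selectScratchMode, and tagIfOverBudget are hypothetical names, and the budget-based tagging rule is an assumed policy, not the actual OptimizeLDSUsage logic. Only the attribute string is taken from the diff in this PR.

```cpp
#include <cstdint>
#include <set>
#include <string>

// Hypothetical stand-in for an MLIR operation: just a bag of attribute names.
struct ToyOp {
  std::set<std::string> attrs;
  bool hasAttr(const std::string &name) const { return attrs.count(name) != 0; }
};

// Attribute name as it appears in this PR's diff.
constexpr char AttrSharedMemPadded[] = "amdgpu.shared_mem_padded";

enum class ScratchMode { Swizzled, Padded };

// Swizzling is the default; only ops carrying the attribute use padding.
ScratchMode selectScratchMode(const ToyOp &op) {
  return op.hasAttr(AttrSharedMemPadded) ? ScratchMode::Padded
                                         : ScratchMode::Swizzled;
}

// Assumed tagging policy (not the real pass): mark the op as padded when the
// swizzled scratch buffer would not fit into the given LDS budget.
void tagIfOverBudget(ToyOp &op, uint64_t swizzledBytes, uint64_t ldsBudget) {
  if (swizzledBytes > ldsBudget)
    op.attrs.emplace(AttrSharedMemPadded);
}
```

An allocation pass and a conversion pattern that both call the same selection function cannot disagree about which method an operation uses, which is the alignment this PR is after.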

ThomasRaoux

I wonder if this is something we should make common to all backends but I have to admit I don't understand how this controls code generation right now

ThomasRaoux · 2025-06-26T17:17:24Z

third_party/amd/lib/Analysis/AMDGPUAllocation.cpp

+}
+
+unsigned AMDAllocationAnalysisScratchSizeFn(Operation *op) {
+  if (op->hasAttr(AttrSharedMemPadded)) {


wouldn't that affect the codegen? I don't see any changes there?

For AMD we still use the old pattern, which uses padded memory, so this should be safe.

alefimov-amd · 2025-06-26T17:34:13Z

@ThomasRaoux hi

I wonder if this is something we should make common to all backends but I have to admit I don't understand how this controls code generation right now

This particular PR does not affect codegen; it affects only the allocation analysis.
The problem I want to solve in this PR is to align analysis and codegen for AMD.

The analysis pessimistically allocates the maximum of the swizzled and padded memory sizes: https://github.com/triton-lang/triton/blob/main/lib/Analysis/Allocation.cpp#L206

Codegen uses the old padded pattern for AMD: https://github.com/triton-lang/triton/blob/main/lib/Conversion/TritonGPUToLLVM/ConvertLayoutOpToLLVM.cpp#L287
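
To illustrate the mismatch, here is a minimal sketch of the two sizing rules. ScratchSizes and the three functions are hypothetical; the real analysis computes the sizes from tensor layouts rather than taking them as inputs.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical per-op scratch sizes in bytes for the two conversion methods.
struct ScratchSizes {
  uint64_t paddedBytes;
  uint64_t swizzledBytes;
};

// Pessimistic sizing: reserve enough for whichever method needs more.
uint64_t pessimisticScratchBytes(const ScratchSizes &s) {
  return std::max(s.paddedBytes, s.swizzledBytes);
}

// Sizing aligned with a codegen that only ever emits the padded pattern.
uint64_t paddedOnlyScratchBytes(const ScratchSizes &s) {
  return s.paddedBytes;
}

// Bytes reserved by the pessimistic rule that padded codegen never touches.
uint64_t wastedBytes(const ScratchSizes &s) {
  return pessimisticScratchBytes(s) - paddedOnlyScratchBytes(s);
}
```

Whenever the swizzled size exceeds the padded one, the pessimistic rule reserves shared memory that the padded lowering never uses.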

alefimov-amd · 2025-06-26T17:39:57Z

I wonder if this is something we should make common

I am not sure this should be in common code. It seems the NVIDIA backend mostly uses swizzling for everything and is getting rid of the old patterns that do not fit linear layouts.

+cc @antiagainst

ThomasRaoux · 2025-06-26T17:46:55Z

@ThomasRaoux hi

I wonder if this is something we should make common to all backends but I have to admit I don't understand how this controls code generation right now

This particular PR does not affect codegen; it affects only the allocation analysis. The problem I want to solve in this PR is to align analysis and codegen for AMD.

The analysis pessimistically allocates the maximum of the swizzled and padded memory sizes: https://github.com/triton-lang/triton/blob/main/lib/Analysis/Allocation.cpp#L206

Codegen uses the old padded pattern for AMD: https://github.com/triton-lang/triton/blob/main/lib/Conversion/TritonGPUToLLVM/ConvertLayoutOpToLLVM.cpp#L287

Ah, I see, so the control should ensure that we use the padded pass in the lowering, though? Otherwise we just rely on implicit assumptions.
No need to interrupt your work, but maybe @lezcano has some ideas to generalize that a bit. I think it would be good to make it more explicit how the temp allocation is going to be made, to avoid relying on an implicit handshake.

lezcano · 2025-06-27T01:21:36Z

Yes, in the NVIDIA backend we never use padding. We always use swizzling or the stmatrix pass (which I'm working to integrate into the swizzling pass).

I would suggest that AMD also use swizzling if possible, given that it's more efficient in general and allows generating swizzled layouts that can be lowered with special instructions like ldmatrix/stmatrix (not sure if AMD has special instructions like these).

If this is not possible, AMD might want to end up moving the padding codegen to the AMD folder, as it will not be needed for NVIDIA.

lezcano · 2025-06-27T01:21:53Z

third_party/amd/lib/Analysis/AMDGPUAllocation.cpp

+    auto scratchConfig = getScratchConfigForCvt(srcTy, dstTy);
+    elems = getNumScratchElements(scratchConfig.paddedRepShape);
+  } else {
+    // TODO use swizzling


assert that this path is not taken?
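
One possible shape of such a guard, sketched outside of MLIR. Here numScratchElements and scratchElements are hypothetical stand-ins for the helpers in the diff, and the boolean isPadded stands in for the attribute check. The sketch uses a plain assert; LLVM-based code might prefer llvm_unreachable instead.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <numeric>
#include <vector>

// Product of the scratch buffer dimensions; an empty shape means no scratch.
uint64_t numScratchElements(const std::vector<uint64_t> &shape) {
  if (shape.empty())
    return 0;
  return std::accumulate(shape.begin(), shape.end(), uint64_t{1},
                         std::multiplies<uint64_t>());
}

// Toy scratch-size function: only the padded path is implemented.
uint64_t scratchElements(bool isPadded,
                         const std::vector<uint64_t> &paddedRepShape) {
  if (isPadded)
    return numScratchElements(paddedRepShape);
  // Swizzled sizing is not implemented in this sketch, so fail loudly in
  // debug builds instead of silently returning a wrong size.
  assert(false && "swizzled scratch sizing is not implemented yet");
  return 0;
}
```

Note that a plain assert is compiled out when NDEBUG is defined, so in release builds this sketch would fall through and return 0.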

alefimov-amd · 2025-06-27T15:21:04Z

If this is not possible, AMD might want to end up moving the padding codegen to the AMD folder, as it will not be needed for NVIDIA.

Right, this is what I want to implement for now: keep the padding-related code in the AMD backend, at least until we manage to lower the memory consumption of the swizzling pattern.

antiagainst · 2025-06-27T18:51:06Z

include/triton/Conversion/TritonGPUToLLVM/AllocateSharedMemoryUtility.h

+
+namespace mlir::triton::gpu {
+
+void fillAllocationInfo(ModuleOp mod, ModuleAllocation &allocation);


Maybe naming it attachAllocationSizeAndOffsetAttr would be clearer. Also, can you add documentation to this function?

antiagainst · 2025-06-27T18:55:11Z

test/Conversion/amd/allocate_shared_memory.mlir

+#blocked1 = #ttg.blocked<{sizePerThread = [8, 4], threadsPerWarp = [8, 8], warpsPerCTA = [4, 1], order = [1, 0]}>
+#blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [8, 8], warpsPerCTA = [4, 1], order = [1, 0]}>
+
+// CHECK: ttg.shared = 36864 : i32


Can you explain in comments how this number is computed, so it is easier to understand and update later?

antiagainst · 2025-06-27T18:55:16Z

test/Conversion/amd/allocate_shared_memory.mlir

+#blocked1 = #ttg.blocked<{sizePerThread = [8, 4], threadsPerWarp = [8, 8], warpsPerCTA = [4, 1], order = [1, 0]}>
+#blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [8, 8], warpsPerCTA = [4, 1], order = [1, 0]}>
+
+// CHECK: ttg.shared = 131072 : i32


Similarly here.

antiagainst · 2025-06-27T19:03:08Z

third_party/amd/include/Analysis/AMDGPUAllocation.h

+
+constexpr char AttrSharedMemPadded[] = "amdgpu.shared_mem_padded";


kPaddedScratchShmemAttrName[] = "amdgpu.use_padded_scratch_shmem", to be more precise?

[AMD] Introduce specialized Allocation pass

9d0553e

This PR introduces an AMD-specific allocation pass and a new attribute that controls the scratch-pad conversion method: padded or swizzled. For now, the OptimizeLDSUsage pass sets all convert_layout operations to padded mode.

ThomasRaoux reviewed Jun 26, 2025

lezcano reviewed Jun 27, 2025

add assert for swizzing path

bff8a21

antiagainst requested changes Jun 27, 2025

antiagainst marked this pull request as ready for review June 27, 2025 23:46

antiagainst requested review from zhanglx13 and ptillet as code owners June 27, 2025 23:46
