[AMD][NOMERGE] fails to fail to pipeline #7330


makslevental (Contributor) commented on Jun 26, 2025

The changed test is too permissive - it should verify that the loop is not actually pipelined:

https://github.com/triton-lang/triton/blob/main/test/TritonGPU/loop-pipeline.mlir#L909-L911

    // check that the load didn't get pipelined.
    // COMMON-NOT: alloc
    // COMMON: scf.for

However, the loop does get pipelined, just with a peeled global load instead of a shared-memory staging buffer, so the checks above still pass:

tt.func @load_two_users_incompatible_layouts(%arg0: !tt.ptr<f16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f16> {tt.divisibility = 16 : i32}) -> (tensor<128x16xf32, #mma>, tensor<128x64xf32, #mma>) {
  %c7_i32 = arith.constant 7 : i32
  %c-1_i32 = arith.constant -1 : i32
  %c0_i32 = arith.constant 0 : i32
  %cst = arith.constant dense<0.000000e+00> : tensor<128x16xf32, #mma>
  %cst_0 = arith.constant dense<0.000000e+00> : tensor<128x64xf32, #mma>
  %c1_i32 = arith.constant 1 : i32
  %0 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked}>>
  %1 = tt.expand_dims %0 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x64xi32, #blocked>
  %2 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<128x64x!tt.ptr<f16>, #blocked>
  %3 = tt.broadcast %1 : tensor<1x64xi32, #blocked> -> tensor<128x64xi32, #blocked>
  %4 = tt.addptr %2, %3 : tensor<128x64x!tt.ptr<f16>, #blocked>, tensor<128x64xi32, #blocked>
  %5 = tt.load %4 : tensor<128x64x!tt.ptr<f16>, #blocked>
  %6 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
  %7 = tt.expand_dims %6 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1>
  %8 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<64x16x!tt.ptr<f16>, #blocked1>
  %9 = tt.broadcast %7 : tensor<64x1xi32, #blocked1> -> tensor<64x16xi32, #blocked1>
  %10 = tt.addptr %8, %9 : tensor<64x16x!tt.ptr<f16>, #blocked1>, tensor<64x16xi32, #blocked1>
  // !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
  // !!!!!!!!!! PIPELINED GLOBAL LOAD !!!!!!!!!!!!!!
  // !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
  %11 = tt.load %10 : tensor<64x16x!tt.ptr<f16>, #blocked1>
  %12:3 = scf.for %arg2 = %c0_i32 to %c7_i32 step %c1_i32 iter_args(%arg3 = %cst_0, %arg4 = %c-1_i32, %arg5 = %11) -> (tensor<128x64xf32, #mma>, i32, tensor<64x16xf16, #blocked1>)  : i32 {
    %22 = arith.addi %arg4, %c1_i32 : i32
    %23 = arith.cmpi slt, %22, %c1_i32 : i32
    %24 = arith.select %23, %22, %c0_i32 : i32
    %25 = tt.load %10 : tensor<64x16x!tt.ptr<f16>, #blocked1>
    %26 = ttg.convert_layout %5 : tensor<128x64xf16, #blocked> -> tensor<128x64xf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
    %27 = ttg.convert_layout %arg5 : tensor<64x16xf16, #blocked1> -> tensor<64x16xf16, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
    %28 = tt.dot %26, %27, %cst : tensor<128x64xf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<64x16xf16, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x16xf32, #mma>
    %29 = arith.truncf %28 : tensor<128x16xf32, #mma> to tensor<128x16xf16, #mma>
    %30 = ttg.convert_layout %29 : tensor<128x16xf16, #mma> -> tensor<128x16xf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
    %31 = ttg.local_alloc %arg5 : (tensor<64x16xf16, #blocked1>) -> !ttg.memdesc<64x16xf16, #shared, #smem>
    %32 = ttg.memdesc_trans %31 {order = array<i32: 1, 0>} : !ttg.memdesc<64x16xf16, #shared, #smem> -> !ttg.memdesc<16x64xf16, #shared1, #smem>
    %33 = ttg.local_load %32 : !ttg.memdesc<16x64xf16, #shared1, #smem> -> tensor<16x64xf16, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
    %34 = tt.dot %30, %33, %arg3 : tensor<128x16xf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<16x64xf16, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x64xf32, #mma>
    scf.yield %34, %24, %25 : tensor<128x64xf32, #mma>, i32, tensor<64x16xf16, #blocked1>
  }
  %13 = ttg.convert_layout %5 : tensor<128x64xf16, #blocked> -> tensor<128x64xf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
  %14 = ttg.convert_layout %12#2 : tensor<64x16xf16, #blocked1> -> tensor<64x16xf16, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
  %15 = tt.dot %13, %14, %cst : tensor<128x64xf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<64x16xf16, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x16xf32, #mma>
  %16 = arith.truncf %15 : tensor<128x16xf32, #mma> to tensor<128x16xf16, #mma>
  %17 = ttg.convert_layout %16 : tensor<128x16xf16, #mma> -> tensor<128x16xf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
  %18 = ttg.local_alloc %12#2 : (tensor<64x16xf16, #blocked1>) -> !ttg.memdesc<64x16xf16, #shared, #smem>
  %19 = ttg.memdesc_trans %18 {order = array<i32: 1, 0>} : !ttg.memdesc<64x16xf16, #shared, #smem> -> !ttg.memdesc<16x64xf16, #shared1, #smem>
  %20 = ttg.local_load %19 : !ttg.memdesc<16x64xf16, #shared1, #smem> -> tensor<16x64xf16, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
  %21 = tt.dot %17, %20, %12#0 : tensor<128x16xf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<16x64xf16, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x64xf32, #mma>
  tt.return %15, %21 : tensor<128x16xf32, #mma>, tensor<128x64xf32, #mma>
}

This is fortunately fixed in #7222, where we move to using the core canHaveSharedEncoding inside isPipeliningBeneficial.
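
Independent of that fix, the check itself could be tightened so that it actually fails on the output above. A rough sketch (the extra patterns are hypothetical and would have to be validated against the real non-pipelined output; they assume the 128x64 load legitimately stays in front of the loop, while the 64x16 load must only be issued inside the loop body):

    // check that the load didn't get pipelined: the 64x16 load must not be
    // peeled in front of the loop, and no staging alloc may be introduced.
    // COMMON: tt.load {{.*}} tensor<128x64x
    // COMMON-NOT: alloc
    // COMMON-NOT: tt.load
    // COMMON: scf.for
    // COMMON: tt.load {{.*}} tensor<64x16x

With something like that, the peeled %11 = tt.load %10 in front of scf.for in the IR above would trip the CHECK-NOT and the test would fail as intended.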

makslevental force-pushed the makslevental/broken-pipeline branch from 04de539 to 39bcc4a on June 26, 2025 17:34
makslevental changed the title from "[AMD] fails to fail to pipeline" to "[AMD][NOMERGE] fails to fail to pipeline" on Jun 26, 2025

makslevental (Contributor, Author) commented on Jun 26, 2025

Just to be clear on my reasoning here: the first RUN line in this lit test does the correct thing (it succeeds in not pipelining the loop at all):

// RUN: triton-opt %s -split-input-file -tritongpu-assign-latencies \
// RUN:   -tritongpu-schedule-loops -tritongpu-pipeline=num-stages=3 -canonicalize

i.e., the core pipeliner does the right thing and AMD's pipeliner does not.
