[AMD][NOMERGE] fails to fail to pipeline #7330


makslevental (Contributor) commented on Jun 26, 2025

The changed test is too permissive - it should verify that the loop is not actually pipelined:

https://github.com/triton-lang/triton/blob/main/test/TritonGPU/loop-pipeline.mlir#L909-L911

    // check that the load didn't get pipelined.
    // COMMON-NOT: alloc
    // COMMON: scf.for

However, the loop does get pipelined, just with a peeled global load instead of a shared-memory staging buffer, so the checks above still pass:

tt.func @load_two_users_incompatible_layouts(%arg0: !tt.ptr<f16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f16> {tt.divisibility = 16 : i32}) -> (tensor<128x16xf32, #mma>, tensor<128x64xf32, #mma>) {
  %c7_i32 = arith.constant 7 : i32
  %c-1_i32 = arith.constant -1 : i32
  %c0_i32 = arith.constant 0 : i32
  %cst = arith.constant dense<0.000000e+00> : tensor<128x16xf32, #mma>
  %cst_0 = arith.constant dense<0.000000e+00> : tensor<128x64xf32, #mma>
  %c1_i32 = arith.constant 1 : i32
  %0 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked}>>
  %1 = tt.expand_dims %0 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x64xi32, #blocked>
  %2 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<128x64x!tt.ptr<f16>, #blocked>
  %3 = tt.broadcast %1 : tensor<1x64xi32, #blocked> -> tensor<128x64xi32, #blocked>
  %4 = tt.addptr %2, %3 : tensor<128x64x!tt.ptr<f16>, #blocked>, tensor<128x64xi32, #blocked>
  %5 = tt.load %4 : tensor<128x64x!tt.ptr<f16>, #blocked>
  %6 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
  %7 = tt.expand_dims %6 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1>
  %8 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<64x16x!tt.ptr<f16>, #blocked1>
  %9 = tt.broadcast %7 : tensor<64x1xi32, #blocked1> -> tensor<64x16xi32, #blocked1>
  %10 = tt.addptr %8, %9 : tensor<64x16x!tt.ptr<f16>, #blocked1>, tensor<64x16xi32, #blocked1>
  // !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
  // !!!!!!!!!! PIPELINED GLOBAL LOAD !!!!!!!!!!!!!!
  // !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
  %11 = tt.load %10 : tensor<64x16x!tt.ptr<f16>, #blocked1>
  %12:3 = scf.for %arg2 = %c0_i32 to %c7_i32 step %c1_i32 iter_args(%arg3 = %cst_0, %arg4 = %c-1_i32, %arg5 = %11) -> (tensor<128x64xf32, #mma>, i32, tensor<64x16xf16, #blocked1>)  : i32 {
    %22 = arith.addi %arg4, %c1_i32 : i32
    %23 = arith.cmpi slt, %22, %c1_i32 : i32
    %24 = arith.select %23, %22, %c0_i32 : i32
    %25 = tt.load %10 : tensor<64x16x!tt.ptr<f16>, #blocked1>
    %26 = ttg.convert_layout %5 : tensor<128x64xf16, #blocked> -> tensor<128x64xf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
    %27 = ttg.convert_layout %arg5 : tensor<64x16xf16, #blocked1> -> tensor<64x16xf16, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
    %28 = tt.dot %26, %27, %cst : tensor<128x64xf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<64x16xf16, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x16xf32, #mma>
    %29 = arith.truncf %28 : tensor<128x16xf32, #mma> to tensor<128x16xf16, #mma>
    %30 = ttg.convert_layout %29 : tensor<128x16xf16, #mma> -> tensor<128x16xf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
    %31 = ttg.local_alloc %arg5 : (tensor<64x16xf16, #blocked1>) -> !ttg.memdesc<64x16xf16, #shared, #smem>
    %32 = ttg.memdesc_trans %31 {order = array<i32: 1, 0>} : !ttg.memdesc<64x16xf16, #shared, #smem> -> !ttg.memdesc<16x64xf16, #shared1, #smem>
    %33 = ttg.local_load %32 : !ttg.memdesc<16x64xf16, #shared1, #smem> -> tensor<16x64xf16, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
    %34 = tt.dot %30, %33, %arg3 : tensor<128x16xf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<16x64xf16, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x64xf32, #mma>
    scf.yield %34, %24, %25 : tensor<128x64xf32, #mma>, i32, tensor<64x16xf16, #blocked1>
  }
  %13 = ttg.convert_layout %5 : tensor<128x64xf16, #blocked> -> tensor<128x64xf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
  %14 = ttg.convert_layout %12#2 : tensor<64x16xf16, #blocked1> -> tensor<64x16xf16, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
  %15 = tt.dot %13, %14, %cst : tensor<128x64xf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<64x16xf16, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x16xf32, #mma>
  %16 = arith.truncf %15 : tensor<128x16xf32, #mma> to tensor<128x16xf16, #mma>
  %17 = ttg.convert_layout %16 : tensor<128x16xf16, #mma> -> tensor<128x16xf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
  %18 = ttg.local_alloc %12#2 : (tensor<64x16xf16, #blocked1>) -> !ttg.memdesc<64x16xf16, #shared, #smem>
  %19 = ttg.memdesc_trans %18 {order = array<i32: 1, 0>} : !ttg.memdesc<64x16xf16, #shared, #smem> -> !ttg.memdesc<16x64xf16, #shared1, #smem>
  %20 = ttg.local_load %19 : !ttg.memdesc<16x64xf16, #shared1, #smem> -> tensor<16x64xf16, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
  %21 = tt.dot %17, %20, %12#0 : tensor<128x16xf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<16x64xf16, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x64xf32, #mma>
  tt.return %15, %21 : tensor<128x16xf32, #mma>, tensor<128x64xf32, #mma>
}

This is fortunately fixed in #7222, where we move to using the core canHaveSharedEncoding inside isPipeliningBeneficial.
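
Independent of that fix, the check itself could be tightened so that it actually fails on the output above. A rough sketch (the extra patterns are hypothetical and would have to be validated against the real non-pipelined output; they assume the 128x64 load legitimately stays in front of the loop, while the 64x16 load must only be issued inside the loop body):

    // check that the load didn't get pipelined: the 64x16 load must not be
    // peeled in front of the loop, and no staging alloc may be introduced.
    // COMMON: tt.load {{.*}} tensor<128x64x
    // COMMON-NOT: alloc
    // COMMON-NOT: tt.load
    // COMMON: scf.for
    // COMMON: tt.load {{.*}} tensor<64x16x

With something like that, the peeled %11 = tt.load %10 in front of scf.for in the IR above would trip the CHECK-NOT and the test would fail as intended.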

makslevental force-pushed the makslevental/broken-pipeline branch from 04de539 to 39bcc4a on June 26, 2025 17:34
makslevental changed the title from "[AMD] fails to fail to pipeline" to "[AMD][NOMERGE] fails to fail to pipeline" on Jun 26, 2025

makslevental (Contributor, Author) commented on Jun 26, 2025

Just to be clear on my reasoning here: the first RUN line in this lit test does the correct thing (it succeeds in not pipelining the loop at all):

// RUN: triton-opt %s -split-input-file -tritongpu-assign-latencies \
// RUN:   -tritongpu-schedule-loops -tritongpu-pipeline=num-stages=3 -canonicalize

i.e., the core pipeliner does the right thing and AMD's pipeliner does not.
