
[AMDGPU] Fix a potential integer overflow in GCNRegPressure when true16 is enabled #144968


Open
shiltian wants to merge 1 commit into main from users/shiltian/gcn-reg-pressure-crash-div-by-zero

Conversation

shiltian (Contributor) commented:
Fixes SWDEV-537014.

shiltian (Contributor, Author) commented:

This stack of pull requests is managed by Graphite. Learn more about stacking.

llvmbot (Member) commented on Jun 20, 2025:

@llvm/pr-subscribers-backend-amdgpu

Author: Shilei Tian (shiltian)

Changes

Fixes SWDEV-537014.


Full diff: https://github.com/llvm/llvm-project/pull/144968.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/GCNRegPressure.cpp (+17-1)
  • (added) llvm/test/CodeGen/AMDGPU/gcn-reg-pressure-true16-integer-overflow.mir (+70)
diff --git a/llvm/lib/Target/AMDGPU/GCNRegPressure.cpp b/llvm/lib/Target/AMDGPU/GCNRegPressure.cpp
index ce213b91b1f7e..7281524b4d53d 100644
--- a/llvm/lib/Target/AMDGPU/GCNRegPressure.cpp
+++ b/llvm/lib/Target/AMDGPU/GCNRegPressure.cpp
@@ -66,7 +66,23 @@ void GCNRegPressure::inc(unsigned Reg,
       Value[TupleIdx] += Sign * TRI->getRegClassWeight(RC).RegWeight;
     }
     // Pressure scales with number of new registers covered by the new mask.
-    Sign *= SIRegisterInfo::getNumCoveredRegs(~PrevMask & NewMask);
+    // Note that, when true16 is enabled, we can no longer use the following
+    // code to calculate the difference in the number of 32-bit registers
+    // covered by the two masks:
+    //
+    // Sign *= SIRegisterInfo::getNumCoveredRegs(~PrevMask & NewMask);
+    //
+    // The reason is that the combined mask `~PrevMask & NewMask` doesn't
+    // treat a 16-bit register use as a whole 32-bit register use.
+    //
+    // For example, assume PrevMask = 0b0010 and NewMask = 0b1111. The
+    // difference should be 1: even though PrevMask only uses half of a
+    // 32-bit register, we still need to count it as a whole. However,
+    // `~PrevMask & NewMask` gives us 0b1101, for which `getNumCoveredRegs`
+    // returns 2, and that over-count can cause integer overflow when
+    // Sign = -1.
+    Sign *= SIRegisterInfo::getNumCoveredRegs(NewMask) -
+            SIRegisterInfo::getNumCoveredRegs(PrevMask);
   }
   Value[RegKind] += Sign;
 }
diff --git a/llvm/test/CodeGen/AMDGPU/gcn-reg-pressure-true16-integer-overflow.mir b/llvm/test/CodeGen/AMDGPU/gcn-reg-pressure-true16-integer-overflow.mir
new file mode 100644
index 0000000000000..9aac3d74eea4f
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/gcn-reg-pressure-true16-integer-overflow.mir
@@ -0,0 +1,70 @@
+# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 5
+# RUN: llc -x mir -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1102 -run-pass=machine-scheduler %s -o - | FileCheck %s
+
+--- |
+  declare void @llvm.amdgcn.s.waitcnt(i32 immarg)
+
+  declare <2 x i32> @llvm.amdgcn.raw.buffer.load.v2i32(<4 x i32>, i32, i32, i32 immarg)
+
+  define amdgpu_kernel void @foo(ptr %p) {
+  entry:
+    %foo.kernarg.segment = call nonnull align 16 dereferenceable(264) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
+    %p.kernarg.offset1 = bitcast ptr addrspace(4) %foo.kernarg.segment to ptr addrspace(4)
+    %p.load = load ptr, ptr addrspace(4) %p.kernarg.offset1, align 16
+    %call = tail call <2 x i32> @llvm.amdgcn.raw.buffer.load.v2i32(<4 x i32> zeroinitializer, i32 0, i32 0, i32 0)
+    %cast = bitcast <2 x i32> %call to <8 x i8>
+    %shuffle = shufflevector <8 x i8> zeroinitializer, <8 x i8> %cast, <2 x i32> <i32 3, i32 11>
+    %zext = zext <2 x i8> %shuffle to <2 x i16>
+    %shl = shl <2 x i16> %zext, splat (i16 8)
+    store <2 x i16> %shl, ptr %p.load, align 4
+    tail call void @llvm.amdgcn.s.waitcnt(i32 0)
+    ret void
+  }
+
+  declare noundef align 4 ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
+...
+---
+name:            foo
+tracksRegLiveness: true
+liveins:
+  - { reg: '$sgpr4_sgpr5', virtual-reg: '%3' }
+body:             |
+  bb.0.entry:
+    liveins: $sgpr4_sgpr5
+
+    ; CHECK-LABEL: name: foo
+    ; CHECK: liveins: $sgpr4_sgpr5
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: [[COPY:%[0-9]+]]:sgpr_64(p4) = COPY $sgpr4_sgpr5
+    ; CHECK-NEXT: [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 0
+    ; CHECK-NEXT: undef [[COPY1:%[0-9]+]].sub0:sgpr_128 = COPY [[S_MOV_B32_]]
+    ; CHECK-NEXT: [[COPY1:%[0-9]+]].sub1:sgpr_128 = COPY [[S_MOV_B32_]]
+    ; CHECK-NEXT: [[COPY1:%[0-9]+]].sub2:sgpr_128 = COPY [[S_MOV_B32_]]
+    ; CHECK-NEXT: [[COPY1:%[0-9]+]].sub3:sgpr_128 = COPY [[S_MOV_B32_]]
+    ; CHECK-NEXT: [[BUFFER_LOAD_DWORDX2_OFFSET:%[0-9]+]]:vreg_64 = BUFFER_LOAD_DWORDX2_OFFSET [[COPY1]], 0, 0, 0, 0, implicit $exec :: (dereferenceable load (s64), align 1, addrspace 8)
+    ; CHECK-NEXT: [[S_LOAD_DWORDX2_IMM:%[0-9]+]]:sreg_64_xexec = S_LOAD_DWORDX2_IMM [[COPY]](p4), 0, 0 :: (dereferenceable invariant load (s64) from %ir.p.kernarg.offset1, align 16, addrspace 4)
+    ; CHECK-NEXT: [[V_LSHRREV_B64_e64_:%[0-9]+]]:vreg_64 = V_LSHRREV_B64_e64 24, [[BUFFER_LOAD_DWORDX2_OFFSET]], implicit $exec
+    ; CHECK-NEXT: undef [[COPY2:%[0-9]+]].lo16:vgpr_32 = COPY [[V_LSHRREV_B64_e64_]].lo16
+    ; CHECK-NEXT: [[V_LSHLREV_B32_e64_:%[0-9]+]]:vgpr_32 = V_LSHLREV_B32_e64 16, [[COPY2]], implicit $exec
+    ; CHECK-NEXT: [[COPY3:%[0-9]+]]:vreg_64 = COPY [[S_LOAD_DWORDX2_IMM]]
+    ; CHECK-NEXT: [[V_PK_LSHLREV_B16_:%[0-9]+]]:vgpr_32 = V_PK_LSHLREV_B16 0, 8, 8, [[V_LSHLREV_B32_e64_]], 0, 0, 0, 0, 0, implicit $exec
+    ; CHECK-NEXT: FLAT_STORE_DWORD [[COPY3]], [[V_PK_LSHLREV_B16_]], 0, 0, implicit $exec, implicit $flat_scr :: (store (s32) into %ir.p.load)
+    ; CHECK-NEXT: S_WAITCNT 0
+    ; CHECK-NEXT: S_ENDPGM 0
+    %3:sgpr_64(p4) = COPY killed $sgpr4_sgpr5
+    %13:sreg_64_xexec = S_LOAD_DWORDX2_IMM killed %3(p4), 0, 0 :: (dereferenceable invariant load (s64) from %ir.p.kernarg.offset1, align 16, addrspace 4)
+    %14:sreg_32 = S_MOV_B32 0
+    undef %15.sub0:sgpr_128 = COPY %14
+    %15.sub1:sgpr_128 = COPY %14
+    %15.sub2:sgpr_128 = COPY %14
+    %15.sub3:sgpr_128 = COPY killed %14
+    %16:vreg_64 = BUFFER_LOAD_DWORDX2_OFFSET killed %15, 0, 0, 0, 0, implicit $exec :: (dereferenceable load (s64), align 1, addrspace 8)
+    %26:vreg_64 = V_LSHRREV_B64_e64 24, killed %16, implicit $exec
+    undef %28.lo16:vgpr_32 = COPY killed %26.lo16
+    %30:vgpr_32 = V_LSHLREV_B32_e64 16, killed %28, implicit $exec
+    %24:vgpr_32 = V_PK_LSHLREV_B16 0, 8, 8, killed %30, 0, 0, 0, 0, 0, implicit $exec
+    %25:vreg_64 = COPY killed %13
+    FLAT_STORE_DWORD killed %25, killed %24, 0, 0, implicit $exec, implicit $flat_scr :: (store (s32) into %ir.p.load)
+    S_WAITCNT 0
+    S_ENDPGM 0
+...
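To make the failure mode concrete, here is a minimal standalone sketch of the scenario described in the new comment in GCNRegPressure.cpp above. It is illustrative only, not LLVM code: numCoveredRegs reimplements the counting done by SIRegisterInfo::getNumCoveredRegs, under the lane-mask convention that each adjacent (even, odd) bit pair is the (lo16, hi16) half of one 32-bit register.

#include <bit>
#include <cstdint>
#include <cstdio>

// Illustrative stand-in for SIRegisterInfo::getNumCoveredRegs: the number of
// 32-bit registers touched by a lane mask. Folding each odd (hi16) bit onto
// its even (lo16) neighbor makes a half use count as a whole register.
static unsigned numCoveredRegs(uint64_t Mask) {
  return std::popcount((Mask | (Mask >> 1)) & 0x5555555555555555ULL);
}

int main() {
  uint64_t PrevMask = 0b0010; // only hi16 of register 0 is live
  uint64_t NewMask = 0b1111;  // registers 0 and 1 fully live
  int Sign = -1;              // pressure is being decremented

  // Old computation: ~PrevMask & NewMask = 0b1101 covers 2 registers, even
  // though register 0 was already occupied, so the delta is -2.
  int OldDelta = Sign * (int)numCoveredRegs(~PrevMask & NewMask);
  // Fixed computation: 2 - 1 = 1 newly covered register, so the delta is -1.
  int NewDelta = Sign * ((int)numCoveredRegs(NewMask) -
                         (int)numCoveredRegs(PrevMask));
  printf("old delta = %d, fixed delta = %d\n", OldDelta, NewDelta);

  unsigned Pressure = 1; // one 32-bit register's worth of tracked pressure
  // Adding OldDelta here would wrap the unsigned counter; the fixed delta
  // brings it to 0 as expected.
  Pressure += NewDelta;
  return Pressure;
}

Compiled as C++20, this prints old delta = -2, fixed delta = -1; applying the old delta to an unsigned pressure value of 1 would wrap around, which is the overflow this patch fixes.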

Review thread on lines 84 to 87 of llvm/lib/Target/AMDGPU/GCNRegPressure.cpp:

Sign *= SIRegisterInfo::getNumCoveredRegs(NewMask) -
        SIRegisterInfo::getNumCoveredRegs(PrevMask);
Contributor:
Should add a register count difference utility in SIRegisterInfo, can probably avoid doing popcount twice

shiltian (Contributor, Author):
Basically we will need to convert any pair of bits in a mask to 0b11 if it is 0b01 or 0b10. After that, we can continue to use ~PrevMask & NewMask and then call getNumCoveredRegs once (popcount once as well). However, I don't know how to do the preprocessing via bit manipulation. Any suggestions?

shiltian (Contributor, Author):
If we can't use bit manipulation, there is probably no point in adding a new utility function, since it would just do what we are doing here.

Contributor:
Taking inspiration from getNumCoveredRegs, I think

PrevMask | ((PrevMask & 0xAAAAAAAAAAAAAAAAULL) >> 1) | ((PrevMask & 0x5555555555555555ULL) << 1)

should work to transform 0b01/0b10 into 0b11.
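For reference, a quick standalone check of that expression (illustrative; numCoveredRegs again stands in for SIRegisterInfo::getNumCoveredRegs): once both masks are widened this way, ~PrevMask & NewMask keeps only genuinely new registers, so a single getNumCoveredRegs call (one popcount) gives the same delta as the two-call difference.

#include <bit>
#include <cassert>
#include <cstdint>

// Illustrative stand-in for SIRegisterInfo::getNumCoveredRegs.
static unsigned numCoveredRegs(uint64_t Mask) {
  return std::popcount((Mask | (Mask >> 1)) & 0x5555555555555555ULL);
}

// Widen each half-covered pair (0b01 or 0b10) to 0b11 by copying every odd
// (hi16) bit down and every even (lo16) bit up.
static uint64_t widen(uint64_t Mask) {
  return Mask | ((Mask & 0xAAAAAAAAAAAAAAAAULL) >> 1) |
         ((Mask & 0x5555555555555555ULL) << 1);
}

int main() {
  uint64_t PrevMask = 0b0010, NewMask = 0b1111;
  // Widened masks: 0b0011 and 0b1111; ~0b0011 & 0b1111 = 0b1100, which
  // covers exactly the one genuinely new 32-bit register.
  assert(numCoveredRegs(~widen(PrevMask) & widen(NewMask)) == 1);
  // Matches the two-popcount difference used in the patch: 2 - 1 = 1.
  assert(numCoveredRegs(NewMask) - numCoveredRegs(PrevMask) == 1);
  return 0;
}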

shiltian (Contributor, Author):
I took an alternative approach, since the two function calls are made at the beginning of this function anyway.

shiltian force-pushed the users/shiltian/gcn-reg-pressure-crash-div-by-zero branch 2 times, most recently from cfdc82a to 4ada892 on June 20, 2025 at 02:26
shiltian force-pushed the users/shiltian/gcn-reg-pressure-crash-div-by-zero branch from 4ada892 to 5722a6e on June 20, 2025 at 12:44
shiltian requested a review from arsenm on June 20, 2025 at 12:58