[AMDGPU] Fix a potential integer overflow in GCNRegPressure when true16 is enabled #144968
@llvm/pr-subscribers-backend-amdgpu

Author: Shilei Tian (shiltian)

Changes: Fixes SWDEV-537014.

Full diff: https://github.com/llvm/llvm-project/pull/144968.diff

2 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/GCNRegPressure.cpp b/llvm/lib/Target/AMDGPU/GCNRegPressure.cpp
index ce213b91b1f7e..7281524b4d53d 100644
--- a/llvm/lib/Target/AMDGPU/GCNRegPressure.cpp
+++ b/llvm/lib/Target/AMDGPU/GCNRegPressure.cpp
@@ -66,7 +66,23 @@ void GCNRegPressure::inc(unsigned Reg,
Value[TupleIdx] += Sign * TRI->getRegClassWeight(RC).RegWeight;
}
// Pressure scales with number of new registers covered by the new mask.
- Sign *= SIRegisterInfo::getNumCoveredRegs(~PrevMask & NewMask);
+ // Note that, when true16 is enabled, we can no longer use the following
+ // code to calculate the difference in the number of 32-bit registers
+ // covered by the two masks:
+ //
+ //   Sign *= SIRegisterInfo::getNumCoveredRegs(~PrevMask & NewMask);
+ //
+ // The reason is that the combined mask `~PrevMask & NewMask` doesn't treat
+ // a 16-bit register use as a use of the whole 32-bit register.
+ //
+ // For example, assume PrevMask = 0b0010 and NewMask = 0b1111. The
+ // difference should be 1: even though PrevMask only uses half of a 32-bit
+ // register, we still need to count it as a whole. However,
+ // `~PrevMask & NewMask` gives us 0b1101, for which `getNumCoveredRegs`
+ // returns 2, and that can cause an integer overflow when Sign = -1.
+ Sign *= SIRegisterInfo::getNumCoveredRegs(NewMask) -
+ SIRegisterInfo::getNumCoveredRegs(PrevMask);
}
Value[RegKind] += Sign;
}
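To make the failure mode concrete, here is a standalone walkthrough of the miscount (a hand-computed sketch, not LLVM code; the covered-register counts for each mask are worked out by hand in the comments):

```cpp
#include <cstdio>

// Hand-computed sketch of the overflow with two-lane-bits-per-register
// masks (the true16 layout). "Covered registers" means the number of
// 32-bit registers a lane mask touches.
int main() {
  unsigned Pressure = 0;

  // Def with mask 0b1111: two 32-bit registers become live.
  Pressure += 2;

  // The live mask shrinks from 0b1111 to 0b0010. The correct delta is 1
  // (one register goes fully dead, the other stays half-live), but the old
  // formula counts the registers covered by ~0b0010 & 0b1111 == 0b1101,
  // which is 2.
  Pressure -= 2;

  // The remaining half-live register dies: one more register is released.
  Pressure -= 1;

  // Pressure is now (unsigned)-1 instead of 0: the overflow this patch
  // fixes. With the corrected delta (2 - 1 == 1), the count ends at 0.
  std::printf("Pressure = %u\n", Pressure);
}
```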
diff --git a/llvm/test/CodeGen/AMDGPU/gcn-reg-pressure-true16-integer-overflow.mir b/llvm/test/CodeGen/AMDGPU/gcn-reg-pressure-true16-integer-overflow.mir
new file mode 100644
index 0000000000000..9aac3d74eea4f
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/gcn-reg-pressure-true16-integer-overflow.mir
@@ -0,0 +1,70 @@
+# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 5
+# RUN: llc -x mir -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1102 -run-pass=machine-scheduler %s -o - | FileCheck %s
+
+--- |
+ declare void @llvm.amdgcn.s.waitcnt(i32 immarg)
+
+ declare <2 x i32> @llvm.amdgcn.raw.buffer.load.v2i32(<4 x i32>, i32, i32, i32 immarg)
+
+ define amdgpu_kernel void @foo(ptr %p) {
+ entry:
+ %foo.kernarg.segment = call nonnull align 16 dereferenceable(264) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
+ %p.kernarg.offset1 = bitcast ptr addrspace(4) %foo.kernarg.segment to ptr addrspace(4)
+ %p.load = load ptr, ptr addrspace(4) %p.kernarg.offset1, align 16
+ %call = tail call <2 x i32> @llvm.amdgcn.raw.buffer.load.v2i32(<4 x i32> zeroinitializer, i32 0, i32 0, i32 0)
+ %cast = bitcast <2 x i32> %call to <8 x i8>
+ %shuffle = shufflevector <8 x i8> zeroinitializer, <8 x i8> %cast, <2 x i32> <i32 3, i32 11>
+ %zext = zext <2 x i8> %shuffle to <2 x i16>
+ %shl = shl <2 x i16> %zext, splat (i16 8)
+ store <2 x i16> %shl, ptr %p.load, align 4
+ tail call void @llvm.amdgcn.s.waitcnt(i32 0)
+ ret void
+ }
+
+ declare noundef align 4 ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
+...
+---
+name: foo
+tracksRegLiveness: true
+liveins:
+ - { reg: '$sgpr4_sgpr5', virtual-reg: '%3' }
+body: |
+ bb.0.entry:
+ liveins: $sgpr4_sgpr5
+
+ ; CHECK-LABEL: name: foo
+ ; CHECK: liveins: $sgpr4_sgpr5
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: [[COPY:%[0-9]+]]:sgpr_64(p4) = COPY $sgpr4_sgpr5
+ ; CHECK-NEXT: [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 0
+ ; CHECK-NEXT: undef [[COPY1:%[0-9]+]].sub0:sgpr_128 = COPY [[S_MOV_B32_]]
+ ; CHECK-NEXT: [[COPY1:%[0-9]+]].sub1:sgpr_128 = COPY [[S_MOV_B32_]]
+ ; CHECK-NEXT: [[COPY1:%[0-9]+]].sub2:sgpr_128 = COPY [[S_MOV_B32_]]
+ ; CHECK-NEXT: [[COPY1:%[0-9]+]].sub3:sgpr_128 = COPY [[S_MOV_B32_]]
+ ; CHECK-NEXT: [[BUFFER_LOAD_DWORDX2_OFFSET:%[0-9]+]]:vreg_64 = BUFFER_LOAD_DWORDX2_OFFSET [[COPY1]], 0, 0, 0, 0, implicit $exec :: (dereferenceable load (s64), align 1, addrspace 8)
+ ; CHECK-NEXT: [[S_LOAD_DWORDX2_IMM:%[0-9]+]]:sreg_64_xexec = S_LOAD_DWORDX2_IMM [[COPY]](p4), 0, 0 :: (dereferenceable invariant load (s64) from %ir.p.kernarg.offset1, align 16, addrspace 4)
+ ; CHECK-NEXT: [[V_LSHRREV_B64_e64_:%[0-9]+]]:vreg_64 = V_LSHRREV_B64_e64 24, [[BUFFER_LOAD_DWORDX2_OFFSET]], implicit $exec
+ ; CHECK-NEXT: undef [[COPY2:%[0-9]+]].lo16:vgpr_32 = COPY [[V_LSHRREV_B64_e64_]].lo16
+ ; CHECK-NEXT: [[V_LSHLREV_B32_e64_:%[0-9]+]]:vgpr_32 = V_LSHLREV_B32_e64 16, [[COPY2]], implicit $exec
+ ; CHECK-NEXT: [[COPY3:%[0-9]+]]:vreg_64 = COPY [[S_LOAD_DWORDX2_IMM]]
+ ; CHECK-NEXT: [[V_PK_LSHLREV_B16_:%[0-9]+]]:vgpr_32 = V_PK_LSHLREV_B16 0, 8, 8, [[V_LSHLREV_B32_e64_]], 0, 0, 0, 0, 0, implicit $exec
+ ; CHECK-NEXT: FLAT_STORE_DWORD [[COPY3]], [[V_PK_LSHLREV_B16_]], 0, 0, implicit $exec, implicit $flat_scr :: (store (s32) into %ir.p.load)
+ ; CHECK-NEXT: S_WAITCNT 0
+ ; CHECK-NEXT: S_ENDPGM 0
+ %3:sgpr_64(p4) = COPY killed $sgpr4_sgpr5
+ %13:sreg_64_xexec = S_LOAD_DWORDX2_IMM killed %3(p4), 0, 0 :: (dereferenceable invariant load (s64) from %ir.p.kernarg.offset1, align 16, addrspace 4)
+ %14:sreg_32 = S_MOV_B32 0
+ undef %15.sub0:sgpr_128 = COPY %14
+ %15.sub1:sgpr_128 = COPY %14
+ %15.sub2:sgpr_128 = COPY %14
+ %15.sub3:sgpr_128 = COPY killed %14
+ %16:vreg_64 = BUFFER_LOAD_DWORDX2_OFFSET killed %15, 0, 0, 0, 0, implicit $exec :: (dereferenceable load (s64), align 1, addrspace 8)
+ %26:vreg_64 = V_LSHRREV_B64_e64 24, killed %16, implicit $exec
+ undef %28.lo16:vgpr_32 = COPY killed %26.lo16
+ %30:vgpr_32 = V_LSHLREV_B32_e64 16, killed %28, implicit $exec
+ %24:vgpr_32 = V_PK_LSHLREV_B16 0, 8, 8, killed %30, 0, 0, 0, 0, 0, implicit $exec
+ %25:vreg_64 = COPY killed %13
+ FLAT_STORE_DWORD killed %25, killed %24, 0, 0, implicit $exec, implicit $flat_scr :: (store (s32) into %ir.p.load)
+ S_WAITCNT 0
+ S_ENDPGM 0
+...
Sign *= SIRegisterInfo::getNumCoveredRegs(NewMask) -
        SIRegisterInfo::getNumCoveredRegs(PrevMask);
Should add a register count difference utility in SIRegisterInfo; it can probably avoid doing the popcount twice.
Basically we will need to convert any pair of bits in a mask to `0b11` if it is `0b01` or `0b10`. After that, we can continue to use `~PrevMask & NewMask` and then call `getNumCoveredRegs` once (popcount once as well). However, I don't know how to do the preprocessing via bit manipulation. Any suggestions?
If we can't use bit manipulation, there is probably no point in adding a new utility function, since it would just do what we are doing here.
Taking inspiration from `getNumCoveredRegs`, I think `PrevMask | ((PrevMask & 0xAAAAAAAAAAAAAAAAULL) >> 1) | ((PrevMask & 0x5555555555555555ULL) << 1)` should work to transform `0b01`/`0b10` into `0b11`.
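A quick self-contained check of that transform (a sketch under the assumption that lane masks use two bits per 32-bit register; `widenHalves` and `numCoveredRegs` are hypothetical stand-ins, not LLVM APIs):

```cpp
#include <bit>
#include <cstdint>
#include <cstdio>

// Widen each two-bit lane pair so 0b01 or 0b10 becomes 0b11, i.e. a
// half-register use is treated as covering the whole 32-bit register.
static uint64_t widenHalves(uint64_t Mask) {
  return Mask | ((Mask & 0xAAAAAAAAAAAAAAAAULL) >> 1) |
         ((Mask & 0x5555555555555555ULL) << 1);
}

// Count the 32-bit registers covered by a mask (one pair of bits each).
static unsigned numCoveredRegs(uint64_t Mask) {
  return std::popcount(widenHalves(Mask) & 0x5555555555555555ULL);
}

int main() {
  uint64_t PrevMask = 0b0010, NewMask = 0b1111;
  // Naive combined mask over-counts: 0b1101 covers 2 registers.
  unsigned Naive = numCoveredRegs(~PrevMask & NewMask);
  // Widening PrevMask first gives 0b0011, so the delta is the expected 1.
  unsigned Widened = numCoveredRegs(~widenHalves(PrevMask) & NewMask);
  std::printf("naive = %u, widened = %u\n", Naive, Widened); // prints 2, 1
}
```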
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took an alternative approach, since both `getNumCoveredRegs` calls are already made at the beginning of this function anyway.
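Roughly, the shape of that alternative (a sketch only; the names and the stand-in counting helper below are assumptions, not the committed code):

```cpp
#include <bit>
#include <cstdint>

// Stand-in for SIRegisterInfo::getNumCoveredRegs (see the sketch above).
static unsigned getNumCoveredRegs(uint64_t Mask) {
  return std::popcount(
      (Mask | ((Mask & 0xAAAAAAAAAAAAAAAAULL) >> 1)) & 0x5555555555555555ULL);
}

// inc() already needs both covered-register counts for its early-exit
// check, so compute them once and derive the delta from the cached values
// instead of running a second pair of popcounts later.
static int coveredRegsDelta(uint64_t PrevMask, uint64_t NewMask) {
  unsigned PrevNum = getNumCoveredRegs(PrevMask);
  unsigned NewNum = getNumCoveredRegs(NewMask);
  if (NewNum == PrevNum)
    return 0; // no change in pressure; inc() would bail out here
  return static_cast<int>(NewNum) - static_cast<int>(PrevNum);
}
```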