
[CodeGen] Avoid sinking vector comparisons during CodeGenPrepare #113158

Closed
wants to merge 2 commits into from

Conversation

david-arm
Contributor

@david-arm david-arm commented Oct 21, 2024

[CodeGen] Avoid sinking vector comparisons during CodeGenPrepare

Whilst reviewing PR #109289 and doing some analysis with various
tests involving predicated blocks I noticed that we're making
codegen and performance worse by sinking vector comparisons
multiple times into blocks. It looks like the sinkCmpExpression
in CodeGenPrepare was written for scalar comparisons where there
is only a single condition register, whereas vector comparisons
typically produce a vector result. For some targets, such as NEON
or SVE, there are multiple allocatable vector registers that can
store the result and so we should avoid sinking in that case.
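
To make the problem concrete, here is a rough IR-level sketch (a made-up example, not taken from the patch's tests) of what sinkCmpExpression does: a single vector compare whose lanes are tested in later predicated blocks gets re-materialised in every block that uses it, so the vector comparison ends up being executed several times.

; Before sinking: one vector compare, with lanes extracted in later blocks.
vector.body:
  %c = icmp ult <4 x i32> %vec.iv, <i32 14, i32 14, i32 14, i32 14>
  ...

pred.store.continue:
  %lane1 = extractelement <4 x i1> %c, i64 1
  br i1 %lane1, label %pred.store.if1, label %pred.store.continue2

; After sinking (sketch): the compare is duplicated into each user block.
pred.store.continue:
  %c.sunk = icmp ult <4 x i32> %vec.iv, <i32 14, i32 14, i32 14, i32 14>
  %lane1 = extractelement <4 x i1> %c.sunk, i64 1
  br i1 %lane1, label %pred.store.if1, label %pred.store.continue2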

@llvmbot
Member

llvmbot commented Oct 21, 2024

@llvm/pr-subscribers-backend-amdgpu
@llvm/pr-subscribers-backend-powerpc
@llvm/pr-subscribers-backend-x86

@llvm/pr-subscribers-backend-aarch64

Author: David Sherwood (david-arm)

Changes

Whilst reviewing PR #109289 and doing some analysis with various
tests involving predicated blocks I noticed that we're making
codegen and performance worse by sinking vector comparisons
multiple times into blocks. It looks like the sinkCmpExpression
in CodeGenPrepare was written for scalar comparisons where there
is only a single condition register, whereas vector comparisons
typically produce a vector result and register pressure is much
lower. Given they are likely to be more expensive than scalar
comparisons, it makes sense to avoid sinking too many. The
CodeGen/SystemZ/vec-perm-14.ll test does rely upon sinking a
vector comparison so I've kept that behaviour by permitting one
sink.

Alternatively, I could also introduce a TLI hook to query the
target if this is a preferred solution?


Patch is 42.92 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/113158.diff

4 Files Affected:

  • (modified) llvm/lib/CodeGen/CodeGenPrepare.cpp (+32-21)
  • (added) llvm/test/CodeGen/AArch64/no-sink-vector-cmp.ll (+123)
  • (modified) llvm/test/CodeGen/Thumb2/LowOverheadLoops/fast-fp-loops.ll (+75-79)
  • (modified) llvm/test/CodeGen/X86/masked_gather.ll (+150-169)
diff --git a/llvm/lib/CodeGen/CodeGenPrepare.cpp b/llvm/lib/CodeGen/CodeGenPrepare.cpp
index 86f28293ba9ff8..80ace20bfc67ab 100644
--- a/llvm/lib/CodeGen/CodeGenPrepare.cpp
+++ b/llvm/lib/CodeGen/CodeGenPrepare.cpp
@@ -1779,29 +1779,35 @@ static bool sinkCmpExpression(CmpInst *Cmp, const TargetLowering &TLI) {
   if (TLI.useSoftFloat() && isa<FCmpInst>(Cmp))
     return false;
 
-  // Only insert a cmp in each block once.
-  DenseMap<BasicBlock *, CmpInst *> InsertedCmps;
+  // Collect a list of non-phis users that are in blocks that are different to
+  // the definition block.
+  BasicBlock *DefBB = Cmp->getParent();
+  SmallSet<User *, 4> Users;
+  for (auto *U : Cmp->users()) {
+    Instruction *User = cast<Instruction>(U);
+    if (isa<PHINode>(User))
+      continue;
 
-  bool MadeChange = false;
-  for (Value::user_iterator UI = Cmp->user_begin(), E = Cmp->user_end();
-       UI != E;) {
-    Use &TheUse = UI.getUse();
-    Instruction *User = cast<Instruction>(*UI);
+    if (User->getParent() == DefBB)
+      continue;
 
-    // Preincrement use iterator so we don't invalidate it.
-    ++UI;
+    Users.insert(User);
+  }
 
-    // Don't bother for PHI nodes.
-    if (isa<PHINode>(User))
-      continue;
+  // If this is a vector comparison the result will likely not depend upon
+  // setting a condition register, and it's probably too expensive to sink too
+  // many times.
+  VectorType *VecType = dyn_cast<VectorType>(Cmp->getType());
+  if (VecType && VecType->getElementCount().isVector() && Users.size() > 1)
+    return false;
 
-    // Figure out which BB this cmp is used in.
-    BasicBlock *UserBB = User->getParent();
-    BasicBlock *DefBB = Cmp->getParent();
+  // Only insert a cmp in each block once.
+  DenseMap<BasicBlock *, CmpInst *> InsertedCmps;
 
-    // If this user is in the same block as the cmp, don't change the cmp.
-    if (UserBB == DefBB)
-      continue;
+  bool MadeChange = false;
+  for (auto *U : Users) {
+    Instruction *UI = cast<Instruction>(U);
+    BasicBlock *UserBB = UI->getParent();
 
     // If we have already inserted a cmp into this block, use it.
     CmpInst *&InsertedCmp = InsertedCmps[UserBB];
@@ -1816,10 +1822,15 @@ static bool sinkCmpExpression(CmpInst *Cmp, const TargetLowering &TLI) {
       InsertedCmp->setDebugLoc(Cmp->getDebugLoc());
     }
 
-    // Replace a use of the cmp with a use of the new cmp.
-    TheUse = InsertedCmp;
+    // Replace all uses of the cmp with a use of the new cmp and update the
+    // number of uses.
+    for (unsigned I = 0; I < U->getNumOperands(); I++)
+      if (U->getOperand(I) == Cmp) {
+        U->setOperand(I, InsertedCmp);
+        NumCmpUses++;
+      }
+
     MadeChange = true;
-    ++NumCmpUses;
   }
 
   // If we removed all uses, nuke the cmp.
diff --git a/llvm/test/CodeGen/AArch64/no-sink-vector-cmp.ll b/llvm/test/CodeGen/AArch64/no-sink-vector-cmp.ll
new file mode 100644
index 00000000000000..93879d41a25432
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/no-sink-vector-cmp.ll
@@ -0,0 +1,123 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc < %s | FileCheck %s
+
+target triple = "aarch64-unknown-linux-gnu"
+
+define void @vector_loop_with_icmp(ptr nocapture noundef writeonly %dest) {
+; CHECK-LABEL: vector_loop_with_icmp:
+; CHECK:       // %bb.0: // %entry
+; CHECK-NEXT:    mov w9, #15 // =0xf
+; CHECK-NEXT:    mov w10, #4 // =0x4
+; CHECK-NEXT:    adrp x8, .LCPI0_0
+; CHECK-NEXT:    adrp x11, .LCPI0_1
+; CHECK-NEXT:    dup v0.2d, x9
+; CHECK-NEXT:    dup v1.2d, x10
+; CHECK-NEXT:    ldr q2, [x8, :lo12:.LCPI0_0]
+; CHECK-NEXT:    ldr q3, [x11, :lo12:.LCPI0_1]
+; CHECK-NEXT:    add x8, x0, #8
+; CHECK-NEXT:    mov w9, #16 // =0x10
+; CHECK-NEXT:    mov w10, #1 // =0x1
+; CHECK-NEXT:    b .LBB0_2
+; CHECK-NEXT:  .LBB0_1: // %pred.store.continue18
+; CHECK-NEXT:    // in Loop: Header=BB0_2 Depth=1
+; CHECK-NEXT:    add v2.2d, v2.2d, v1.2d
+; CHECK-NEXT:    add v3.2d, v3.2d, v1.2d
+; CHECK-NEXT:    subs x9, x9, #4
+; CHECK-NEXT:    add x8, x8, #16
+; CHECK-NEXT:    b.eq .LBB0_10
+; CHECK-NEXT:  .LBB0_2: // %vector.body
+; CHECK-NEXT:    // =>This Inner Loop Header: Depth=1
+; CHECK-NEXT:    cmhi v4.2d, v0.2d, v2.2d
+; CHECK-NEXT:    cmhi v5.2d, v0.2d, v3.2d
+; CHECK-NEXT:    uzp1 v4.4s, v5.4s, v4.4s
+; CHECK-NEXT:    xtn v4.4h, v4.4s
+; CHECK-NEXT:    umov w11, v4.h[0]
+; CHECK-NEXT:    tbnz w11, #0, .LBB0_6
+; CHECK-NEXT:  // %bb.3: // %pred.store.continue
+; CHECK-NEXT:    // in Loop: Header=BB0_2 Depth=1
+; CHECK-NEXT:    umov w11, v4.h[1]
+; CHECK-NEXT:    tbnz w11, #0, .LBB0_7
+; CHECK-NEXT:  .LBB0_4: // %pred.store.continue6
+; CHECK-NEXT:    // in Loop: Header=BB0_2 Depth=1
+; CHECK-NEXT:    umov w11, v4.h[2]
+; CHECK-NEXT:    tbnz w11, #0, .LBB0_8
+; CHECK-NEXT:  .LBB0_5: // %pred.store.continue8
+; CHECK-NEXT:    // in Loop: Header=BB0_2 Depth=1
+; CHECK-NEXT:    umov w11, v4.h[3]
+; CHECK-NEXT:    tbz w11, #0, .LBB0_1
+; CHECK-NEXT:    b .LBB0_9
+; CHECK-NEXT:  .LBB0_6: // %pred.store.if
+; CHECK-NEXT:    // in Loop: Header=BB0_2 Depth=1
+; CHECK-NEXT:    stur w10, [x8, #-8]
+; CHECK-NEXT:    umov w11, v4.h[1]
+; CHECK-NEXT:    tbz w11, #0, .LBB0_4
+; CHECK-NEXT:  .LBB0_7: // %pred.store.if5
+; CHECK-NEXT:    // in Loop: Header=BB0_2 Depth=1
+; CHECK-NEXT:    stur w10, [x8, #-4]
+; CHECK-NEXT:    umov w11, v4.h[2]
+; CHECK-NEXT:    tbz w11, #0, .LBB0_5
+; CHECK-NEXT:  .LBB0_8: // %pred.store.if7
+; CHECK-NEXT:    // in Loop: Header=BB0_2 Depth=1
+; CHECK-NEXT:    str w10, [x8]
+; CHECK-NEXT:    umov w11, v4.h[3]
+; CHECK-NEXT:    tbz w11, #0, .LBB0_1
+; CHECK-NEXT:  .LBB0_9: // %pred.store.if9
+; CHECK-NEXT:    // in Loop: Header=BB0_2 Depth=1
+; CHECK-NEXT:    str w10, [x8, #4]
+; CHECK-NEXT:    b .LBB0_1
+; CHECK-NEXT:  .LBB0_10: // %for.cond.cleanup
+; CHECK-NEXT:    ret
+entry:
+  br label %vector.body
+
+vector.body:
+  %index = phi i64 [ 0, %entry ], [ %index.next, %pred.store.continue18 ]
+  %vec.ind = phi <4 x i64> [ <i64 0, i64 1, i64 2, i64 3>, %entry ], [ %vec.ind.next, %pred.store.continue18 ]
+  %0 = icmp ult <4 x i64> %vec.ind, <i64 15, i64 15, i64 15, i64 15>
+  %1 = extractelement <4 x i1> %0, i64 0
+  br i1 %1, label %pred.store.if, label %pred.store.continue
+
+pred.store.if:
+  %2 = getelementptr inbounds i32, ptr %dest, i64 %index
+  store i32 1, ptr %2, align 4
+  br label %pred.store.continue
+
+pred.store.continue:
+  %3 = extractelement <4 x i1> %0, i64 1
+  br i1 %3, label %pred.store.if5, label %pred.store.continue6
+
+pred.store.if5:
+  %4 = or disjoint i64 %index, 1
+  %5 = getelementptr inbounds i32, ptr %dest, i64 %4
+  store i32 1, ptr %5, align 4
+  br label %pred.store.continue6
+
+pred.store.continue6:
+  %6 = extractelement <4 x i1> %0, i64 2
+  br i1 %6, label %pred.store.if7, label %pred.store.continue8
+
+pred.store.if7:
+  %7 = or disjoint i64 %index, 2
+  %8 = getelementptr inbounds i32, ptr %dest, i64 %7
+  store i32 1, ptr %8, align 4
+  br label %pred.store.continue8
+
+pred.store.continue8:
+  %9 = extractelement <4 x i1> %0, i64 3
+  br i1 %9, label %pred.store.if9, label %pred.store.continue18
+
+pred.store.if9:
+  %10 = or disjoint i64 %index, 3
+  %11 = getelementptr inbounds i32, ptr %dest, i64 %10
+  store i32 1, ptr %11, align 4
+  br label %pred.store.continue18
+
+pred.store.continue18:
+  %index.next = add i64 %index, 4
+  %vec.ind.next = add <4 x i64> %vec.ind, <i64 4, i64 4, i64 4, i64 4>
+  %24 = icmp eq i64 %index.next, 16
+  br i1 %24, label %for.cond.cleanup, label %vector.body
+
+for.cond.cleanup:
+  ret void
+}
diff --git a/llvm/test/CodeGen/Thumb2/LowOverheadLoops/fast-fp-loops.ll b/llvm/test/CodeGen/Thumb2/LowOverheadLoops/fast-fp-loops.ll
index 2fdf534d526565..b9f1e2d21674f3 100644
--- a/llvm/test/CodeGen/Thumb2/LowOverheadLoops/fast-fp-loops.ll
+++ b/llvm/test/CodeGen/Thumb2/LowOverheadLoops/fast-fp-loops.ll
@@ -280,160 +280,156 @@ define arm_aapcs_vfpcc float @fast_float_half_mac(ptr nocapture readonly %b, ptr
 ; CHECK-NEXT:    bxeq lr
 ; CHECK-NEXT:  .LBB2_1: @ %vector.ph
 ; CHECK-NEXT:    push {r4, r5, r7, lr}
-; CHECK-NEXT:    vpush {d8, d9, d10, d11, d12, d13}
+; CHECK-NEXT:    vpush {d8, d9}
 ; CHECK-NEXT:    sub sp, #8
 ; CHECK-NEXT:    adds r3, r2, #3
-; CHECK-NEXT:    vmov.i32 q5, #0x0
+; CHECK-NEXT:    vmov.i32 q3, #0x0
 ; CHECK-NEXT:    bic r3, r3, #3
+; CHECK-NEXT:    movs r5, #0
 ; CHECK-NEXT:    sub.w r12, r3, #4
 ; CHECK-NEXT:    movs r3, #1
 ; CHECK-NEXT:    add.w lr, r3, r12, lsr #2
+; CHECK-NEXT:    adr r3, .LCPI2_1
 ; CHECK-NEXT:    sub.w r12, r2, #1
-; CHECK-NEXT:    adr r2, .LCPI2_1
 ; CHECK-NEXT:    mov lr, lr
-; CHECK-NEXT:    vldrw.u32 q0, [r2]
-; CHECK-NEXT:    movs r3, #0
+; CHECK-NEXT:    vldrw.u32 q0, [r3]
 ; CHECK-NEXT:    vdup.32 q1, r12
-; CHECK-NEXT:    vdup.32 q2, r12
 ; CHECK-NEXT:    b .LBB2_3
 ; CHECK-NEXT:  .LBB2_2: @ %else24
 ; CHECK-NEXT:    @ in Loop: Header=BB2_3 Depth=1
-; CHECK-NEXT:    vmul.f16 q5, q6, q5
+; CHECK-NEXT:    vmul.f16 q3, q4, q3
 ; CHECK-NEXT:    adds r0, #8
-; CHECK-NEXT:    vcvtt.f32.f16 s23, s21
-; CHECK-NEXT:    vcvtb.f32.f16 s22, s21
-; CHECK-NEXT:    vcvtt.f32.f16 s21, s20
-; CHECK-NEXT:    vcvtb.f32.f16 s20, s20
+; CHECK-NEXT:    vcvtt.f32.f16 s15, s13
+; CHECK-NEXT:    vcvtb.f32.f16 s14, s13
+; CHECK-NEXT:    vcvtt.f32.f16 s13, s12
+; CHECK-NEXT:    vcvtb.f32.f16 s12, s12
 ; CHECK-NEXT:    adds r1, #8
-; CHECK-NEXT:    adds r3, #4
-; CHECK-NEXT:    vadd.f32 q5, q3, q5
+; CHECK-NEXT:    adds r5, #4
+; CHECK-NEXT:    vadd.f32 q3, q2, q3
 ; CHECK-NEXT:    subs.w lr, lr, #1
 ; CHECK-NEXT:    bne .LBB2_3
 ; CHECK-NEXT:    b .LBB2_19
 ; CHECK-NEXT:  .LBB2_3: @ %vector.body
 ; CHECK-NEXT:    @ =>This Inner Loop Header: Depth=1
-; CHECK-NEXT:    vadd.i32 q4, q0, r3
-; CHECK-NEXT:    vmov q3, q5
-; CHECK-NEXT:    vcmp.u32 cs, q1, q4
-; CHECK-NEXT:    @ implicit-def: $q5
-; CHECK-NEXT:    vmrs r4, p0
-; CHECK-NEXT:    and r2, r4, #1
-; CHECK-NEXT:    rsbs r5, r2, #0
-; CHECK-NEXT:    movs r2, #0
-; CHECK-NEXT:    bfi r2, r5, #0, #1
-; CHECK-NEXT:    ubfx r5, r4, #4, #1
-; CHECK-NEXT:    rsbs r5, r5, #0
-; CHECK-NEXT:    bfi r2, r5, #1, #1
-; CHECK-NEXT:    ubfx r5, r4, #8, #1
-; CHECK-NEXT:    ubfx r4, r4, #12, #1
-; CHECK-NEXT:    rsbs r5, r5, #0
-; CHECK-NEXT:    bfi r2, r5, #2, #1
+; CHECK-NEXT:    vmov q2, q3
+; CHECK-NEXT:    vadd.i32 q3, q0, r5
+; CHECK-NEXT:    vcmp.u32 cs, q1, q3
+; CHECK-NEXT:    @ implicit-def: $q3
+; CHECK-NEXT:    vmrs r2, p0
+; CHECK-NEXT:    and r3, r2, #1
+; CHECK-NEXT:    rsbs r4, r3, #0
+; CHECK-NEXT:    movs r3, #0
+; CHECK-NEXT:    bfi r3, r4, #0, #1
+; CHECK-NEXT:    ubfx r4, r2, #4, #1
 ; CHECK-NEXT:    rsbs r4, r4, #0
-; CHECK-NEXT:    bfi r2, r4, #3, #1
-; CHECK-NEXT:    lsls r4, r2, #31
+; CHECK-NEXT:    bfi r3, r4, #1, #1
+; CHECK-NEXT:    ubfx r4, r2, #8, #1
+; CHECK-NEXT:    ubfx r2, r2, #12, #1
+; CHECK-NEXT:    rsbs r4, r4, #0
+; CHECK-NEXT:    bfi r3, r4, #2, #1
+; CHECK-NEXT:    rsbs r2, r2, #0
+; CHECK-NEXT:    bfi r3, r2, #3, #1
+; CHECK-NEXT:    lsls r2, r3, #31
 ; CHECK-NEXT:    bne .LBB2_12
 ; CHECK-NEXT:  @ %bb.4: @ %else
 ; CHECK-NEXT:    @ in Loop: Header=BB2_3 Depth=1
-; CHECK-NEXT:    lsls r4, r2, #30
+; CHECK-NEXT:    lsls r2, r3, #30
 ; CHECK-NEXT:    bmi .LBB2_13
 ; CHECK-NEXT:  .LBB2_5: @ %else5
 ; CHECK-NEXT:    @ in Loop: Header=BB2_3 Depth=1
-; CHECK-NEXT:    lsls r4, r2, #29
+; CHECK-NEXT:    lsls r2, r3, #29
 ; CHECK-NEXT:    bmi .LBB2_14
 ; CHECK-NEXT:  .LBB2_6: @ %else8
 ; CHECK-NEXT:    @ in Loop: Header=BB2_3 Depth=1
-; CHECK-NEXT:    lsls r2, r2, #28
+; CHECK-NEXT:    lsls r2, r3, #28
 ; CHECK-NEXT:    bpl .LBB2_8
 ; CHECK-NEXT:  .LBB2_7: @ %cond.load10
 ; CHECK-NEXT:    @ in Loop: Header=BB2_3 Depth=1
-; CHECK-NEXT:    vldr.16 s22, [r0, #6]
-; CHECK-NEXT:    vins.f16 s21, s22
+; CHECK-NEXT:    vldr.16 s14, [r0, #6]
+; CHECK-NEXT:    vins.f16 s13, s14
 ; CHECK-NEXT:  .LBB2_8: @ %else11
 ; CHECK-NEXT:    @ in Loop: Header=BB2_3 Depth=1
-; CHECK-NEXT:    vcmp.u32 cs, q2, q4
-; CHECK-NEXT:    @ implicit-def: $q6
-; CHECK-NEXT:    vmrs r4, p0
-; CHECK-NEXT:    and r2, r4, #1
-; CHECK-NEXT:    rsbs r5, r2, #0
-; CHECK-NEXT:    movs r2, #0
-; CHECK-NEXT:    bfi r2, r5, #0, #1
-; CHECK-NEXT:    ubfx r5, r4, #4, #1
-; CHECK-NEXT:    rsbs r5, r5, #0
-; CHECK-NEXT:    bfi r2, r5, #1, #1
-; CHECK-NEXT:    ubfx r5, r4, #8, #1
-; CHECK-NEXT:    ubfx r4, r4, #12, #1
-; CHECK-NEXT:    rsbs r5, r5, #0
-; CHECK-NEXT:    bfi r2, r5, #2, #1
+; CHECK-NEXT:    vmrs r2, p0
+; CHECK-NEXT:    @ implicit-def: $q4
+; CHECK-NEXT:    and r3, r2, #1
+; CHECK-NEXT:    rsbs r4, r3, #0
+; CHECK-NEXT:    movs r3, #0
+; CHECK-NEXT:    bfi r3, r4, #0, #1
+; CHECK-NEXT:    ubfx r4, r2, #4, #1
+; CHECK-NEXT:    rsbs r4, r4, #0
+; CHECK-NEXT:    bfi r3, r4, #1, #1
+; CHECK-NEXT:    ubfx r4, r2, #8, #1
+; CHECK-NEXT:    ubfx r2, r2, #12, #1
 ; CHECK-NEXT:    rsbs r4, r4, #0
-; CHECK-NEXT:    bfi r2, r4, #3, #1
-; CHECK-NEXT:    lsls r4, r2, #31
+; CHECK-NEXT:    bfi r3, r4, #2, #1
+; CHECK-NEXT:    rsbs r2, r2, #0
+; CHECK-NEXT:    bfi r3, r2, #3, #1
+; CHECK-NEXT:    lsls r2, r3, #31
 ; CHECK-NEXT:    bne .LBB2_15
 ; CHECK-NEXT:  @ %bb.9: @ %else15
 ; CHECK-NEXT:    @ in Loop: Header=BB2_3 Depth=1
-; CHECK-NEXT:    lsls r4, r2, #30
+; CHECK-NEXT:    lsls r2, r3, #30
 ; CHECK-NEXT:    bmi .LBB2_16
 ; CHECK-NEXT:  .LBB2_10: @ %else18
 ; CHECK-NEXT:    @ in Loop: Header=BB2_3 Depth=1
-; CHECK-NEXT:    lsls r4, r2, #29
+; CHECK-NEXT:    lsls r2, r3, #29
 ; CHECK-NEXT:    bmi .LBB2_17
 ; CHECK-NEXT:  .LBB2_11: @ %else21
 ; CHECK-NEXT:    @ in Loop: Header=BB2_3 Depth=1
-; CHECK-NEXT:    lsls r2, r2, #28
+; CHECK-NEXT:    lsls r2, r3, #28
 ; CHECK-NEXT:    bpl .LBB2_2
 ; CHECK-NEXT:    b .LBB2_18
 ; CHECK-NEXT:  .LBB2_12: @ %cond.load
 ; CHECK-NEXT:    @ in Loop: Header=BB2_3 Depth=1
-; CHECK-NEXT:    vldr.16 s20, [r0]
-; CHECK-NEXT:    lsls r4, r2, #30
+; CHECK-NEXT:    vldr.16 s12, [r0]
+; CHECK-NEXT:    lsls r2, r3, #30
 ; CHECK-NEXT:    bpl .LBB2_5
 ; CHECK-NEXT:  .LBB2_13: @ %cond.load4
 ; CHECK-NEXT:    @ in Loop: Header=BB2_3 Depth=1
-; CHECK-NEXT:    vldr.16 s22, [r0, #2]
-; CHECK-NEXT:    vins.f16 s20, s22
-; CHECK-NEXT:    lsls r4, r2, #29
+; CHECK-NEXT:    vldr.16 s14, [r0, #2]
+; CHECK-NEXT:    vins.f16 s12, s14
+; CHECK-NEXT:    lsls r2, r3, #29
 ; CHECK-NEXT:    bpl .LBB2_6
 ; CHECK-NEXT:  .LBB2_14: @ %cond.load7
 ; CHECK-NEXT:    @ in Loop: Header=BB2_3 Depth=1
-; CHECK-NEXT:    vldr.16 s21, [r0, #4]
-; CHECK-NEXT:    vmovx.f16 s22, s0
-; CHECK-NEXT:    vins.f16 s21, s22
-; CHECK-NEXT:    lsls r2, r2, #28
+; CHECK-NEXT:    vldr.16 s13, [r0, #4]
+; CHECK-NEXT:    vmovx.f16 s14, s0
+; CHECK-NEXT:    vins.f16 s13, s14
+; CHECK-NEXT:    lsls r2, r3, #28
 ; CHECK-NEXT:    bmi .LBB2_7
 ; CHECK-NEXT:    b .LBB2_8
 ; CHECK-NEXT:  .LBB2_15: @ %cond.load14
 ; CHECK-NEXT:    @ in Loop: Header=BB2_3 Depth=1
-; CHECK-NEXT:    vldr.16 s24, [r1]
-; CHECK-NEXT:    lsls r4, r2, #30
+; CHECK-NEXT:    vldr.16 s16, [r1]
+; CHECK-NEXT:    lsls r2, r3, #30
 ; CHECK-NEXT:    bpl .LBB2_10
 ; CHECK-NEXT:  .LBB2_16: @ %cond.load17
 ; CHECK-NEXT:    @ in Loop: Header=BB2_3 Depth=1
-; CHECK-NEXT:    vldr.16 s26, [r1, #2]
-; CHECK-NEXT:    vins.f16 s24, s26
-; CHECK-NEXT:    lsls r4, r2, #29
+; CHECK-NEXT:    vldr.16 s18, [r1, #2]
+; CHECK-NEXT:    vins.f16 s16, s18
+; CHECK-NEXT:    lsls r2, r3, #29
 ; CHECK-NEXT:    bpl .LBB2_11
 ; CHECK-NEXT:  .LBB2_17: @ %cond.load20
 ; CHECK-NEXT:    @ in Loop: Header=BB2_3 Depth=1
-; CHECK-NEXT:    vldr.16 s25, [r1, #4]
-; CHECK-NEXT:    vmovx.f16 s26, s0
-; CHECK-NEXT:    vins.f16 s25, s26
-; CHECK-NEXT:    lsls r2, r2, #28
+; CHECK-NEXT:    vldr.16 s17, [r1, #4]
+; CHECK-NEXT:    vmovx.f16 s18, s0
+; CHECK-NEXT:    vins.f16 s17, s18
+; CHECK-NEXT:    lsls r2, r3, #28
 ; CHECK-NEXT:    bpl.w .LBB2_2
 ; CHECK-NEXT:  .LBB2_18: @ %cond.load23
 ; CHECK-NEXT:    @ in Loop: Header=BB2_3 Depth=1
-; CHECK-NEXT:    vldr.16 s26, [r1, #6]
-; CHECK-NEXT:    vins.f16 s25, s26
+; CHECK-NEXT:    vldr.16 s18, [r1, #6]
+; CHECK-NEXT:    vins.f16 s17, s18
 ; CHECK-NEXT:    b .LBB2_2
 ; CHECK-NEXT:  .LBB2_19: @ %middle.block
-; CHECK-NEXT:    vdup.32 q0, r12
-; CHECK-NEXT:    vcmp.u32 cs, q0, q4
-; CHECK-NEXT:    vpsel q0, q5, q3
+; CHECK-NEXT:    vpsel q0, q3, q2
 ; CHECK-NEXT:    vmov.f32 s4, s2
 ; CHECK-NEXT:    vmov.f32 s5, s3
 ; CHECK-NEXT:    vadd.f32 q0, q0, q1
 ; CHECK-NEXT:    vmov r0, s1
 ; CHECK-NEXT:    vadd.f32 q0, q0, r0
 ; CHECK-NEXT:    add sp, #8
-; CHECK-NEXT:    vpop {d8, d9, d10, d11, d12, d13}
+; CHECK-NEXT:    vpop {d8, d9}
 ; CHECK-NEXT:    pop {r4, r5, r7, pc}
 ; CHECK-NEXT:    .p2align 4
 ; CHECK-NEXT:  @ %bb.20:
diff --git a/llvm/test/CodeGen/X86/masked_gather.ll b/llvm/test/CodeGen/X86/masked_gather.ll
index 559a7ec0930b99..c3f46da299c2f8 100644
--- a/llvm/test/CodeGen/X86/masked_gather.ll
+++ b/llvm/test/CodeGen/X86/masked_gather.ll
@@ -1137,12 +1137,12 @@ define <16 x i8> @gather_v16i8_v16i32_v16i8(ptr %base, <16 x i32> %idx, <16 x i8
 define <8 x i32> @gather_v8i32_v8i32(<8 x i32> %trigger) {
 ; SSE-LABEL: gather_v8i32_v8i32:
 ; SSE:       # %bb.0:
-; SSE-NEXT:    movdqa %xmm1, %xmm3
 ; SSE-NEXT:    movdqa %xmm0, %xmm2
 ; SSE-NEXT:    pxor %xmm0, %xmm0
 ; SSE-NEXT:    pcmpeqd %xmm0, %xmm1
-; SSE-NEXT:    pcmpeqd %xmm2, %xmm0
-; SSE-NEXT:    packssdw %xmm1, %xmm0
+; SSE-NEXT:    pcmpeqd %xmm0, %xmm2
+; SSE-NEXT:    packssdw %xmm1, %xmm2
+; SSE-NEXT:    movdqa %xmm2, %xmm0
 ; SSE-NEXT:    packsswb %xmm0, %xmm0
 ; SSE-NEXT:    pmovmskb %xmm0, %eax
 ; SSE-NEXT:    testb $1, %al
@@ -1195,17 +1195,13 @@ define <8 x i32> @gather_v8i32_v8i32(<8 x i32> %trigger) {
 ; SSE-NEXT:  .LBB4_17: # %cond.load19
 ; SSE-NEXT:    pinsrd $3, c+12(%rip), %xmm1
 ; SSE-NEXT:  .LBB4_18: # %else20
-; SSE-NEXT:    pxor %xmm4, %xmm4
-; SSE-NEXT:    movdqa %xmm2, %xmm5
-; SSE-NEXT:    pcmpeqd %xmm4, %xmm5
-; SSE-NEXT:    pcmpeqd %xmm3, %xmm4
-; SSE-NEXT:    packssdw %xmm4, %xmm5
-; SSE-NEXT:    packsswb %xmm5, %xmm5
-; SSE-NEXT:    pmovmskb %xmm5, %eax
+; SSE-NEXT:    movdqa %xmm2, %xmm3
+; SSE-NEXT:    packsswb %xmm3, %xmm3
+; SSE-NEXT:    pmovmskb %xmm3, %eax
 ; SSE-NEXT:    testb $1, %al
 ; SSE-NEXT:    je .LBB4_19
 ; SSE-NEXT:  # %bb.20: # %cond.load23
-; SSE-NEXT:    movd {{.*#+}} xmm4 = mem[0],zero,zero,zero
+; SSE-NEXT:    movd {{.*#+}} xmm3 = mem[0],zero,zero,zero
 ; SSE-NEXT:    testb $2, %al
 ; SSE-NEXT:    jne .LBB4_22
 ; SSE-NEXT:    jmp .LBB4_23
@@ -1215,11 +1211,11 @@ define <8 x i32> @gather_v8i32_v8i32(<8 x i32> %trigger) {
 ; SSE-NEXT:    jne .LBB4_17
 ; SSE-NEXT:    jmp .LBB4_18
 ; SSE-NEXT:  .LBB4_19:
-; SSE-NEXT:    # implicit-def: $xmm4
+; SSE-NEXT:    # implicit-def: $xmm3
 ; SSE-NEXT:    testb $2, %al
 ; SSE-NEXT:    je .LBB4_23
 ; SSE-NEXT:  .LBB4_22: # %cond.load28
-; SSE-NEXT:    pinsrd $1, c+28(%rip), %xmm4
+; SSE-NEXT:    pinsrd $1, c+28(%rip), %xmm3
 ; SSE-NEXT:  .LBB4_23: # %else31
 ; SSE-NEXT:    testb $4, %al
 ; SSE-NEXT:    jne .LBB4_24
@@ -1230,24 +1226,24 @@ define <8 x i32> @gather_v8i32_v8i32(<8 x i32> %trigger) {
 ; SSE-NEXT:    testb $16, %al
 ; SSE-NEXT:    je .LBB4_28
 ; SSE-NEXT:  .LBB4_29: # %cond.load43
-; SSE-NEXT:    pinsrd $0, c+28(%rip), %xmm5
+; SSE-NEXT:    pinsrd $0, c+28(%rip), %xmm4
 ; SSE-NEXT:    testb $32, %al
 ; SSE-NEXT:    jne .LBB4_31
 ; SSE-NEXT:    jmp .LBB4_32
 ; SSE-NEXT:  .LBB4_24: # %cond.load33
-; SSE-NEXT:    pinsrd $2, c+28(%rip), %xmm4
+; SSE-NEXT:    pinsrd $2, c+28(%rip), %xmm3
 ; SSE-NEXT:    testb $8, %al
 ; SSE-NEXT:    je .LBB4_27
 ; SSE-NEXT:  .LBB4_26: # %cond.load38
-; SSE-NEXT:    pinsrd $3, c+28(%rip), %xmm4
+; SSE-NEXT:    pinsrd $3, c+28(%rip), %xmm3
 ; SSE-NEXT:    testb $16, %al
 ; SSE-NEXT:    jne .LBB4_29
 ; SSE-NEXT:  .LBB4_28:
-; SSE-NEXT:    # implicit-def: $xmm5
+; SSE-NEXT:    # implicit-def: $xmm4
 ; SSE-NEXT:    testb $32, %al
 ; SSE-NEXT:    je .LBB4_32
 ; SSE-NEXT:  .LBB4_31: # %cond.load48
-; SSE-NEXT:    pinsrd $1, c+28(%rip), %xmm5
+; SSE-NEXT:    pinsrd $1, c+28(%rip), %xmm4
 ; SSE-NEXT:  .LBB4_32: # %else51
 ; SSE-NEXT:    testb $64, %al
 ; SSE-NEXT:    jne .LBB4_33
@@ -1255,12 +1251,8 @@ define <8 x i32> @gather_v8i32_v8i32(<8 x i32> %trigger) {
 ; SSE-NEXT:    testb $-128, %al
 ; SSE-NEXT:    je .LBB4_36
 ; SSE-NEXT:  .LBB4_35: # %cond.load58
-; SSE-NEXT:    pinsrd $3, c+28(%rip), %xmm5
+; SSE-NEXT:    pinsrd $3, c+28(%rip), %xmm4
 ; SSE-NEXT:  .LBB4_36: # %else61
-; SSE-NEXT:    pxor %xmm6, %xmm6
-; SSE-NEXT:    pcmpeqd %xmm6, %xmm2
-; SSE-NEXT:    pcmpeqd %xmm6, %xmm3
-; SSE-NEXT:    packssdw %xmm3, %xmm2
 ; SSE-NEXT:    packsswb %xmm2, %xmm2
 ; SSE-NEXT:    ...
[truncated]

  SmallSet<User *, 4> Users;
  for (auto *U : Cmp->users()) {
    Instruction *User = cast<Instruction>(U);
    if (isa<PHINode>(User))
Contributor

Maybe keep some full-user iteration below and also skip pseudo instructions that will be dropped during codegen here?

Contributor Author

Hi, I've reverted these changes and gone with a TLI hook instead.

@goldsteinn
Contributor

Alternatively, I could also introduce a TLI hook to query the
target if this is a preferred solution?

This seems pretty similar to existing shouldNormalizeToSelectSequence or hasMultipleConditionRegisters.
Maybe shouldNormalizeToSelectSequence should become shouldSinkComparisons more generally (whether it be into selects or BBs).

@davemgreen
Collaborator

MVE has only a single vector predicate register available, so it might be better to do the sinking. Could you add a test case to check if it looks better?

@david-arm
Contributor Author

david-arm commented Oct 24, 2024

MVE has only a single vector predicate register available, so it might be better to do the sinking. Could you add a test case to check if it looks better?

That's a fair point. I thought this was being tested by existing tests in llvm/test/CodeGen/Thumb2/LowOverheadLoops/fast-fp-loops.ll. I took a look at the results and they seemed to be around 1-2 fewer instructions overall, but that doesn't mean it's better of course. :) After chatting privately with @paulwalker-arm, perhaps it's better to introduce a hasSingleVectorComparisonRegister or something like that? For NEON, SVE and presumably AVX this returns a number greater than one. For other targets where the hook returns 1 we sink the comparison. I'll play around with this and see where it leads ...

This at least would follow consistent logic for the scalar version.

@david-arm
Contributor Author

Hi @davemgreen, I will also add a proper test for MVE similar to the one I added because I realised the existing Thumb2 test changes are for tail-folded loops, which is probably a bit different.

@davemgreen
Collaborator

Yeah thanks. A lot of the tests look like expanded masked load/store, which are going to be a bit different to more normal codegen due to the scalarization. At least in some cases it might be better to scalarize the compare. Hopefully they shouldn't come up much in practice though.

@david-arm
Contributor Author

Alternatively, I could also introduce a TLI hook to query the
target if this is a preferred solution?

This seems pretty similar to existing shouldNormalizeToSelectSequence or hasMultipleConditionRegisters. Maybe shouldNormalizeToSelectSequence should become shouldSinkComparisons more generally (whether it be into selects or BBs).

Hmm, it looks like shouldNormalizeToSelectSequence is a virtual function overridden by AArch64 and is used by the DAGCombiner for a different problem. It looks like changing this function would be beyond the scope of this patch, however I am happy to investigate this as a follow-on if that's ok? Is the idea to avoid performing the DAG combine too if for vector comparisons the target has multiple vector predicate registers?

@goldsteinn
Contributor

Alternatively, I could also introduce a TLI hook to query the
target if this is a preferred solution?

This seems pretty similar to existing shouldNormalizeToSelectSequence or hasMultipleConditionRegisters. Maybe shouldNormalizeToSelectSequence should become shouldSinkComparisons more generally (whether it be into selects or BBs).

Hmm, it looks like shouldNormalizeToSelectSequence is a virtual function overridden by AArch64 and is used by the DAGCombiner for a different problem. It looks like changing this function would be beyond the scope of this patch, however I am happy to investigate this as a follow-on if that's ok? Is the idea to avoid performing the DAG combine too if for vector comparisons the target has multiple vector predicate registers?

Sure, more so what I meant was it seems we essentially have 2 existing flags that are related to avoiding transforms targeting single-predicate cmp ISAs. If you do choose to add a flag, it probably should co-opt or replace one of those.

  /// Return true if multiple vector predicate registers are available.
  bool hasMultipleVectorPredicateRegisters() const {
    return HasMultipleVectorPredicateRegisters;
  }
Contributor

Imo this should just take a type and be hasMultiplePredicateRegister(Type * Ty)

Contributor Author

OK, I've added an extra EVT argument because if I understood you correctly from your other comment you'd prefer me to have a single function called from both shouldNormalizeToSelectSequence and sinkCmpExpression. In this case shouldNormalizeToSelectSequence is in DAGCombiner and works with EVTs rather than Types.

@@ -2389,7 +2389,7 @@ class TargetLoweringBase {
                                                EVT VT) const {
     // If a target has multiple condition registers, then it likely has logical
     // operations on those registers.
-    if (hasMultipleConditionRegisters())
+    if (hasMultiplePredicateRegisters(VT))
Contributor

From looking at the use of this hook in DAGCombiner::visitSELECT, I think VT here is the type of X and Y. But hasMultiplePredicateRegisters should be passed the type of the predicate, which is hard coded to MVT::i1. (Please double check this since I find it quite confusing!)

Contributor

Of course it would help if the comment on this function said what VT is supposed to be.

Contributor Author

Ah yes, I think you might be right! I'd just assumed (probably wrongly) that VT was the result type. I've fixed this now.
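
For reference, a minimal sketch of what the corrected default might look like, assuming the fix is simply to query the new hook with the predicate type (i1 for the scalar selects handled here) rather than with VT; the exact code in the updated patch may differ:

  virtual bool shouldNormalizeToSelectSequence(LLVMContext &Context,
                                               EVT VT) const {
    // The compare feeding the select condition produces an i1, so that is the
    // type whose register class matters here, not VT (the type being selected).
    if (hasMultiplePredicateRegisters(MVT::i1))
      return false;
    // Only do the transform if the value won't be split into multiple
    // registers.
    LegalizeTypeAction Action = getTypeAction(Context, VT);
    return Action != TypeExpandInteger && Action != TypeExpandFloat &&
           Action != TypeSplitVector;
  }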

@@ -387,6 +387,16 @@ class AMDGPUTargetLowering : public TargetLowering {
  MVT getFenceOperandTy(const DataLayout &DL) const override {
    return MVT::i32;
  }

  virtual bool hasMultiplePredicateRegisters(EVT VT) const override {
    // FIXME: This is only partially true. If we have to do vector compares,
Contributor

AMDGPU part looks fine. In future we will probably want to address this FIXME by passing in a node, not just a type.

Collaborator

@davemgreen davemgreen left a comment

Can you add these two tests too: https://godbolt.org/z/G1Ma7Gd9v. They are simpler and don't involve scalarization of the predicate (which will hopefully be rare).

@@ -1358,6 +1358,10 @@ class AArch64TargetLowering : public TargetLowering {
unsigned getMinimumJumpTableEntries() const override;

bool softPromoteHalfType() const override { return true; }

virtual bool hasMultiplePredicateRegisters(EVT VT) const {
Collaborator

+override

vector.body:
  %index = phi i64 [ 0, %entry ], [ %index.next, %pred.store.continue18 ]
  %vec.ind = phi <4 x i64> [ <i64 0, i64 1, i64 2, i64 3>, %entry ], [ %vec.ind.next, %pred.store.continue18 ]
  %0 = icmp ult <4 x i64> %vec.ind, <i64 15, i64 15, i64 15, i64 15>
Collaborator

Can you make this 4 x i32, so that it is a legal type under MVE.

Contributor Author

Sure, although sadly this is genuinely what the loop vectoriser kicks out, which is why I wrote the test this way. It would be good to fix the vectoriser to choose a more sensible vector induction variable when you know the max range of possible element values.

Collaborator

Do you have a C example? I'm surprised we would try and generate this for MVE; I would have expected the vector to be smaller and i64s to be fairly high cost. Maybe the scalars are too high too.

Contributor Author

Oh of course you're right. I think I took the output from the vectoriser for AArch64 and then tried to recreate an MVE test from it. If I compile

void foo(int *dest) {
  for (int i = 0; i < 14; i++) {
    dest[i] = 3;
  }
}

with

clang -O3 -fno-unroll-loops --target=thumb-linux-gnueabi -mcpu=cortex-m55 -mllvm -tail-predication=disabled -S -emit-llvm ./lowtrip.c

then I see

vector.body:                                      ; preds = %vector.body, %entry
  %index = phi i32 [ 0, %entry ], [ %index.next, %vector.body ]
  %broadcast.splatinsert = insertelement <4 x i32> poison, i32 %index, i64 0
  %broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> poison, <4 x i32> zeroinitializer
  %vec.iv = or disjoint <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
  %0 = icmp ult <4 x i32> %vec.iv, <i32 14, i32 14, i32 14, i32 14>
  %1 = getelementptr inbounds i32, ptr %dest, i32 %index
  tail call void @llvm.masked.store.v4i32.p0(<4 x i32> <i32 3, i32 3, i32 3, i32 3>, ptr %1, i32 4, <4 x i1> %0), !tbaa !5
  %index.next = add i32 %index, 4
  %2 = icmp eq i32 %index.next, 16
  br i1 %2, label %for.cond.cleanup, label %vector.body, !llvm.loop !9

I think a better example where it goes wrong is:

void foo(short *dest) {
  for (int i = 0; i < 14; i++) {
    dest[i] = 3;
  }
}

where the vectoriser output is:

vector.body:                                      ; preds = %vector.body, %entry
  %index = phi i32 [ 0, %entry ], [ %index.next, %vector.body ]
  %broadcast.splatinsert = insertelement <8 x i32> poison, i32 %index, i64 0
  %broadcast.splat = shufflevector <8 x i32> %broadcast.splatinsert, <8 x i32> poison, <8 x i32> zeroinitializer
  %vec.iv = or disjoint <8 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  %0 = icmp ult <8 x i32> %vec.iv, <i32 14, i32 14, i32 14, i32 14, i32 14, i32 14, i32 14, i32 14>
  %1 = getelementptr inbounds i16, ptr %dest, i32 %index
  tail call void @llvm.masked.store.v8i16.p0(<8 x i16> <i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3>, ptr %1, i32 2, <8 x i1> %0), !tbaa !5
  %index.next = add i32 %index, 8
  %2 = icmp eq i32 %index.next, 16
  br i1 %2, label %for.cond.cleanup, label %vector.body, !llvm.loop !9

and this time the <8 x i32> isn't ideal - we could in fact have chosen <8 x i16>. Anyway, for the purpose of this particular PR it doesn't matter I suppose. I'll amend the test to use i32!

@@ -0,0 +1,136 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
; RUN: llc -mtriple=thumbv8.1m.main -mattr=+mve.fp,+fp-armv8d16sp,+fp16,+fullfp16 < %s | FileCheck %s
Collaborator

-mattr=+mve I think should be enough

@david-arm
Contributor Author

Can you add these two tests too: https://godbolt.org/z/G1Ma7Gd9v. They are simpler and don't involve scalarization of the predicate (which will hopefully be rare).

Damn, just realised I missed this. I'll take a look now.

Whilst reviewing PR llvm#109289 and doing some analysis with various
tests involving predicated blocks I noticed that we're making
codegen and performance worse by sinking vector comparisons
multiple times into blocks. It looks like the sinkCmpExpression
in CodeGenPrepare was written for scalar comparisons where there
is only a single condition register, whereas vector comparisons
typically produce a vector result. For some targets, such as NEON
or SVE, there are multiple allocatable vector registers that can
store the result and so we should avoid sinking in that case.
@david-arm
Contributor Author

Added new tests

Collaborator

@davemgreen davemgreen left a comment

Sorry for the delay, I was going through things in a strange order. I had previously only checked the test on AArch64. Do we need to worry about the extra xtn/sshll this introduces? I was going to suggest we just remove them post-isel, but that might require getting the top bits correct, which requires a lot of infrastructure that currently doesn't exist.

@david-arm
Contributor Author

Sorry for the delay, I was going through things in a strange order. I had previously only checked the test on AArch64. Do we need to worry about the extra xtn/sshll this introduces? I was going to suggest we just remove them post-isel, but that might require getting the top bits correct, which requires a lot of infrastructure that currently doesn't exist.

Do you mean for the no_sink_simple case? So the typical use case that we see here is when vectorising a loop with a condition and having predicated blocks that depend upon each element of the result. So when there are 4 or more uses to sink (e.g. vector_loop_with_icmp) this definitely gives some significant performance gains, which was the motivation for this patch. You could be right that this isn't worth it when there are only two uses, but I'm not sure how to account for that. If possible I'd prefer to keep the hasMultiplePredicateRegisters function as it is because I think it's more useful as a standalone hook rather than being tied to the specifics of sinking a comparison instruction. It looks like sinkCmpExpression was never originally intended for sinking vector instructions, and presumably this just started happening by accident for vectors without necessarily doing a cost-benefit analysis. The only clean way I can think of to work around this is by introducing a brand new TLI hook, called shouldConsiderSinkingCmp, that checks if the number of uses of a fixed-width vector is > 2, e.g.

static bool sinkCmpExpression(const DataLayout &DL, CmpInst *Cmp,
                              const TargetLowering &TLI) {
  EVT ResVT = TLI.getValueType(DL, Cmp->getType());
  if (!TLI.shouldConsiderSinkingCmp(Cmp) || TLI.hasMultiplePredicateRegisters(ResVT))
    return false;

  // Avoid sinking soft-FP comparisons, since this can move them into a loop.
  if (TLI.useSoftFloat() && isa<FCmpInst>(Cmp))
    return false;

where shouldConsiderSinkingCmp returns true in the default case and for AArch64 does something like this:

bool shouldConsiderSinkingCmp(CmpInst *Cmp) {
  auto *VecTy = dyn_cast<FixedVectorType>(Cmp->getType());
  return !VecTy || Cmp->getNumUses() > 2;
}

What do you think? Or it could go one step further and look at the users to see if they are extractelement or not, i.e. return false if and only if it's a 2-element vector where each user is an extract?
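
A literal translation of that last idea might look like the following (purely hypothetical sketch, never part of the patch): refuse to sink only when the compare is a 2-element fixed vector and every user is an extractelement.

  bool shouldConsiderSinkingCmp(CmpInst *Cmp) const {
    auto *VecTy = dyn_cast<FixedVectorType>(Cmp->getType());
    if (!VecTy || VecTy->getNumElements() != 2)
      return true;
    // Consider sinking only if some user is not a plain extract of the result.
    return !llvm::all_of(Cmp->users(), [](const User *U) {
      return isa<ExtractElementInst>(U);
    });
  }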

@davemgreen
Collaborator

I believe it is less about the number of uses and more about the type legalization of the uses. v4i1 gets legalized to v4i16, which will insert xtn and sshll if the original instructions were v4i32. The cross-basic-block nature makes it difficult for the DAG to deal with. If the uses are extracts then the problem is lessened, as extracts can always use the type they see.

I gave it a test though and it seemed OK, so LGTM. We might want to look at removing the legalization artefacts too.

Collaborator

@paulwalker-arm paulwalker-arm left a comment

Hi @david-arm, sorry for the late review but when looking at the tests I think I've misunderstood our original conversation. While hasMultipleConditionRegisters seems like an overly specific question for the affected code, on reflection I'm not sure that matters.

Looking at vector_loop_with_icmp I can see that not sinking the compare improves the output, but isn't the original output only bad because the code generator is missing an obvious extractelt (setcc X, C), IndexC => setcc (extractelt X, IndexC), C combine, which would be better all-round?
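
As a rough IR-level illustration of that fold (made-up values chosen to match the test, not code from any patch): instead of comparing the whole vector and then extracting one i1 lane, the lane would be extracted first and compared as a scalar.

; Before: whole-vector compare, then extract a single predicate lane.
  %c = icmp ult <4 x i64> %vec.ind, <i64 15, i64 15, i64 15, i64 15>
  %bit = extractelement <4 x i1> %c, i64 2

; After the suggested combine: extract the lane, then do a scalar compare.
  %lane = extractelement <4 x i64> %vec.ind, i64 2
  %bit = icmp ult i64 %lane, 15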

@david-arm
Contributor Author

Hi @david-arm, sorry for the late review but when looking at the tests I think I've misunderstood our original conversation. While hasMultipleConditionRegisters seems like an overly specific question for the affected code, on reflection I'm not sure that matters.

Looking at vector_loop_with_icmp I can see that not sinking the compare improves the output, but isn't the original output only bad because the code generator is missing an obvious extractelt (setcc X, C), IndexC => setcc (extractelt X, IndexC), C combine, which would be better all-round?

I'm happy to have a look at your suggestion and I can believe it helps, but I'm not sure that in general it's a good idea to be sinking unless you can prove the vector comparison will disappear? Anyway, I can try the codegen first and see if this patch still helps on top of it.

@david-arm
Contributor Author

I've followed the suggestion by @paulwalker-arm and I will abandon this patch in favour of #116031

@david-arm david-arm closed this Dec 12, 2024
@david-arm david-arm deleted the no_sink_vcmps branch January 28, 2025 11:48