
[CodeGen] Avoid sinking vector comparisons during CodeGenPrepare #113158

Closed
wants to merge 2 commits into from

Conversation

david-arm
Contributor

@david-arm david-arm commented Oct 21, 2024

[CodeGen] Avoid sinking vector comparisons during CodeGenPrepare

Whilst reviewing PR #109289 and doing some analysis with various
tests involving predicated blocks I noticed that we're making
codegen and performance worse by sinking vector comparisons
multiple times into blocks. It looks like the sinkCmpExpression
in CodeGenPrepare was written for scalar comparisons where there
is only a single condition register, whereas vector comparisons
typically produce a vector result. For some targets, such as NEON
or SVE, there are multiple allocatable vector registers that can
store the result and so we should avoid sinking in that case.
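
To make the problem concrete, here is a rough IR-level sketch (a made-up example, not taken from the patch's tests) of what sinkCmpExpression does: a single vector compare whose lanes are tested in later predicated blocks gets re-materialised in every block that uses it, so the vector comparison ends up being executed several times.

; Before sinking: one vector compare, with lanes extracted in later blocks.
vector.body:
  %c = icmp ult <4 x i32> %vec.iv, <i32 14, i32 14, i32 14, i32 14>
  ...

pred.store.continue:
  %lane1 = extractelement <4 x i1> %c, i64 1
  br i1 %lane1, label %pred.store.if1, label %pred.store.continue2

; After sinking (sketch): the compare is duplicated into each user block.
pred.store.continue:
  %c.sunk = icmp ult <4 x i32> %vec.iv, <i32 14, i32 14, i32 14, i32 14>
  %lane1 = extractelement <4 x i1> %c.sunk, i64 1
  br i1 %lane1, label %pred.store.if1, label %pred.store.continue2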

@llvmbot
Member

llvmbot commented Oct 21, 2024

@llvm/pr-subscribers-backend-amdgpu
@llvm/pr-subscribers-backend-powerpc
@llvm/pr-subscribers-backend-x86

@llvm/pr-subscribers-backend-aarch64

Author: David Sherwood (david-arm)

Changes

Whilst reviewing PR #109289 and doing some analysis with various
tests involving predicated blocks I noticed that we're making
codegen and performance worse by sinking vector comparisons
multiple times into blocks. It looks like the sinkCmpExpression
in CodeGenPrepare was written for scalar comparisons where there
is only a single condition register, whereas vector comparisons
typically produce a vector result and register pressure is much
lower. Given they are likely to be more expensive than scalar
comparisons, it makes sense to avoid sinking too many. The
CodeGen/SystemZ/vec-perm-14.ll test does rely upon sinking a
vector comparison so I've kept that behaviour by permitting one
sink.

Alternatively, I could also introduce a TLI hook to query the
target if this is a preferred solution?


Patch is 42.92 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/113158.diff

4 Files Affected:

  • (modified) llvm/lib/CodeGen/CodeGenPrepare.cpp (+32-21)
  • (added) llvm/test/CodeGen/AArch64/no-sink-vector-cmp.ll (+123)
  • (modified) llvm/test/CodeGen/Thumb2/LowOverheadLoops/fast-fp-loops.ll (+75-79)
  • (modified) llvm/test/CodeGen/X86/masked_gather.ll (+150-169)
diff --git a/llvm/lib/CodeGen/CodeGenPrepare.cpp b/llvm/lib/CodeGen/CodeGenPrepare.cpp
index 86f28293ba9ff8..80ace20bfc67ab 100644
--- a/llvm/lib/CodeGen/CodeGenPrepare.cpp
+++ b/llvm/lib/CodeGen/CodeGenPrepare.cpp
@@ -1779,29 +1779,35 @@ static bool sinkCmpExpression(CmpInst *Cmp, const TargetLowering &TLI) {
   if (TLI.useSoftFloat() && isa<FCmpInst>(Cmp))
     return false;
 
-  // Only insert a cmp in each block once.
-  DenseMap<BasicBlock *, CmpInst *> InsertedCmps;
+  // Collect a list of non-phis users that are in blocks that are different to
+  // the definition block.
+  BasicBlock *DefBB = Cmp->getParent();
+  SmallSet<User *, 4> Users;
+  for (auto *U : Cmp->users()) {
+    Instruction *User = cast<Instruction>(U);
+    if (isa<PHINode>(User))
+      continue;
 
-  bool MadeChange = false;
-  for (Value::user_iterator UI = Cmp->user_begin(), E = Cmp->user_end();
-       UI != E;) {
-    Use &TheUse = UI.getUse();
-    Instruction *User = cast<Instruction>(*UI);
+    if (User->getParent() == DefBB)
+      continue;
 
-    // Preincrement use iterator so we don't invalidate it.
-    ++UI;
+    Users.insert(User);
+  }
 
-    // Don't bother for PHI nodes.
-    if (isa<PHINode>(User))
-      continue;
+  // If this is a vector comparison the result will likely not depend upon
+  // setting a condition register, and it's probably too expensive to sink too
+  // many times.
+  VectorType *VecType = dyn_cast<VectorType>(Cmp->getType());
+  if (VecType && VecType->getElementCount().isVector() && Users.size() > 1)
+    return false;
 
-    // Figure out which BB this cmp is used in.
-    BasicBlock *UserBB = User->getParent();
-    BasicBlock *DefBB = Cmp->getParent();
+  // Only insert a cmp in each block once.
+  DenseMap<BasicBlock *, CmpInst *> InsertedCmps;
 
-    // If this user is in the same block as the cmp, don't change the cmp.
-    if (UserBB == DefBB)
-      continue;
+  bool MadeChange = false;
+  for (auto *U : Users) {
+    Instruction *UI = cast<Instruction>(U);
+    BasicBlock *UserBB = UI->getParent();
 
     // If we have already inserted a cmp into this block, use it.
     CmpInst *&InsertedCmp = InsertedCmps[UserBB];
@@ -1816,10 +1822,15 @@ static bool sinkCmpExpression(CmpInst *Cmp, const TargetLowering &TLI) {
       InsertedCmp->setDebugLoc(Cmp->getDebugLoc());
     }
 
-    // Replace a use of the cmp with a use of the new cmp.
-    TheUse = InsertedCmp;
+    // Replace all uses of the cmp with a use of the new cmp and update the
+    // number of uses.
+    for (unsigned I = 0; I < U->getNumOperands(); I++)
+      if (U->getOperand(I) == Cmp) {
+        U->setOperand(I, InsertedCmp);
+        NumCmpUses++;
+      }
+
     MadeChange = true;
-    ++NumCmpUses;
   }
 
   // If we removed all uses, nuke the cmp.
diff --git a/llvm/test/CodeGen/AArch64/no-sink-vector-cmp.ll b/llvm/test/CodeGen/AArch64/no-sink-vector-cmp.ll
new file mode 100644
index 00000000000000..93879d41a25432
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/no-sink-vector-cmp.ll
@@ -0,0 +1,123 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc < %s | FileCheck %s
+
+target triple = "aarch64-unknown-linux-gnu"
+
+define void @vector_loop_with_icmp(ptr nocapture noundef writeonly %dest) {
+; CHECK-LABEL: vector_loop_with_icmp:
+; CHECK:       // %bb.0: // %entry
+; CHECK-NEXT:    mov w9, #15 // =0xf
+; CHECK-NEXT:    mov w10, #4 // =0x4
+; CHECK-NEXT:    adrp x8, .LCPI0_0
+; CHECK-NEXT:    adrp x11, .LCPI0_1
+; CHECK-NEXT:    dup v0.2d, x9
+; CHECK-NEXT:    dup v1.2d, x10
+; CHECK-NEXT:    ldr q2, [x8, :lo12:.LCPI0_0]
+; CHECK-NEXT:    ldr q3, [x11, :lo12:.LCPI0_1]
+; CHECK-NEXT:    add x8, x0, #8
+; CHECK-NEXT:    mov w9, #16 // =0x10
+; CHECK-NEXT:    mov w10, #1 // =0x1
+; CHECK-NEXT:    b .LBB0_2
+; CHECK-NEXT:  .LBB0_1: // %pred.store.continue18
+; CHECK-NEXT:    // in Loop: Header=BB0_2 Depth=1
+; CHECK-NEXT:    add v2.2d, v2.2d, v1.2d
+; CHECK-NEXT:    add v3.2d, v3.2d, v1.2d
+; CHECK-NEXT:    subs x9, x9, #4
+; CHECK-NEXT:    add x8, x8, #16
+; CHECK-NEXT:    b.eq .LBB0_10
+; CHECK-NEXT:  .LBB0_2: // %vector.body
+; CHECK-NEXT:    // =>This Inner Loop Header: Depth=1
+; CHECK-NEXT:    cmhi v4.2d, v0.2d, v2.2d
+; CHECK-NEXT:    cmhi v5.2d, v0.2d, v3.2d
+; CHECK-NEXT:    uzp1 v4.4s, v5.4s, v4.4s
+; CHECK-NEXT:    xtn v4.4h, v4.4s
+; CHECK-NEXT:    umov w11, v4.h[0]
+; CHECK-NEXT:    tbnz w11, #0, .LBB0_6
+; CHECK-NEXT:  // %bb.3: // %pred.store.continue
+; CHECK-NEXT:    // in Loop: Header=BB0_2 Depth=1
+; CHECK-NEXT:    umov w11, v4.h[1]
+; CHECK-NEXT:    tbnz w11, #0, .LBB0_7
+; CHECK-NEXT:  .LBB0_4: // %pred.store.continue6
+; CHECK-NEXT:    // in Loop: Header=BB0_2 Depth=1
+; CHECK-NEXT:    umov w11, v4.h[2]
+; CHECK-NEXT:    tbnz w11, #0, .LBB0_8
+; CHECK-NEXT:  .LBB0_5: // %pred.store.continue8
+; CHECK-NEXT:    // in Loop: Header=BB0_2 Depth=1
+; CHECK-NEXT:    umov w11, v4.h[3]
+; CHECK-NEXT:    tbz w11, #0, .LBB0_1
+; CHECK-NEXT:    b .LBB0_9
+; CHECK-NEXT:  .LBB0_6: // %pred.store.if
+; CHECK-NEXT:    // in Loop: Header=BB0_2 Depth=1
+; CHECK-NEXT:    stur w10, [x8, #-8]
+; CHECK-NEXT:    umov w11, v4.h[1]
+; CHECK-NEXT:    tbz w11, #0, .LBB0_4
+; CHECK-NEXT:  .LBB0_7: // %pred.store.if5
+; CHECK-NEXT:    // in Loop: Header=BB0_2 Depth=1
+; CHECK-NEXT:    stur w10, [x8, #-4]
+; CHECK-NEXT:    umov w11, v4.h[2]
+; CHECK-NEXT:    tbz w11, #0, .LBB0_5
+; CHECK-NEXT:  .LBB0_8: // %pred.store.if7
+; CHECK-NEXT:    // in Loop: Header=BB0_2 Depth=1
+; CHECK-NEXT:    str w10, [x8]
+; CHECK-NEXT:    umov w11, v4.h[3]
+; CHECK-NEXT:    tbz w11, #0, .LBB0_1
+; CHECK-NEXT:  .LBB0_9: // %pred.store.if9
+; CHECK-NEXT:    // in Loop: Header=BB0_2 Depth=1
+; CHECK-NEXT:    str w10, [x8, #4]
+; CHECK-NEXT:    b .LBB0_1
+; CHECK-NEXT:  .LBB0_10: // %for.cond.cleanup
+; CHECK-NEXT:    ret
+entry:
+  br label %vector.body
+
+vector.body:
+  %index = phi i64 [ 0, %entry ], [ %index.next, %pred.store.continue18 ]
+  %vec.ind = phi <4 x i64> [ <i64 0, i64 1, i64 2, i64 3>, %entry ], [ %vec.ind.next, %pred.store.continue18 ]
+  %0 = icmp ult <4 x i64> %vec.ind, <i64 15, i64 15, i64 15, i64 15>
+  %1 = extractelement <4 x i1> %0, i64 0
+  br i1 %1, label %pred.store.if, label %pred.store.continue
+
+pred.store.if:
+  %2 = getelementptr inbounds i32, ptr %dest, i64 %index
+  store i32 1, ptr %2, align 4
+  br label %pred.store.continue
+
+pred.store.continue:
+  %3 = extractelement <4 x i1> %0, i64 1
+  br i1 %3, label %pred.store.if5, label %pred.store.continue6
+
+pred.store.if5:
+  %4 = or disjoint i64 %index, 1
+  %5 = getelementptr inbounds i32, ptr %dest, i64 %4
+  store i32 1, ptr %5, align 4
+  br label %pred.store.continue6
+
+pred.store.continue6:
+  %6 = extractelement <4 x i1> %0, i64 2
+  br i1 %6, label %pred.store.if7, label %pred.store.continue8
+
+pred.store.if7:
+  %7 = or disjoint i64 %index, 2
+  %8 = getelementptr inbounds i32, ptr %dest, i64 %7
+  store i32 1, ptr %8, align 4
+  br label %pred.store.continue8
+
+pred.store.continue8:
+  %9 = extractelement <4 x i1> %0, i64 3
+  br i1 %9, label %pred.store.if9, label %pred.store.continue18
+
+pred.store.if9:
+  %10 = or disjoint i64 %index, 3
+  %11 = getelementptr inbounds i32, ptr %dest, i64 %10
+  store i32 1, ptr %11, align 4
+  br label %pred.store.continue18
+
+pred.store.continue18:
+  %index.next = add i64 %index, 4
+  %vec.ind.next = add <4 x i64> %vec.ind, <i64 4, i64 4, i64 4, i64 4>
+  %24 = icmp eq i64 %index.next, 16
+  br i1 %24, label %for.cond.cleanup, label %vector.body
+
+for.cond.cleanup:
+  ret void
+}
diff --git a/llvm/test/CodeGen/Thumb2/LowOverheadLoops/fast-fp-loops.ll b/llvm/test/CodeGen/Thumb2/LowOverheadLoops/fast-fp-loops.ll
index 2fdf534d526565..b9f1e2d21674f3 100644
--- a/llvm/test/CodeGen/Thumb2/LowOverheadLoops/fast-fp-loops.ll
+++ b/llvm/test/CodeGen/Thumb2/LowOverheadLoops/fast-fp-loops.ll
@@ -280,160 +280,156 @@ define arm_aapcs_vfpcc float @fast_float_half_mac(ptr nocapture readonly %b, ptr
 ; CHECK-NEXT:    bxeq lr
 ; CHECK-NEXT:  .LBB2_1: @ %vector.ph
 ; CHECK-NEXT:    push {r4, r5, r7, lr}
-; CHECK-NEXT:    vpush {d8, d9, d10, d11, d12, d13}
+; CHECK-NEXT:    vpush {d8, d9}
 ; CHECK-NEXT:    sub sp, #8
 ; CHECK-NEXT:    adds r3, r2, #3
-; CHECK-NEXT:    vmov.i32 q5, #0x0
+; CHECK-NEXT:    vmov.i32 q3, #0x0
 ; CHECK-NEXT:    bic r3, r3, #3
+; CHECK-NEXT:    movs r5, #0
 ; CHECK-NEXT:    sub.w r12, r3, #4
 ; CHECK-NEXT:    movs r3, #1
 ; CHECK-NEXT:    add.w lr, r3, r12, lsr #2
+; CHECK-NEXT:    adr r3, .LCPI2_1
 ; CHECK-NEXT:    sub.w r12, r2, #1
-; CHECK-NEXT:    adr r2, .LCPI2_1
 ; CHECK-NEXT:    mov lr, lr
-; CHECK-NEXT:    vldrw.u32 q0, [r2]
-; CHECK-NEXT:    movs r3, #0
+; CHECK-NEXT:    vldrw.u32 q0, [r3]
 ; CHECK-NEXT:    vdup.32 q1, r12
-; CHECK-NEXT:    vdup.32 q2, r12
 ; CHECK-NEXT:    b .LBB2_3
 ; CHECK-NEXT:  .LBB2_2: @ %else24
 ; CHECK-NEXT:    @ in Loop: Header=BB2_3 Depth=1
-; CHECK-NEXT:    vmul.f16 q5, q6, q5
+; CHECK-NEXT:    vmul.f16 q3, q4, q3
 ; CHECK-NEXT:    adds r0, #8
-; CHECK-NEXT:    vcvtt.f32.f16 s23, s21
-; CHECK-NEXT:    vcvtb.f32.f16 s22, s21
-; CHECK-NEXT:    vcvtt.f32.f16 s21, s20
-; CHECK-NEXT:    vcvtb.f32.f16 s20, s20
+; CHECK-NEXT:    vcvtt.f32.f16 s15, s13
+; CHECK-NEXT:    vcvtb.f32.f16 s14, s13
+; CHECK-NEXT:    vcvtt.f32.f16 s13, s12
+; CHECK-NEXT:    vcvtb.f32.f16 s12, s12
 ; CHECK-NEXT:    adds r1, #8
-; CHECK-NEXT:    adds r3, #4
-; CHECK-NEXT:    vadd.f32 q5, q3, q5
+; CHECK-NEXT:    adds r5, #4
+; CHECK-NEXT:    vadd.f32 q3, q2, q3
 ; CHECK-NEXT:    subs.w lr, lr, #1
 ; CHECK-NEXT:    bne .LBB2_3
 ; CHECK-NEXT:    b .LBB2_19
 ; CHECK-NEXT:  .LBB2_3: @ %vector.body
 ; CHECK-NEXT:    @ =>This Inner Loop Header: Depth=1
-; CHECK-NEXT:    vadd.i32 q4, q0, r3
-; CHECK-NEXT:    vmov q3, q5
-; CHECK-NEXT:    vcmp.u32 cs, q1, q4
-; CHECK-NEXT:    @ implicit-def: $q5
-; CHECK-NEXT:    vmrs r4, p0
-; CHECK-NEXT:    and r2, r4, #1
-; CHECK-NEXT:    rsbs r5, r2, #0
-; CHECK-NEXT:    movs r2, #0
-; CHECK-NEXT:    bfi r2, r5, #0, #1
-; CHECK-NEXT:    ubfx r5, r4, #4, #1
-; CHECK-NEXT:    rsbs r5, r5, #0
-; CHECK-NEXT:    bfi r2, r5, #1, #1
-; CHECK-NEXT:    ubfx r5, r4, #8, #1
-; CHECK-NEXT:    ubfx r4, r4, #12, #1
-; CHECK-NEXT:    rsbs r5, r5, #0
-; CHECK-NEXT:    bfi r2, r5, #2, #1
+; CHECK-NEXT:    vmov q2, q3
+; CHECK-NEXT:    vadd.i32 q3, q0, r5
+; CHECK-NEXT:    vcmp.u32 cs, q1, q3
+; CHECK-NEXT:    @ implicit-def: $q3
+; CHECK-NEXT:    vmrs r2, p0
+; CHECK-NEXT:    and r3, r2, #1
+; CHECK-NEXT:    rsbs r4, r3, #0
+; CHECK-NEXT:    movs r3, #0
+; CHECK-NEXT:    bfi r3, r4, #0, #1
+; CHECK-NEXT:    ubfx r4, r2, #4, #1
 ; CHECK-NEXT:    rsbs r4, r4, #0
-; CHECK-NEXT:    bfi r2, r4, #3, #1
-; CHECK-NEXT:    lsls r4, r2, #31
+; CHECK-NEXT:    bfi r3, r4, #1, #1
+; CHECK-NEXT:    ubfx r4, r2, #8, #1
+; CHECK-NEXT:    ubfx r2, r2, #12, #1
+; CHECK-NEXT:    rsbs r4, r4, #0
+; CHECK-NEXT:    bfi r3, r4, #2, #1
+; CHECK-NEXT:    rsbs r2, r2, #0
+; CHECK-NEXT:    bfi r3, r2, #3, #1
+; CHECK-NEXT:    lsls r2, r3, #31
 ; CHECK-NEXT:    bne .LBB2_12
 ; CHECK-NEXT:  @ %bb.4: @ %else
 ; CHECK-NEXT:    @ in Loop: Header=BB2_3 Depth=1
-; CHECK-NEXT:    lsls r4, r2, #30
+; CHECK-NEXT:    lsls r2, r3, #30
 ; CHECK-NEXT:    bmi .LBB2_13
 ; CHECK-NEXT:  .LBB2_5: @ %else5
 ; CHECK-NEXT:    @ in Loop: Header=BB2_3 Depth=1
-; CHECK-NEXT:    lsls r4, r2, #29
+; CHECK-NEXT:    lsls r2, r3, #29
 ; CHECK-NEXT:    bmi .LBB2_14
 ; CHECK-NEXT:  .LBB2_6: @ %else8
 ; CHECK-NEXT:    @ in Loop: Header=BB2_3 Depth=1
-; CHECK-NEXT:    lsls r2, r2, #28
+; CHECK-NEXT:    lsls r2, r3, #28
 ; CHECK-NEXT:    bpl .LBB2_8
 ; CHECK-NEXT:  .LBB2_7: @ %cond.load10
 ; CHECK-NEXT:    @ in Loop: Header=BB2_3 Depth=1
-; CHECK-NEXT:    vldr.16 s22, [r0, #6]
-; CHECK-NEXT:    vins.f16 s21, s22
+; CHECK-NEXT:    vldr.16 s14, [r0, #6]
+; CHECK-NEXT:    vins.f16 s13, s14
 ; CHECK-NEXT:  .LBB2_8: @ %else11
 ; CHECK-NEXT:    @ in Loop: Header=BB2_3 Depth=1
-; CHECK-NEXT:    vcmp.u32 cs, q2, q4
-; CHECK-NEXT:    @ implicit-def: $q6
-; CHECK-NEXT:    vmrs r4, p0
-; CHECK-NEXT:    and r2, r4, #1
-; CHECK-NEXT:    rsbs r5, r2, #0
-; CHECK-NEXT:    movs r2, #0
-; CHECK-NEXT:    bfi r2, r5, #0, #1
-; CHECK-NEXT:    ubfx r5, r4, #4, #1
-; CHECK-NEXT:    rsbs r5, r5, #0
-; CHECK-NEXT:    bfi r2, r5, #1, #1
-; CHECK-NEXT:    ubfx r5, r4, #8, #1
-; CHECK-NEXT:    ubfx r4, r4, #12, #1
-; CHECK-NEXT:    rsbs r5, r5, #0
-; CHECK-NEXT:    bfi r2, r5, #2, #1
+; CHECK-NEXT:    vmrs r2, p0
+; CHECK-NEXT:    @ implicit-def: $q4
+; CHECK-NEXT:    and r3, r2, #1
+; CHECK-NEXT:    rsbs r4, r3, #0
+; CHECK-NEXT:    movs r3, #0
+; CHECK-NEXT:    bfi r3, r4, #0, #1
+; CHECK-NEXT:    ubfx r4, r2, #4, #1
+; CHECK-NEXT:    rsbs r4, r4, #0
+; CHECK-NEXT:    bfi r3, r4, #1, #1
+; CHECK-NEXT:    ubfx r4, r2, #8, #1
+; CHECK-NEXT:    ubfx r2, r2, #12, #1
 ; CHECK-NEXT:    rsbs r4, r4, #0
-; CHECK-NEXT:    bfi r2, r4, #3, #1
-; CHECK-NEXT:    lsls r4, r2, #31
+; CHECK-NEXT:    bfi r3, r4, #2, #1
+; CHECK-NEXT:    rsbs r2, r2, #0
+; CHECK-NEXT:    bfi r3, r2, #3, #1
+; CHECK-NEXT:    lsls r2, r3, #31
 ; CHECK-NEXT:    bne .LBB2_15
 ; CHECK-NEXT:  @ %bb.9: @ %else15
 ; CHECK-NEXT:    @ in Loop: Header=BB2_3 Depth=1
-; CHECK-NEXT:    lsls r4, r2, #30
+; CHECK-NEXT:    lsls r2, r3, #30
 ; CHECK-NEXT:    bmi .LBB2_16
 ; CHECK-NEXT:  .LBB2_10: @ %else18
 ; CHECK-NEXT:    @ in Loop: Header=BB2_3 Depth=1
-; CHECK-NEXT:    lsls r4, r2, #29
+; CHECK-NEXT:    lsls r2, r3, #29
 ; CHECK-NEXT:    bmi .LBB2_17
 ; CHECK-NEXT:  .LBB2_11: @ %else21
 ; CHECK-NEXT:    @ in Loop: Header=BB2_3 Depth=1
-; CHECK-NEXT:    lsls r2, r2, #28
+; CHECK-NEXT:    lsls r2, r3, #28
 ; CHECK-NEXT:    bpl .LBB2_2
 ; CHECK-NEXT:    b .LBB2_18
 ; CHECK-NEXT:  .LBB2_12: @ %cond.load
 ; CHECK-NEXT:    @ in Loop: Header=BB2_3 Depth=1
-; CHECK-NEXT:    vldr.16 s20, [r0]
-; CHECK-NEXT:    lsls r4, r2, #30
+; CHECK-NEXT:    vldr.16 s12, [r0]
+; CHECK-NEXT:    lsls r2, r3, #30
 ; CHECK-NEXT:    bpl .LBB2_5
 ; CHECK-NEXT:  .LBB2_13: @ %cond.load4
 ; CHECK-NEXT:    @ in Loop: Header=BB2_3 Depth=1
-; CHECK-NEXT:    vldr.16 s22, [r0, #2]
-; CHECK-NEXT:    vins.f16 s20, s22
-; CHECK-NEXT:    lsls r4, r2, #29
+; CHECK-NEXT:    vldr.16 s14, [r0, #2]
+; CHECK-NEXT:    vins.f16 s12, s14
+; CHECK-NEXT:    lsls r2, r3, #29
 ; CHECK-NEXT:    bpl .LBB2_6
 ; CHECK-NEXT:  .LBB2_14: @ %cond.load7
 ; CHECK-NEXT:    @ in Loop: Header=BB2_3 Depth=1
-; CHECK-NEXT:    vldr.16 s21, [r0, #4]
-; CHECK-NEXT:    vmovx.f16 s22, s0
-; CHECK-NEXT:    vins.f16 s21, s22
-; CHECK-NEXT:    lsls r2, r2, #28
+; CHECK-NEXT:    vldr.16 s13, [r0, #4]
+; CHECK-NEXT:    vmovx.f16 s14, s0
+; CHECK-NEXT:    vins.f16 s13, s14
+; CHECK-NEXT:    lsls r2, r3, #28
 ; CHECK-NEXT:    bmi .LBB2_7
 ; CHECK-NEXT:    b .LBB2_8
 ; CHECK-NEXT:  .LBB2_15: @ %cond.load14
 ; CHECK-NEXT:    @ in Loop: Header=BB2_3 Depth=1
-; CHECK-NEXT:    vldr.16 s24, [r1]
-; CHECK-NEXT:    lsls r4, r2, #30
+; CHECK-NEXT:    vldr.16 s16, [r1]
+; CHECK-NEXT:    lsls r2, r3, #30
 ; CHECK-NEXT:    bpl .LBB2_10
 ; CHECK-NEXT:  .LBB2_16: @ %cond.load17
 ; CHECK-NEXT:    @ in Loop: Header=BB2_3 Depth=1
-; CHECK-NEXT:    vldr.16 s26, [r1, #2]
-; CHECK-NEXT:    vins.f16 s24, s26
-; CHECK-NEXT:    lsls r4, r2, #29
+; CHECK-NEXT:    vldr.16 s18, [r1, #2]
+; CHECK-NEXT:    vins.f16 s16, s18
+; CHECK-NEXT:    lsls r2, r3, #29
 ; CHECK-NEXT:    bpl .LBB2_11
 ; CHECK-NEXT:  .LBB2_17: @ %cond.load20
 ; CHECK-NEXT:    @ in Loop: Header=BB2_3 Depth=1
-; CHECK-NEXT:    vldr.16 s25, [r1, #4]
-; CHECK-NEXT:    vmovx.f16 s26, s0
-; CHECK-NEXT:    vins.f16 s25, s26
-; CHECK-NEXT:    lsls r2, r2, #28
+; CHECK-NEXT:    vldr.16 s17, [r1, #4]
+; CHECK-NEXT:    vmovx.f16 s18, s0
+; CHECK-NEXT:    vins.f16 s17, s18
+; CHECK-NEXT:    lsls r2, r3, #28
 ; CHECK-NEXT:    bpl.w .LBB2_2
 ; CHECK-NEXT:  .LBB2_18: @ %cond.load23
 ; CHECK-NEXT:    @ in Loop: Header=BB2_3 Depth=1
-; CHECK-NEXT:    vldr.16 s26, [r1, #6]
-; CHECK-NEXT:    vins.f16 s25, s26
+; CHECK-NEXT:    vldr.16 s18, [r1, #6]
+; CHECK-NEXT:    vins.f16 s17, s18
 ; CHECK-NEXT:    b .LBB2_2
 ; CHECK-NEXT:  .LBB2_19: @ %middle.block
-; CHECK-NEXT:    vdup.32 q0, r12
-; CHECK-NEXT:    vcmp.u32 cs, q0, q4
-; CHECK-NEXT:    vpsel q0, q5, q3
+; CHECK-NEXT:    vpsel q0, q3, q2
 ; CHECK-NEXT:    vmov.f32 s4, s2
 ; CHECK-NEXT:    vmov.f32 s5, s3
 ; CHECK-NEXT:    vadd.f32 q0, q0, q1
 ; CHECK-NEXT:    vmov r0, s1
 ; CHECK-NEXT:    vadd.f32 q0, q0, r0
 ; CHECK-NEXT:    add sp, #8
-; CHECK-NEXT:    vpop {d8, d9, d10, d11, d12, d13}
+; CHECK-NEXT:    vpop {d8, d9}
 ; CHECK-NEXT:    pop {r4, r5, r7, pc}
 ; CHECK-NEXT:    .p2align 4
 ; CHECK-NEXT:  @ %bb.20:
diff --git a/llvm/test/CodeGen/X86/masked_gather.ll b/llvm/test/CodeGen/X86/masked_gather.ll
index 559a7ec0930b99..c3f46da299c2f8 100644
--- a/llvm/test/CodeGen/X86/masked_gather.ll
+++ b/llvm/test/CodeGen/X86/masked_gather.ll
@@ -1137,12 +1137,12 @@ define <16 x i8> @gather_v16i8_v16i32_v16i8(ptr %base, <16 x i32> %idx, <16 x i8
 define <8 x i32> @gather_v8i32_v8i32(<8 x i32> %trigger) {
 ; SSE-LABEL: gather_v8i32_v8i32:
 ; SSE:       # %bb.0:
-; SSE-NEXT:    movdqa %xmm1, %xmm3
 ; SSE-NEXT:    movdqa %xmm0, %xmm2
 ; SSE-NEXT:    pxor %xmm0, %xmm0
 ; SSE-NEXT:    pcmpeqd %xmm0, %xmm1
-; SSE-NEXT:    pcmpeqd %xmm2, %xmm0
-; SSE-NEXT:    packssdw %xmm1, %xmm0
+; SSE-NEXT:    pcmpeqd %xmm0, %xmm2
+; SSE-NEXT:    packssdw %xmm1, %xmm2
+; SSE-NEXT:    movdqa %xmm2, %xmm0
 ; SSE-NEXT:    packsswb %xmm0, %xmm0
 ; SSE-NEXT:    pmovmskb %xmm0, %eax
 ; SSE-NEXT:    testb $1, %al
@@ -1195,17 +1195,13 @@ define <8 x i32> @gather_v8i32_v8i32(<8 x i32> %trigger) {
 ; SSE-NEXT:  .LBB4_17: # %cond.load19
 ; SSE-NEXT:    pinsrd $3, c+12(%rip), %xmm1
 ; SSE-NEXT:  .LBB4_18: # %else20
-; SSE-NEXT:    pxor %xmm4, %xmm4
-; SSE-NEXT:    movdqa %xmm2, %xmm5
-; SSE-NEXT:    pcmpeqd %xmm4, %xmm5
-; SSE-NEXT:    pcmpeqd %xmm3, %xmm4
-; SSE-NEXT:    packssdw %xmm4, %xmm5
-; SSE-NEXT:    packsswb %xmm5, %xmm5
-; SSE-NEXT:    pmovmskb %xmm5, %eax
+; SSE-NEXT:    movdqa %xmm2, %xmm3
+; SSE-NEXT:    packsswb %xmm3, %xmm3
+; SSE-NEXT:    pmovmskb %xmm3, %eax
 ; SSE-NEXT:    testb $1, %al
 ; SSE-NEXT:    je .LBB4_19
 ; SSE-NEXT:  # %bb.20: # %cond.load23
-; SSE-NEXT:    movd {{.*#+}} xmm4 = mem[0],zero,zero,zero
+; SSE-NEXT:    movd {{.*#+}} xmm3 = mem[0],zero,zero,zero
 ; SSE-NEXT:    testb $2, %al
 ; SSE-NEXT:    jne .LBB4_22
 ; SSE-NEXT:    jmp .LBB4_23
@@ -1215,11 +1211,11 @@ define <8 x i32> @gather_v8i32_v8i32(<8 x i32> %trigger) {
 ; SSE-NEXT:    jne .LBB4_17
 ; SSE-NEXT:    jmp .LBB4_18
 ; SSE-NEXT:  .LBB4_19:
-; SSE-NEXT:    # implicit-def: $xmm4
+; SSE-NEXT:    # implicit-def: $xmm3
 ; SSE-NEXT:    testb $2, %al
 ; SSE-NEXT:    je .LBB4_23
 ; SSE-NEXT:  .LBB4_22: # %cond.load28
-; SSE-NEXT:    pinsrd $1, c+28(%rip), %xmm4
+; SSE-NEXT:    pinsrd $1, c+28(%rip), %xmm3
 ; SSE-NEXT:  .LBB4_23: # %else31
 ; SSE-NEXT:    testb $4, %al
 ; SSE-NEXT:    jne .LBB4_24
@@ -1230,24 +1226,24 @@ define <8 x i32> @gather_v8i32_v8i32(<8 x i32> %trigger) {
 ; SSE-NEXT:    testb $16, %al
 ; SSE-NEXT:    je .LBB4_28
 ; SSE-NEXT:  .LBB4_29: # %cond.load43
-; SSE-NEXT:    pinsrd $0, c+28(%rip), %xmm5
+; SSE-NEXT:    pinsrd $0, c+28(%rip), %xmm4
 ; SSE-NEXT:    testb $32, %al
 ; SSE-NEXT:    jne .LBB4_31
 ; SSE-NEXT:    jmp .LBB4_32
 ; SSE-NEXT:  .LBB4_24: # %cond.load33
-; SSE-NEXT:    pinsrd $2, c+28(%rip), %xmm4
+; SSE-NEXT:    pinsrd $2, c+28(%rip), %xmm3
 ; SSE-NEXT:    testb $8, %al
 ; SSE-NEXT:    je .LBB4_27
 ; SSE-NEXT:  .LBB4_26: # %cond.load38
-; SSE-NEXT:    pinsrd $3, c+28(%rip), %xmm4
+; SSE-NEXT:    pinsrd $3, c+28(%rip), %xmm3
 ; SSE-NEXT:    testb $16, %al
 ; SSE-NEXT:    jne .LBB4_29
 ; SSE-NEXT:  .LBB4_28:
-; SSE-NEXT:    # implicit-def: $xmm5
+; SSE-NEXT:    # implicit-def: $xmm4
 ; SSE-NEXT:    testb $32, %al
 ; SSE-NEXT:    je .LBB4_32
 ; SSE-NEXT:  .LBB4_31: # %cond.load48
-; SSE-NEXT:    pinsrd $1, c+28(%rip), %xmm5
+; SSE-NEXT:    pinsrd $1, c+28(%rip), %xmm4
 ; SSE-NEXT:  .LBB4_32: # %else51
 ; SSE-NEXT:    testb $64, %al
 ; SSE-NEXT:    jne .LBB4_33
@@ -1255,12 +1251,8 @@ define <8 x i32> @gather_v8i32_v8i32(<8 x i32> %trigger) {
 ; SSE-NEXT:    testb $-128, %al
 ; SSE-NEXT:    je .LBB4_36
 ; SSE-NEXT:  .LBB4_35: # %cond.load58
-; SSE-NEXT:    pinsrd $3, c+28(%rip), %xmm5
+; SSE-NEXT:    pinsrd $3, c+28(%rip), %xmm4
 ; SSE-NEXT:  .LBB4_36: # %else61
-; SSE-NEXT:    pxor %xmm6, %xmm6
-; SSE-NEXT:    pcmpeqd %xmm6, %xmm2
-; SSE-NEXT:    pcmpeqd %xmm6, %xmm3
-; SSE-NEXT:    packssdw %xmm3, %xmm2
 ; SSE-NEXT:    packsswb %xmm2, %xmm2
 ; SSE-NEXT:    ...
[truncated]

  SmallSet<User *, 4> Users;
  for (auto *U : Cmp->users()) {
    Instruction *User = cast<Instruction>(U);
    if (isa<PHINode>(User))
Contributor

Maybe keep some full-user iteration below and also skip pseudo instructions that will be dropped during codegen here?

Contributor Author

Hi, I've reverted these changes and gone with a TLI hook instead.

@goldsteinn
Contributor

Alternatively, I could also introduce a TLI hook to query the
target if this is a preferred solution?

This seems pretty similar to existing shouldNormalizeToSelectSequence or hasMultipleConditionRegisters.
Maybe shouldNormalizeToSelectSequence should become shouldSinkComparisons more generally (whether it be into selects or BBs).

@davemgreen
Collaborator

MVE has only a single vector predicate register available, so it might be better to do the sinking. Could you add a test case to check if it looks better?

@david-arm
Contributor Author

david-arm commented Oct 24, 2024

MVE has only a single vector predicate register available, so it might be better to do the sinking. Could you add a test case to check if it looks better?

That's a fair point. I thought this was being tested by existing tests in llvm/test/CodeGen/Thumb2/LowOverheadLoops/fast-fp-loops.ll. I took a look at the results and they seemed to be around 1-2 fewer instructions overall, but that doesn't mean it's better of course. :) After chatting privately with @paulwalker-arm, perhaps it's better to introduce a hasSingleVectorComparisonRegister or something like that? For NEON, SVE and presumably AVX this returns a number greater than one. For other targets where the hook returns 1 we sink the comparison. I'll play around with this and see where it leads ...

This at least would follow consistent logic for the scalar version.

@david-arm
Contributor Author

Hi @davemgreen, I will also add a proper test for MVE similar to the one I added because I realised the existing Thumb2 test changes are for tail-folded loops, which is probably a bit different.

@davemgreen
Collaborator

Yeah thanks. A lot of the tests look like expanded masked load/store, which are going to be a bit different to more normal codegen due to the scalarization. At least in some cases it might be better to scalarize the compare. Hopefully they shouldn't come up much in practice though.

@david-arm
Contributor Author

Alternatively, I could also introduce a TLI hook to query the
target if this is a preferred solution?

This seems pretty similar to existing shouldNormalizeToSelectSequence or hasMultipleConditionRegisters. Maybe shouldNormalizeToSelectSequence should become shouldSinkComparisons more generally (whether it be into selects or BBs).

Hmm, it looks like shouldNormalizeToSelectSequence is a virtual function overridden by AArch64 and is used by the DAGCombiner for a different problem. It looks like changing this function would be beyond the scope of this patch, however I am happy to investigate this as a follow-on if that's ok? Is the idea to avoid performing the DAG combine too if for vector comparisons the target has multiple vector predicate registers?

@goldsteinn
Contributor

Alternatively, I could also introduce a TLI hook to query the
target if this is a preferred solution?

This seems pretty similar to existing shouldNormalizeToSelectSequence or hasMultipleConditionRegisters. Maybe shouldNormalizeToSelectSequence should become shouldSinkComparisons more generally (whether it be into selects or BBs).

Hmm, it looks like shouldNormalizeToSelectSequence is a virtual function overridden by AArch64 and is used by the DAGCombiner for a different problem. It looks like changing this function would be beyond the scope of this patch, however I am happy to investigate this as a follow-on if that's ok? Is the idea to avoid performing the DAG combine too if for vector comparisons the target has multiple vector predicate registers?

Sure, more so what I meant was it seems we essentially have 2 existing flags that are related to avoiding transforms targeting single-predicate cmp ISAs. If you do choose to add a flag, it probably should co-opt or replace one of those.

  /// Return true if multiple vector predicate registers are available.
  bool hasMultipleVectorPredicateRegisters() const {
    return HasMultipleVectorPredicateRegisters;
  }
Contributor

Imo this should just take a type and be hasMultiplePredicateRegister(Type * Ty)

Contributor Author

OK, I've added an extra EVT argument because if I understood you correctly from your other comment you'd prefer me to have a single function called from both shouldNormalizeToSelectSequence and sinkCmpExpression. In this case shouldNormalizeToSelectSequence is in DAGCombiner and works with EVTs rather than Types.

@@ -2389,7 +2389,7 @@ class TargetLoweringBase {
                                                EVT VT) const {
     // If a target has multiple condition registers, then it likely has logical
     // operations on those registers.
-    if (hasMultipleConditionRegisters())
+    if (hasMultiplePredicateRegisters(VT))
Contributor

From looking at the use of this hook in DAGCombiner::visitSELECT, I think VT here is the type of X and Y. But hasMultiplePredicateRegisters should be passed the type of the predicate, which is hard coded to MVT::i1. (Please double check this since I find it quite confusing!)

Contributor

Of course it would help if the comment on this function said what VT is supposed to be.

Contributor Author

Ah yes, I think you might be right! I'd just assumed (probably wrongly) that VT was the result type. I've fixed this now.
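
For reference, a minimal sketch of what the corrected default might look like, assuming the fix is simply to query the new hook with the predicate type (i1 for the scalar selects handled here) rather than with VT; the exact code in the updated patch may differ:

  virtual bool shouldNormalizeToSelectSequence(LLVMContext &Context,
                                               EVT VT) const {
    // The compare feeding the select condition produces an i1, so that is the
    // type whose register class matters here, not VT (the type being selected).
    if (hasMultiplePredicateRegisters(MVT::i1))
      return false;
    // Only do the transform if the value won't be split into multiple
    // registers.
    LegalizeTypeAction Action = getTypeAction(Context, VT);
    return Action != TypeExpandInteger && Action != TypeExpandFloat &&
           Action != TypeSplitVector;
  }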

@@ -387,6 +387,16 @@ class AMDGPUTargetLowering : public TargetLowering {
  MVT getFenceOperandTy(const DataLayout &DL) const override {
    return MVT::i32;
  }

  virtual bool hasMultiplePredicateRegisters(EVT VT) const override {
    // FIXME: This is only partially true. If we have to do vector compares,
Contributor

AMDGPU part looks fine. In future we will probably want to address this FIXME by passing in a node, not just a type.

Collaborator

@davemgreen davemgreen left a comment

Can you add these two tests too: https://godbolt.org/z/G1Ma7Gd9v. They are simpler and don't involve scalarization of the predicate (which will hopefully be rare).

@@ -1358,6 +1358,10 @@ class AArch64TargetLowering : public TargetLowering {
unsigned getMinimumJumpTableEntries() const override;

bool softPromoteHalfType() const override { return true; }

virtual bool hasMultiplePredicateRegisters(EVT VT) const {
Collaborator

+override

vector.body:
  %index = phi i64 [ 0, %entry ], [ %index.next, %pred.store.continue18 ]
  %vec.ind = phi <4 x i64> [ <i64 0, i64 1, i64 2, i64 3>, %entry ], [ %vec.ind.next, %pred.store.continue18 ]
  %0 = icmp ult <4 x i64> %vec.ind, <i64 15, i64 15, i64 15, i64 15>
Collaborator

Can you make this 4 x i32, so that it is a legal type under MVE.

Contributor Author

Sure, although sadly this is genuinely what the loop vectoriser kicks out, which is why I wrote the test this way. It would be good to fix the vectoriser to choose a more sensible vector induction variable when you know the max range of possible element values.

Collaborator

Do you have a C example? I'm surprised we would try and generate this for MVE; I would have expected the vector to be smaller and i64s to be fairly high cost. Maybe the scalars are too high too.

Contributor Author

Oh of course you're right. I think I took the output from the vectoriser for AArch64 and then tried to recreate an MVE test from it. If I compile

void foo(int *dest) {
  for (int i = 0; i < 14; i++) {
    dest[i] = 3;
  }
}

with

clang -O3 -fno-unroll-loops --target=thumb-linux-gnueabi -mcpu=cortex-m55 -mllvm -tail-predication=disabled -S -emit-llvm ./lowtrip.c

then I see

vector.body:                                      ; preds = %vector.body, %entry
  %index = phi i32 [ 0, %entry ], [ %index.next, %vector.body ]
  %broadcast.splatinsert = insertelement <4 x i32> poison, i32 %index, i64 0
  %broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> poison, <4 x i32> zeroinitializer
  %vec.iv = or disjoint <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
  %0 = icmp ult <4 x i32> %vec.iv, <i32 14, i32 14, i32 14, i32 14>
  %1 = getelementptr inbounds i32, ptr %dest, i32 %index
  tail call void @llvm.masked.store.v4i32.p0(<4 x i32> <i32 3, i32 3, i32 3, i32 3>, ptr %1, i32 4, <4 x i1> %0), !tbaa !5
  %index.next = add i32 %index, 4
  %2 = icmp eq i32 %index.next, 16
  br i1 %2, label %for.cond.cleanup, label %vector.body, !llvm.loop !9

I think a better example where it goes wrong is:

void foo(short *dest) {
  for (int i = 0; i < 14; i++) {
    dest[i] = 3;
  }
}

where the vectoriser output is:

vector.body:                                      ; preds = %vector.body, %entry
  %index = phi i32 [ 0, %entry ], [ %index.next, %vector.body ]
  %broadcast.splatinsert = insertelement <8 x i32> poison, i32 %index, i64 0
  %broadcast.splat = shufflevector <8 x i32> %broadcast.splatinsert, <8 x i32> poison, <8 x i32> zeroinitializer
  %vec.iv = or disjoint <8 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  %0 = icmp ult <8 x i32> %vec.iv, <i32 14, i32 14, i32 14, i32 14, i32 14, i32 14, i32 14, i32 14>
  %1 = getelementptr inbounds i16, ptr %dest, i32 %index
  tail call void @llvm.masked.store.v8i16.p0(<8 x i16> <i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3, i16 3>, ptr %1, i32 2, <8 x i1> %0), !tbaa !5
  %index.next = add i32 %index, 8
  %2 = icmp eq i32 %index.next, 16
  br i1 %2, label %for.cond.cleanup, label %vector.body, !llvm.loop !9

and this time the <8 x i32> isn't ideal - we could in fact have chosen <8 x i16>. Anyway, for the purpose of this particular PR it doesn't matter I suppose. I'll amend the test to use i32!

@@ -0,0 +1,136 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
; RUN: llc -mtriple=thumbv8.1m.main -mattr=+mve.fp,+fp-armv8d16sp,+fp16,+fullfp16 < %s | FileCheck %s
Collaborator

-mattr=+mve I think should be enough

@david-arm
Contributor Author

Can you add these two tests too: https://godbolt.org/z/G1Ma7Gd9v. They are simpler and don't involve scalarization of the predicate (which will hopefully be rare).

Damn, just realised I missed this. I'll take a look now.

Whilst reviewing PR llvm#109289 and doing some analysis with various
tests involving predicated blocks I noticed that we're making
codegen and performance worse by sinking vector comparisons
multiple times into blocks. It looks like the sinkCmpExpression
in CodeGenPrepare was written for scalar comparisons where there
is only a single condition register, whereas vector comparisons
typically produce a vector result. For some targets, such as NEON
or SVE, there are multiple allocatable vector registers that can
store the result and so we should avoid sinking in that case.
@david-arm
Contributor Author

Added new tests

Collaborator

@davemgreen davemgreen left a comment

Sorry for the delay, I was going through things in a strange order. I had previously only checked the test on AArch64. Do we need to worry about the extra xtn/sshll this introduces? I was going to suggest we just remove them post-isel, but that might require getting the top bits correct, which requires a lot of infrastructure that currently doesn't exist.

@david-arm
Contributor Author

Sorry for the delay, I was going through things in a strange order. I had previously only checked the test on AArch64. Do we need to worry about the extra xtn/sshll this introduces? I was going to suggest we just remove them post-isel, but that might require getting the top bits correct, which requires a lot of infrastructure that currently doesn't exist.

Do you mean for the no_sink_simple case? So the typical use case that we see here is when vectorising a loop with a condition and having predicated blocks that depend upon each element of the result. So when there are 4 or more uses to sink (e.g. vector_loop_with_icmp) this definitely gives some significant performance gains, which was the motivation for this patch. You could be right that this isn't worth it when there are only two uses, but I'm not sure how to account for that. If possible I'd prefer to keep the hasMultiplePredicateRegisters function as it is because I think it's more useful as a standalone hook rather than being tied to the specifics of sinking a comparison instruction. It looks like sinkCmpExpression was never originally intended for sinking vector instructions, and presumably this just started happening by accident for vectors without necessarily doing a cost-benefit analysis. The only clean way I can think of to work around this is by introducing a brand new TLI hook, called shouldConsiderSinkingCmp, that checks if the number of uses of a fixed-width vector is > 2, e.g.

static bool sinkCmpExpression(const DataLayout &DL, CmpInst *Cmp,
                              const TargetLowering &TLI) {
  EVT ResVT = TLI.getValueType(DL, Cmp->getType());
  if (!TLI.shouldConsiderSinkingCmp(Cmp) || TLI.hasMultiplePredicateRegisters(ResVT))
    return false;

  // Avoid sinking soft-FP comparisons, since this can move them into a loop.
  if (TLI.useSoftFloat() && isa<FCmpInst>(Cmp))
    return false;

where shouldConsiderSinkingCmp returns true in the default case and for AArch64 does something like this:

bool shouldConsiderSinkingCmp(CmpInst *Cmp) {
  auto *VecTy = dyn_cast<FixedVectorType>(Cmp->getType());
  return !VecTy || Cmp->getNumUses() > 2;
}

What do you think? Or it could go one step further and look at the users to see if they are extractelement or not, i.e. return false if and only if it's a 2-element vector where each user is an extract?
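
A literal translation of that last idea might look like the following (purely hypothetical sketch, never part of the patch): refuse to sink only when the compare is a 2-element fixed vector and every user is an extractelement.

  bool shouldConsiderSinkingCmp(CmpInst *Cmp) const {
    auto *VecTy = dyn_cast<FixedVectorType>(Cmp->getType());
    if (!VecTy || VecTy->getNumElements() != 2)
      return true;
    // Consider sinking only if some user is not a plain extract of the result.
    return !llvm::all_of(Cmp->users(), [](const User *U) {
      return isa<ExtractElementInst>(U);
    });
  }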

@davemgreen
Collaborator

I believe it is less about the number of uses and more about the type legalization of the uses. v4i1 gets legalized to v4i16, which will insert xtn and sshll if the original instructions were v4i32. The cross-basic-block nature makes it difficult for the DAG to deal with. If the uses are extracts then the problem is lessened, as extracts can always use the type they see.

I gave it a test though and it seemed OK, so LGTM. We might want to look at removing the legalization artefacts too.

Collaborator

@paulwalker-arm paulwalker-arm left a comment

Hi @david-arm, sorry for the late review but when looking at the tests I think I've misunderstood our original conversation. While hasMultipleConditionRegisters seems like an overly specific question for the affected code, on reflection I'm not sure that matters.

Looking at vector_loop_with_icmp I can see that not sinking the compare improves the output, but isn't the original output only bad because the code generator is missing an obvious extractelt (setcc X, C), IndexC => setcc (extractelt X, IndexC), C combine, which would be better all-round?
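
As a rough IR-level illustration of that fold (made-up values chosen to match the test, not code from any patch): instead of comparing the whole vector and then extracting one i1 lane, the lane would be extracted first and compared as a scalar.

; Before: whole-vector compare, then extract a single predicate lane.
  %c = icmp ult <4 x i64> %vec.ind, <i64 15, i64 15, i64 15, i64 15>
  %bit = extractelement <4 x i1> %c, i64 2

; After the suggested combine: extract the lane, then do a scalar compare.
  %lane = extractelement <4 x i64> %vec.ind, i64 2
  %bit = icmp ult i64 %lane, 15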

@david-arm
Contributor Author

Hi @david-arm, sorry for the late review but when looking at the tests I think I've misunderstood our original conversation. While hasMultipleConditionRegisters seems like an overly specific question for the affected code, on reflection I'm not sure that matters.

Looking at vector_loop_with_icmp I can see that not sinking the compare improves the output, but isn't the original output only bad because the code generator is missing an obvious extractelt (setcc X, C), IndexC => setcc (extractelt X, IndexC), C combine, which would be better all-round?

I'm happy to have a look at your suggestion and I can believe it helps, but I'm not sure that in general it's a good idea to be sinking unless you can prove the vector comparison will disappear? Anyway, I can try the codegen first and see if this patch still helps on top of it.

@david-arm
Contributor Author

I've followed the suggestion by @paulwalker-arm and I will abandon this patch in favour of #116031

@david-arm david-arm closed this Dec 12, 2024
@david-arm david-arm deleted the no_sink_vcmps branch January 28, 2025 11:48