[LoopVectorize] Don't discount instructions scalarized due to tail folding #109289

Open · wants to merge 6 commits into main

Conversation

john-brawn-arm (Collaborator)

When an instruction is scalarized due to tail folding, the cost we calculate (which is then returned by getWideningCost and thus getInstructionCost) is already the scalarized cost, so computePredInstDiscount shouldn't apply a scalarization discount to these instructions.

Fixes #66652

llvmbot (Member) commented Sep 19, 2024

@llvm/pr-subscribers-llvm-analysis
@llvm/pr-subscribers-backend-powerpc

@llvm/pr-subscribers-llvm-transforms

Author: John Brawn (john-brawn-arm)

Changes

When an instruction is scalarized due to tail folding, the cost we calculate (which is then returned by getWideningCost and thus getInstructionCost) is already the scalarized cost, so computePredInstDiscount shouldn't apply a scalarization discount to these instructions.

Fixes #66652


Patch is 87.68 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/109289.diff

5 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+8-4)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll (+43-302)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/induction-costs-sve.ll (+12-304)
  • (modified) llvm/test/Transforms/LoopVectorize/PowerPC/vplan-force-tail-with-evl.ll (+24-6)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/small-size.ll (+182-119)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 107fb38be31969..c64f8ccb3a84f7 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -5501,10 +5501,14 @@ InstructionCost LoopVectorizationCostModel::computePredInstDiscount(
     // Scale the total scalar cost by block probability.
     ScalarCost /= getReciprocalPredBlockProb();
 
-    // Compute the discount. A non-negative discount means the vector version
-    // of the instruction costs more, and scalarizing would be beneficial.
-    Discount += VectorCost - ScalarCost;
-    ScalarCosts[I] = ScalarCost;
+    // Compute the discount, unless this instruction must be scalarized due to
+    // tail folding, as then the vector cost is already the scalar cost. A
+    // non-negative discount means the vector version of the instruction costs
+    // more, and scalarizing would be beneficial.
+    if (!foldTailByMasking() || getWideningDecision(I, VF) != CM_Scalarize) {
+      Discount += VectorCost - ScalarCost;
+      ScalarCosts[I] = ScalarCost;
+    }
   }
 
   return Discount;
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll b/llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll
index 9910be7224674c..de1cae7383f863 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll
@@ -529,93 +529,14 @@ define void @latch_branch_cost(ptr %dst) {
 ; PRED-LABEL: define void @latch_branch_cost(
 ; PRED-SAME: ptr [[DST:%.*]]) {
 ; PRED-NEXT:  entry:
-; PRED-NEXT:    br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
-; PRED:       vector.ph:
-; PRED-NEXT:    br label [[VECTOR_BODY:%.*]]
-; PRED:       vector.body:
-; PRED-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[PRED_STORE_CONTINUE6:%.*]] ]
-; PRED-NEXT:    [[VEC_IND:%.*]] = phi <8 x i64> [ <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7>, [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[PRED_STORE_CONTINUE6]] ]
-; PRED-NEXT:    [[TMP0:%.*]] = icmp ule <8 x i64> [[VEC_IND]], <i64 99, i64 99, i64 99, i64 99, i64 99, i64 99, i64 99, i64 99>
-; PRED-NEXT:    [[TMP1:%.*]] = extractelement <8 x i1> [[TMP0]], i32 0
-; PRED-NEXT:    br i1 [[TMP1]], label [[PRED_STORE_IF:%.*]], label [[PRED_STORE_CONTINUE:%.*]]
-; PRED:       pred.store.if:
-; PRED-NEXT:    [[TMP2:%.*]] = add i64 [[INDEX]], 0
-; PRED-NEXT:    [[TMP3:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP2]]
-; PRED-NEXT:    store i8 0, ptr [[TMP3]], align 1
-; PRED-NEXT:    br label [[PRED_STORE_CONTINUE]]
-; PRED:       pred.store.continue:
-; PRED-NEXT:    [[TMP4:%.*]] = extractelement <8 x i1> [[TMP0]], i32 1
-; PRED-NEXT:    br i1 [[TMP4]], label [[PRED_STORE_IF1:%.*]], label [[PRED_STORE_CONTINUE2:%.*]]
-; PRED:       pred.store.if1:
-; PRED-NEXT:    [[TMP5:%.*]] = add i64 [[INDEX]], 1
-; PRED-NEXT:    [[TMP6:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP5]]
-; PRED-NEXT:    store i8 0, ptr [[TMP6]], align 1
-; PRED-NEXT:    br label [[PRED_STORE_CONTINUE2]]
-; PRED:       pred.store.continue2:
-; PRED-NEXT:    [[TMP7:%.*]] = extractelement <8 x i1> [[TMP0]], i32 2
-; PRED-NEXT:    br i1 [[TMP7]], label [[PRED_STORE_IF3:%.*]], label [[PRED_STORE_CONTINUE4:%.*]]
-; PRED:       pred.store.if3:
-; PRED-NEXT:    [[TMP8:%.*]] = add i64 [[INDEX]], 2
-; PRED-NEXT:    [[TMP9:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP8]]
-; PRED-NEXT:    store i8 0, ptr [[TMP9]], align 1
-; PRED-NEXT:    br label [[PRED_STORE_CONTINUE4]]
-; PRED:       pred.store.continue4:
-; PRED-NEXT:    [[TMP10:%.*]] = extractelement <8 x i1> [[TMP0]], i32 3
-; PRED-NEXT:    br i1 [[TMP10]], label [[PRED_STORE_IF5:%.*]], label [[PRED_STORE_CONTINUE7:%.*]]
-; PRED:       pred.store.if5:
-; PRED-NEXT:    [[TMP11:%.*]] = add i64 [[INDEX]], 3
-; PRED-NEXT:    [[TMP12:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP11]]
-; PRED-NEXT:    store i8 0, ptr [[TMP12]], align 1
-; PRED-NEXT:    br label [[PRED_STORE_CONTINUE7]]
-; PRED:       pred.store.continue6:
-; PRED-NEXT:    [[TMP13:%.*]] = extractelement <8 x i1> [[TMP0]], i32 4
-; PRED-NEXT:    br i1 [[TMP13]], label [[PRED_STORE_IF7:%.*]], label [[PRED_STORE_CONTINUE8:%.*]]
-; PRED:       pred.store.if7:
-; PRED-NEXT:    [[TMP14:%.*]] = add i64 [[INDEX]], 4
-; PRED-NEXT:    [[TMP15:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP14]]
-; PRED-NEXT:    store i8 0, ptr [[TMP15]], align 1
-; PRED-NEXT:    br label [[PRED_STORE_CONTINUE8]]
-; PRED:       pred.store.continue8:
-; PRED-NEXT:    [[TMP16:%.*]] = extractelement <8 x i1> [[TMP0]], i32 5
-; PRED-NEXT:    br i1 [[TMP16]], label [[PRED_STORE_IF9:%.*]], label [[PRED_STORE_CONTINUE10:%.*]]
-; PRED:       pred.store.if9:
-; PRED-NEXT:    [[TMP17:%.*]] = add i64 [[INDEX]], 5
-; PRED-NEXT:    [[TMP18:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP17]]
-; PRED-NEXT:    store i8 0, ptr [[TMP18]], align 1
-; PRED-NEXT:    br label [[PRED_STORE_CONTINUE10]]
-; PRED:       pred.store.continue10:
-; PRED-NEXT:    [[TMP19:%.*]] = extractelement <8 x i1> [[TMP0]], i32 6
-; PRED-NEXT:    br i1 [[TMP19]], label [[PRED_STORE_IF11:%.*]], label [[PRED_STORE_CONTINUE12:%.*]]
-; PRED:       pred.store.if11:
-; PRED-NEXT:    [[TMP20:%.*]] = add i64 [[INDEX]], 6
-; PRED-NEXT:    [[TMP21:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP20]]
-; PRED-NEXT:    store i8 0, ptr [[TMP21]], align 1
-; PRED-NEXT:    br label [[PRED_STORE_CONTINUE12]]
-; PRED:       pred.store.continue12:
-; PRED-NEXT:    [[TMP22:%.*]] = extractelement <8 x i1> [[TMP0]], i32 7
-; PRED-NEXT:    br i1 [[TMP22]], label [[PRED_STORE_IF13:%.*]], label [[PRED_STORE_CONTINUE6]]
-; PRED:       pred.store.if13:
-; PRED-NEXT:    [[TMP23:%.*]] = add i64 [[INDEX]], 7
-; PRED-NEXT:    [[TMP24:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP23]]
-; PRED-NEXT:    store i8 0, ptr [[TMP24]], align 1
-; PRED-NEXT:    br label [[PRED_STORE_CONTINUE6]]
-; PRED:       pred.store.continue14:
-; PRED-NEXT:    [[INDEX_NEXT]] = add i64 [[INDEX]], 8
-; PRED-NEXT:    [[VEC_IND_NEXT]] = add <8 x i64> [[VEC_IND]], <i64 8, i64 8, i64 8, i64 8, i64 8, i64 8, i64 8, i64 8>
-; PRED-NEXT:    [[TMP25:%.*]] = icmp eq i64 [[INDEX_NEXT]], 104
-; PRED-NEXT:    br i1 [[TMP25]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
-; PRED:       middle.block:
-; PRED-NEXT:    br i1 true, label [[FOR_END:%.*]], label [[SCALAR_PH]]
-; PRED:       scalar.ph:
-; PRED-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i64 [ 104, [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
 ; PRED-NEXT:    br label [[FOR_BODY:%.*]]
 ; PRED:       loop:
-; PRED-NEXT:    [[IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ]
+; PRED-NEXT:    [[IV:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ]
 ; PRED-NEXT:    [[GEP:%.*]] = getelementptr i8, ptr [[DST]], i64 [[IV]]
 ; PRED-NEXT:    store i8 0, ptr [[GEP]], align 1
 ; PRED-NEXT:    [[INDVARS_IV_NEXT]] = add i64 [[IV]], 1
 ; PRED-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], 100
-; PRED-NEXT:    br i1 [[EXITCOND_NOT]], label [[FOR_END]], label [[FOR_BODY]], !llvm.loop [[LOOP5:![0-9]+]]
+; PRED-NEXT:    br i1 [[EXITCOND_NOT]], label [[EXIT:%.*]], label [[FOR_BODY]]
 ; PRED:       exit:
 ; PRED-NEXT:    ret void
 ;
@@ -800,27 +721,27 @@ define i32 @header_mask_and_invariant_compare(ptr %A, ptr %B, ptr %C, ptr %D, pt
 ; PRED-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
 ; PRED-NEXT:    [[ACTIVE_LANE_MASK:%.*]] = phi <vscale x 4 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], [[VECTOR_PH]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], [[VECTOR_BODY]] ]
 ; PRED-NEXT:    [[TMP15:%.*]] = add i64 [[INDEX]], 0
-; PRED-NEXT:    [[TMP16:%.*]] = load i32, ptr [[A]], align 4, !alias.scope [[META6:![0-9]+]]
+; PRED-NEXT:    [[TMP16:%.*]] = load i32, ptr [[A]], align 4, !alias.scope [[META4:![0-9]+]]
 ; PRED-NEXT:    [[BROADCAST_SPLATINSERT28:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[TMP16]], i64 0
 ; PRED-NEXT:    [[BROADCAST_SPLAT29:%.*]] = shufflevector <vscale x 4 x i32> [[BROADCAST_SPLATINSERT28]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
-; PRED-NEXT:    [[TMP17:%.*]] = load i32, ptr [[B]], align 4, !alias.scope [[META9:![0-9]+]]
+; PRED-NEXT:    [[TMP17:%.*]] = load i32, ptr [[B]], align 4, !alias.scope [[META7:![0-9]+]]
 ; PRED-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[TMP17]], i64 0
 ; PRED-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 4 x i32> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
 ; PRED-NEXT:    [[TMP18:%.*]] = or <vscale x 4 x i32> [[BROADCAST_SPLAT]], [[BROADCAST_SPLAT29]]
-; PRED-NEXT:    [[TMP19:%.*]] = load i32, ptr [[C]], align 4, !alias.scope [[META11:![0-9]+]]
+; PRED-NEXT:    [[TMP19:%.*]] = load i32, ptr [[C]], align 4, !alias.scope [[META9:![0-9]+]]
 ; PRED-NEXT:    [[BROADCAST_SPLATINSERT30:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[TMP19]], i64 0
 ; PRED-NEXT:    [[BROADCAST_SPLAT31:%.*]] = shufflevector <vscale x 4 x i32> [[BROADCAST_SPLATINSERT30]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
 ; PRED-NEXT:    [[TMP20:%.*]] = icmp ugt <vscale x 4 x i32> [[BROADCAST_SPLAT31]], [[TMP18]]
 ; PRED-NEXT:    [[TMP21:%.*]] = select <vscale x 4 x i1> [[ACTIVE_LANE_MASK]], <vscale x 4 x i1> [[TMP20]], <vscale x 4 x i1> zeroinitializer
 ; PRED-NEXT:    [[TMP22:%.*]] = getelementptr i32, ptr [[D]], i64 [[TMP15]]
-; PRED-NEXT:    call void @llvm.masked.scatter.nxv4i32.nxv4p0(<vscale x 4 x i32> [[TMP18]], <vscale x 4 x ptr> [[BROADCAST_SPLAT33]], i32 4, <vscale x 4 x i1> [[TMP21]]), !alias.scope [[META13:![0-9]+]], !noalias [[META15:![0-9]+]]
+; PRED-NEXT:    call void @llvm.masked.scatter.nxv4i32.nxv4p0(<vscale x 4 x i32> [[TMP18]], <vscale x 4 x ptr> [[BROADCAST_SPLAT33]], i32 4, <vscale x 4 x i1> [[TMP21]]), !alias.scope [[META11:![0-9]+]], !noalias [[META13:![0-9]+]]
 ; PRED-NEXT:    [[TMP23:%.*]] = getelementptr i32, ptr [[TMP22]], i32 0
-; PRED-NEXT:    call void @llvm.masked.store.nxv4i32.p0(<vscale x 4 x i32> zeroinitializer, ptr [[TMP23]], i32 4, <vscale x 4 x i1> [[TMP21]]), !alias.scope [[META17:![0-9]+]], !noalias [[META18:![0-9]+]]
+; PRED-NEXT:    call void @llvm.masked.store.nxv4i32.p0(<vscale x 4 x i32> zeroinitializer, ptr [[TMP23]], i32 4, <vscale x 4 x i1> [[TMP21]]), !alias.scope [[META15:![0-9]+]], !noalias [[META16:![0-9]+]]
 ; PRED-NEXT:    [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP9]]
 ; PRED-NEXT:    [[ACTIVE_LANE_MASK_NEXT]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[INDEX]], i64 [[TMP14]])
 ; PRED-NEXT:    [[TMP24:%.*]] = xor <vscale x 4 x i1> [[ACTIVE_LANE_MASK_NEXT]], shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i64 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer)
 ; PRED-NEXT:    [[TMP25:%.*]] = extractelement <vscale x 4 x i1> [[TMP24]], i32 0
-; PRED-NEXT:    br i1 [[TMP25]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP19:![0-9]+]]
+; PRED-NEXT:    br i1 [[TMP25]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP17:![0-9]+]]
 ; PRED:       middle.block:
 ; PRED-NEXT:    br i1 true, label [[EXIT:%.*]], label [[SCALAR_PH]]
 ; PRED:       scalar.ph:
@@ -842,7 +763,7 @@ define i32 @header_mask_and_invariant_compare(ptr %A, ptr %B, ptr %C, ptr %D, pt
 ; PRED:       loop.latch:
 ; PRED-NEXT:    [[IV_NEXT]] = add i64 [[IV]], 1
 ; PRED-NEXT:    [[C_1:%.*]] = icmp eq i64 [[IV]], [[N]]
-; PRED-NEXT:    br i1 [[C_1]], label [[EXIT]], label [[LOOP_HEADER]], !llvm.loop [[LOOP20:![0-9]+]]
+; PRED-NEXT:    br i1 [[C_1]], label [[EXIT]], label [[LOOP_HEADER]], !llvm.loop [[LOOP18:![0-9]+]]
 ; PRED:       exit:
 ; PRED-NEXT:    ret i32 0
 ;
@@ -959,7 +880,7 @@ define void @multiple_exit_conditions(ptr %src, ptr noalias %dst) #1 {
 ; PRED-NEXT:    [[ACTIVE_LANE_MASK_NEXT]] = call <vscale x 2 x i1> @llvm.get.active.lane.mask.nxv2i1.i64(i64 [[INDEX]], i64 [[TMP10]])
 ; PRED-NEXT:    [[TMP16:%.*]] = xor <vscale x 2 x i1> [[ACTIVE_LANE_MASK_NEXT]], shufflevector (<vscale x 2 x i1> insertelement (<vscale x 2 x i1> poison, i1 true, i64 0), <vscale x 2 x i1> poison, <vscale x 2 x i32> zeroinitializer)
 ; PRED-NEXT:    [[TMP17:%.*]] = extractelement <vscale x 2 x i1> [[TMP16]], i32 0
-; PRED-NEXT:    br i1 [[TMP17]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP21:![0-9]+]]
+; PRED-NEXT:    br i1 [[TMP17]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP19:![0-9]+]]
 ; PRED:       middle.block:
 ; PRED-NEXT:    br i1 true, label [[EXIT:%.*]], label [[SCALAR_PH]]
 ; PRED:       scalar.ph:
@@ -979,7 +900,7 @@ define void @multiple_exit_conditions(ptr %src, ptr noalias %dst) #1 {
 ; PRED-NEXT:    [[PTR_IV_NEXT]] = getelementptr i8, ptr [[PTR_IV]], i64 8
 ; PRED-NEXT:    [[IV_CLAMP:%.*]] = and i64 [[IV]], 4294967294
 ; PRED-NEXT:    [[EC:%.*]] = icmp eq i64 [[IV_CLAMP]], 512
-; PRED-NEXT:    br i1 [[EC]], label [[EXIT]], label [[LOOP]], !llvm.loop [[LOOP22:![0-9]+]]
+; PRED-NEXT:    br i1 [[EC]], label [[EXIT]], label [[LOOP]], !llvm.loop [[LOOP20:![0-9]+]]
 ; PRED:       exit:
 ; PRED-NEXT:    ret void
 ;
@@ -1003,208 +924,34 @@ exit:
   ret void
 }
 
-define void @low_trip_count_fold_tail_scalarized_store(ptr %dst) {
-; DEFAULT-LABEL: define void @low_trip_count_fold_tail_scalarized_store(
+define void @low_trip_count_store(ptr %dst) {
+; DEFAULT-LABEL: define void @low_trip_count_store(
 ; DEFAULT-SAME: ptr [[DST:%.*]]) {
 ; DEFAULT-NEXT:  entry:
-; DEFAULT-NEXT:    br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
-; DEFAULT:       vector.ph:
-; DEFAULT-NEXT:    br label [[VECTOR_BODY:%.*]]
-; DEFAULT:       vector.body:
-; DEFAULT-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[PRED_STORE_CONTINUE14:%.*]] ]
-; DEFAULT-NEXT:    [[VEC_IND:%.*]] = phi <8 x i64> [ <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7>, [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[PRED_STORE_CONTINUE14]] ]
-; DEFAULT-NEXT:    [[TMP0:%.*]] = trunc i64 [[INDEX]] to i8
-; DEFAULT-NEXT:    [[TMP1:%.*]] = icmp ule <8 x i64> [[VEC_IND]], <i64 6, i64 6, i64 6, i64 6, i64 6, i64 6, i64 6, i64 6>
-; DEFAULT-NEXT:    [[TMP2:%.*]] = extractelement <8 x i1> [[TMP1]], i32 0
-; DEFAULT-NEXT:    br i1 [[TMP2]], label [[PRED_STORE_IF:%.*]], label [[PRED_STORE_CONTINUE:%.*]]
-; DEFAULT:       pred.store.if:
-; DEFAULT-NEXT:    [[TMP3:%.*]] = add i64 [[INDEX]], 0
-; DEFAULT-NEXT:    [[TMP4:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP3]]
-; DEFAULT-NEXT:    [[TMP5:%.*]] = add i8 [[TMP0]], 0
-; DEFAULT-NEXT:    store i8 [[TMP5]], ptr [[TMP4]], align 1
-; DEFAULT-NEXT:    br label [[PRED_STORE_CONTINUE]]
-; DEFAULT:       pred.store.continue:
-; DEFAULT-NEXT:    [[TMP6:%.*]] = extractelement <8 x i1> [[TMP1]], i32 1
-; DEFAULT-NEXT:    br i1 [[TMP6]], label [[PRED_STORE_IF1:%.*]], label [[PRED_STORE_CONTINUE2:%.*]]
-; DEFAULT:       pred.store.if1:
-; DEFAULT-NEXT:    [[TMP7:%.*]] = add i64 [[INDEX]], 1
-; DEFAULT-NEXT:    [[TMP8:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP7]]
-; DEFAULT-NEXT:    [[TMP9:%.*]] = add i8 [[TMP0]], 1
-; DEFAULT-NEXT:    store i8 [[TMP9]], ptr [[TMP8]], align 1
-; DEFAULT-NEXT:    br label [[PRED_STORE_CONTINUE2]]
-; DEFAULT:       pred.store.continue2:
-; DEFAULT-NEXT:    [[TMP10:%.*]] = extractelement <8 x i1> [[TMP1]], i32 2
-; DEFAULT-NEXT:    br i1 [[TMP10]], label [[PRED_STORE_IF3:%.*]], label [[PRED_STORE_CONTINUE4:%.*]]
-; DEFAULT:       pred.store.if3:
-; DEFAULT-NEXT:    [[TMP11:%.*]] = add i64 [[INDEX]], 2
-; DEFAULT-NEXT:    [[TMP12:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP11]]
-; DEFAULT-NEXT:    [[TMP13:%.*]] = add i8 [[TMP0]], 2
-; DEFAULT-NEXT:    store i8 [[TMP13]], ptr [[TMP12]], align 1
-; DEFAULT-NEXT:    br label [[PRED_STORE_CONTINUE4]]
-; DEFAULT:       pred.store.continue4:
-; DEFAULT-NEXT:    [[TMP14:%.*]] = extractelement <8 x i1> [[TMP1]], i32 3
-; DEFAULT-NEXT:    br i1 [[TMP14]], label [[PRED_STORE_IF5:%.*]], label [[PRED_STORE_CONTINUE6:%.*]]
-; DEFAULT:       pred.store.if5:
-; DEFAULT-NEXT:    [[TMP15:%.*]] = add i64 [[INDEX]], 3
-; DEFAULT-NEXT:    [[TMP16:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP15]]
-; DEFAULT-NEXT:    [[TMP17:%.*]] = add i8 [[TMP0]], 3
-; DEFAULT-NEXT:    store i8 [[TMP17]], ptr [[TMP16]], align 1
-; DEFAULT-NEXT:    br label [[PRED_STORE_CONTINUE6]]
-; DEFAULT:       pred.store.continue6:
-; DEFAULT-NEXT:    [[TMP18:%.*]] = extractelement <8 x i1> [[TMP1]], i32 4
-; DEFAULT-NEXT:    br i1 [[TMP18]], label [[PRED_STORE_IF7:%.*]], label [[PRED_STORE_CONTINUE8:%.*]]
-; DEFAULT:       pred.store.if7:
-; DEFAULT-NEXT:    [[TMP19:%.*]] = add i64 [[INDEX]], 4
-; DEFAULT-NEXT:    [[TMP20:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP19]]
-; DEFAULT-NEXT:    [[TMP21:%.*]] = add i8 [[TMP0]], 4
-; DEFAULT-NEXT:    store i8 [[TMP21]], ptr [[TMP20]], align 1
-; DEFAULT-NEXT:    br label [[PRED_STORE_CONTINUE8]]
-; DEFAULT:       pred.store.continue8:
-; DEFAULT-NEXT:    [[TMP22:%.*]] = extractelement <8 x i1> [[TMP1]], i32 5
-; DEFAULT-NEXT:    br i1 [[TMP22]], label [[PRED_STORE_IF9:%.*]], label [[PRED_STORE_CONTINUE10:%.*]]
-; DEFAULT:       pred.store.if9:
-; DEFAULT-NEXT:    [[TMP23:%.*]] = add i64 [[INDEX]], 5
-; DEFAULT-NEXT:    [[TMP24:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP23]]
-; DEFAULT-NEXT:    [[TMP25:%.*]] = add i8 [[TMP0]], 5
-; DEFAULT-NEXT:    store i8 [[TMP25]], ptr [[TMP24]], align 1
-; DEFAULT-NEXT:    br label [[PRED_STORE_CONTINUE10]]
-; DEFAULT:       pred.store.continue10:
-; DEFAULT-NEXT:    [[TMP26:%.*]] = extractelement <8 x i1> [[TMP1]], i32 6
-; DEFAULT-NEXT:    br i1 [[TMP26]], label [[PRED_STORE_IF11:%.*]], label [[PRED_STORE_CONTINUE12:%.*]]
-; DEFAULT:       pred.store.if11:
-; DEFAULT-NEXT:    [[TMP27:%.*]] = add i64 [[INDEX]], 6
-; DEFAULT-NEXT:    [[TMP28:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP27]]
-; DEFAULT-NEXT:    [[TMP29:%.*]] = add i8 [[TMP0]], 6
-; DEFAULT-NEXT:    store i8 [[TMP29]], ptr [[TMP28]], align 1
-; DEFAULT-NEXT:    br label [[PRED_STORE_CONTINUE12]]
-; DEFAULT:       pred.store.continue12:
-; DEFAULT-NEXT:    [[TMP30:%.*]] = extractelement <8 x i1> [[TMP1]], i32 7
-; DEFAULT-NEXT:    br i1 [[TMP30]], label [[PRED_STORE_IF13:%.*]], label [[PRED_STORE_CONTINUE14]]
-; DEFAULT:       pred.store.if13:
-; DEFAULT-NEXT:    [[TMP31:%.*]] = add i64 [[INDEX]], 7
-; DEFAULT-NEXT:    [[TMP32:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP31]]
-; DEFAULT-NEXT:    [[TMP33:%.*]] = add i8 [[TMP0]], 7
-; DEFAULT-NEXT:    store i8 [[TMP33]], ptr [[TMP32]], align 1
-; DEFAULT-NEXT:    br label [[PRED_STORE_CONTINUE14]]
-; DEFAULT:       pred.store.continue14:
-; DEFAULT-NEXT:    [[VEC_IND_NEXT]] = add <8 x i64> [[VEC_IND]], <i64 8, i64 8, i64 8, i64 8, i64 8, i64 8, i64 8, i64 8>
-; DEFAULT-NEXT:    [[INDEX_NEXT]] = add i64 [[INDEX]], 8
-; DEFAULT-NEXT:    br i1 true, label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP26:![0-9]+]]
-; DEFAULT:       middle.block:
-; DEFAULT-NEXT:    br i1 true, label [[EXIT:%.*]], label [[SCALAR_PH]]
-; DEFAULT:       scalar.ph:
-; DEFAULT-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i64 [ 8, [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
 ; DEFAULT-NEXT:    br label [[LOOP:%.*]]
 ; DEFAULT:       loop:
-; DEFAULT-NEXT:    [[IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], [[LOOP]] ]
+; DEFAULT-NEXT:    [[IV:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[IV_NEXT:%.*]], [[LOOP]] ]
 ; DEFAULT-NEXT:    [[IV_TRUNC:%.*]] = trunc i64 [[IV]] to i8
 ; DEFAULT-NEXT:    [[GEP:%.*]] = getelementptr i8, ptr [[DST]], i64 [[IV]]
 ; DEFAULT-NEXT:    store i8 [[IV_TRUNC]], ptr [[GEP]], align 1
 ; DEFAULT-NEXT:    [[IV_NEXT]] = add i64 [[IV]], 1
 ; DEFAULT-NEXT:    [[EC:%.*]] = icmp eq i64 [[IV_NEXT]], 7
-; DEFAULT-NEXT:    br i1 [[EC]], label [[EXIT]], label [[LOOP]], !llvm.loop [[LOOP27:![0-9]+]]
+; DEFAULT-NEXT:    br i1 [[EC]], label [[EXIT:%.*]], label [[LOOP]]
 ; DEFAULT:       exit:
 ; DEFAULT-NEXT:    ret void
 ;
-; PRED-LABEL: define void @low_trip_count_fold_tail_scalarized_store(
+; PRED-LABEL: define void @low_trip_count_store(
 ; PRED-SAME: ptr [[DST:%.*]]) {
 ; PRED-NEXT:  entry:
-; PRED-NEXT:    br i1 false, label [[SCALAR_PH...
[truncated]

; CHECK-NEXT: EMIT vp<[[CMP:%.+]]> = icmp ule ir<%iv>, vp<[[BTC]]>
; CHECK-NEXT: Successor(s): pred.load
Contributor:

At first glance this looks worse than before, unless I'm missing something. It looks like previously we were reusing the same predicated blocks to perform both the load and store, i.e. something like

  %cmp = icmp ... <4 x i32>
  %lane0 = extractelement <4 x i32> %cmp, i32 0
  br i1 %lane0, label %block1.if, label %block1.continue

%block1.if:
  .. do load ..
  .. do store ..
...

whereas now we've essentially split out the loads and stores with duplicate control flow, i.e.

  %cmp = icmp ... <4 x i32>
  %lane0 = extractelement <4 x i32> %cmp, i32 0
  br i1 %lane0, label %block1.load.if, label %block1.load.continue

%block1.load.if:
  .. do load ..
...

%stores:
  %lane0.1 = extractelement <4 x i32> %cmp, i32 0
  br i1 %lane0.1, label %block1.store.if, label %block1.store.continue
...

I'd expect the extra control flow to hurt performance.

Collaborator Author:

This test is a bit strange, as both before and after this patch this VPlan isn't used: the cost model decides the scalar version costs less. As to why it decides this VPlan is the best for this vector length, computePredInstDiscount doesn't currently account for any control-flow instructions it could remove by scalarizing everything. I don't know how it would do that, though, as it's making a decision about whether to scalarize the chain of instructions feeding into a store, and the choice to scalarize the loads happens elsewhere.

Collaborator Author:

Perhaps the best thing to do here is to change the test to check that it's making the correct decision, which is not to vectorize.

// tail folding, as then the vector cost is already the scalar cost. A
// non-negative discount means the vector version of the instruction costs
// more, and scalarizing would be beneficial.
if (!foldTailByMasking() || getWideningDecision(I, VF) != CM_Scalarize) {
Contributor:

I must admit I'm not too familiar with this code and I need some time to understand what effect this change has on the costs, but I'll take a deeper look next week!

Contributor:

OK so this patch is saying that if a block needs predication due to tail-folding and we've already decided to scalarise the instruction for the vector VF, then we shouldn't apply a discount. However, it feels like there are two problems with this:

  1. I don't see why we should restrict this to tail-folding only. If we've also made the widening decision to scalarise for non-tail-folded loops then surely we'd also not want to calculate the discount?
  2. In theory the discount should really be close to zero if we've made the decision to scalarise. Unless I've misunderstood something, it seems like a more fundamental problem here is why VectorCost is larger than ScalarCost for the scenario you're interested in? Perhaps ScalarCost is too low?

I think it would be helpful to add some cost model tests to this patch that have some debug output showing how the costs for each VF change. What seems to be happening with this change is that we're now reporting higher loop costs for VF > 1, and this is leading to the decision not to vectorise at all.

Collaborator Author:

  1. I don't see why we should restrict this to tail-folding only. If we've also made the widening decision to scalarise for non-tail-folded loops then surely we'd also not want to calculate the discount?

This is possibly true, but I tried that and there are a lot more test changes as a result, and briefly looking at them it wasn't immediately obvious whether they were better or worse.

  2. In theory the discount should really be close to zero if we've made the decision to scalarise. Unless I've misunderstood something, it seems like a more fundamental problem here is why VectorCost is larger than ScalarCost for the scenario you're interested in? Perhaps ScalarCost is too low?

Looking at low_trip_count_store in llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll, the current sequence of events (for a vector length of 4) is:

  • getMemInstScalarizationCost is called on the store, which needs to be scalarized and predicated because of tail masking, and calculates a cost of 20, which includes the cost of doing the compare and branch into the predicated block.
  • computePredInstDiscount uses this as the VectorCost, as it uses getInstructionCost, which for scalarized instructions returns the cost value in InstsToScalarize.
  • The calculation of ScalarCost assumes that the predicated block already exists, so it just calculates the cost of moving the instruction into the predicated block and scalarizing it there, which it calculates as 4 (for 4 copies of the store).
  • This results in a discount of 16, causing 20-16=4 to be used as the cost of the scalarized store.

ScalarCost is too low for the scalarized store, because it's assuming that the predicated block already exists, but the cost of the predicated block is exactly what we need to take into account to avoid pointless tail folding by masking.
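
To make that arithmetic concrete, here is a small standalone sketch of the numbers described above (plain integers and illustrative names only; this is not the actual LLVM code):

#include <cassert>

int main() {
  // Numbers from the VF=4 walkthrough above.
  int VectorCost = 20; // getMemInstScalarizationCost: scalarized store plus the
                       // compare/branch needed to form each predicated block
  int ScalarCost = 4;  // computePredInstDiscount's estimate, which assumes the
                       // predicated blocks already exist
  int Discount = VectorCost - ScalarCost; // 16
  int CostUsed = VectorCost - Discount;   // 20 - 16 = 4, far below the real cost
  assert(Discount == 16 && CostUsed == 4);
  return 0;
}

With the patch, the discount is simply not applied to instructions scalarized due to tail folding, so the full cost of 20 is kept.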

I think it would be helpful to add some cost model tests to this patch that have some debug output showing how the costs for each VF change. What seems to be happening with this change is that we're now reporting higher loop costs for VF > 1, and this is leading to the decision not to vectorise at all.

I'll add these tests.

; CHECK: LV: Vector loop of width 4 costs: 4.
; CHECK: LV: Found an estimated cost of 0 for VF 8 For instruction: %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
; CHECK: LV: Found an estimated cost of 0 for VF 8 For instruction: %gep = getelementptr i8, ptr %dst, i64 %iv
; CHECK: LV: Found an estimated cost of 32 for VF 8 For instruction: store i8 1, ptr %gep, align 1
Contributor:

Thanks for adding the test! However, I still don't know how your patch affects the cost model. For example, what were the costs before your change since that helps to understand why this patch now chooses VF=1. I suspect what your change does is increase the effective cost of the store by saying that we shouldn't scalarise it. However, my concern here is that we might be fixing this the wrong way. For example, it sounds like we could achieve the same effect if in computePredInstDiscount we increased the scalarisation cost to match the vector cost, so that the discount disappears?

Let's suppose we force the vectoriser to choose VF=8 and in one scenario we choose to scalarise, and in the other we vectorise what would the end result look like? If the generated code looks equally bad in both cases then I'd expect the vector cost and scalar cost to be the same. I worry that we're effectively hiding a bug in the cost model by simply bypassing it for tail-folded loops, however the problem may still remain for normal loops with control flow.

Collaborator Author:

Thanks for adding the test! However, I still don't know how your patch affects the cost model. For example, what were the costs before your change since that helps to understand why this patch now chooses VF=1

Current costs, without this patch:

LV: Found an estimated cost of 0 for VF 1 For instruction:   %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
LV: Found an estimated cost of 0 for VF 1 For instruction:   %iv.trunc = trunc i64 %iv to i8
LV: Found an estimated cost of 0 for VF 1 For instruction:   %gep = getelementptr i8, ptr %dst, i64 %iv
LV: Found an estimated cost of 2 for VF 1 For instruction:   store i8 %iv.trunc, ptr %gep, align 1
LV: Found an estimated cost of 1 for VF 1 For instruction:   %iv.next = add i64 %iv, 1
LV: Found an estimated cost of 1 for VF 1 For instruction:   %ec = icmp eq i64 %iv.next, 7
LV: Found an estimated cost of 0 for VF 1 For instruction:   br i1 %ec, label %exit, label %loop
LV: Scalar loop costs: 4.
LV: Found an estimated cost of 0 for VF 2 For instruction:   %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
LV: Found an estimated cost of 2 for VF 2 For instruction:   %iv.trunc = trunc i64 %iv to i8
LV: Found an estimated cost of 0 for VF 2 For instruction:   %gep = getelementptr i8, ptr %dst, i64 %iv
LV: Found an estimated cost of 2 for VF 2 For instruction:   store i8 %iv.trunc, ptr %gep, align 1
LV: Found an estimated cost of 1 for VF 2 For instruction:   %iv.next = add i64 %iv, 1
LV: Found an estimated cost of 1 for VF 2 For instruction:   %ec = icmp eq i64 %iv.next, 7
LV: Found an estimated cost of 0 for VF 2 For instruction:   br i1 %ec, label %exit, label %loop
LV: Vector loop of width 2 costs: 3.
LV: Found an estimated cost of 0 for VF 4 For instruction:   %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
LV: Found an estimated cost of 4 for VF 4 For instruction:   %iv.trunc = trunc i64 %iv to i8
LV: Found an estimated cost of 0 for VF 4 For instruction:   %gep = getelementptr i8, ptr %dst, i64 %iv
LV: Found an estimated cost of 4 for VF 4 For instruction:   store i8 %iv.trunc, ptr %gep, align 1
LV: Found an estimated cost of 2 for VF 4 For instruction:   %iv.next = add i64 %iv, 1
LV: Found an estimated cost of 1 for VF 4 For instruction:   %ec = icmp eq i64 %iv.next, 7
LV: Found an estimated cost of 0 for VF 4 For instruction:   br i1 %ec, label %exit, label %loop
LV: Vector loop of width 4 costs: 2.
LV: Found an estimated cost of 0 for VF 8 For instruction:   %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
LV: Found an estimated cost of 8 for VF 8 For instruction:   %iv.trunc = trunc i64 %iv to i8
LV: Found an estimated cost of 0 for VF 8 For instruction:   %gep = getelementptr i8, ptr %dst, i64 %iv
LV: Found an estimated cost of 8 for VF 8 For instruction:   store i8 %iv.trunc, ptr %gep, align 1
LV: Found an estimated cost of 4 for VF 8 For instruction:   %iv.next = add i64 %iv, 1
LV: Found an estimated cost of 1 for VF 8 For instruction:   %ec = icmp eq i64 %iv.next, 7
LV: Found an estimated cost of 0 for VF 8 For instruction:   br i1 %ec, label %exit, label %loop
LV: Vector loop of width 8 costs: 2.
LV: Found an estimated cost of 0 for VF 16 For instruction:   %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
LV: Found an estimated cost of 16 for VF 16 For instruction:   %iv.trunc = trunc i64 %iv to i8
LV: Found an estimated cost of 0 for VF 16 For instruction:   %gep = getelementptr i8, ptr %dst, i64 %iv
LV: Found an estimated cost of 16 for VF 16 For instruction:   store i8 %iv.trunc, ptr %gep, align 1
LV: Found an estimated cost of 8 for VF 16 For instruction:   %iv.next = add i64 %iv, 1
LV: Found an estimated cost of 1 for VF 16 For instruction:   %ec = icmp eq i64 %iv.next, 7
LV: Found an estimated cost of 0 for VF 16 For instruction:   br i1 %ec, label %exit, label %loop
LV: Vector loop of width 16 costs: 2.
LV: Selecting VF: 8.

However, my concern here is that we might be fixing this the wrong way. For example, it sounds like we could achieve the same effect if in computePredInstDiscount we increased the scalarisation cost to match the vector cost, so that the discount disappears?

That sounds like doing more work for the same end result? We already know that there can be no discount, because the "vector cost" for an instruction we've already decided to scalarize is already the scalarized cost, so there's no point in calculating the same thing again.

Let's suppose we force the vectoriser to choose VF=8 and in one scenario we choose to scalarise, and in the other we vectorise what would the end result look like? If the generated code looks equally bad in both cases then I'd expect the vector cost and scalar cost to be the same. I worry that we're effectively hiding a bug in the cost model by simply bypassing it for tail-folded loops, however the problem may still remain for normal loops with control flow.

If we've already decided to scalarize by the time we've called computePredInstDiscount (i.e. getWideningDecision returns CM_Scalarize) then using a vector instruction is impossible, because no vector instruction exists for the operation we're trying to do (in this case we're not using SVE, so we don't have a masked vector store instruction).

Contributor:

So the end result may be the same, but by fixing the cost model we hopefully fix one of the root causes and potentially fix other hidden problems that haven't yet come to light.

Also, another root cause here is that we shouldn't even be attempting to tail-fold loops with loads and stores in them if the target does not support masked loads and stores. It's a poor upfront decision made by the vectoriser with no analysis of the trade-off between tail-folding and not tail-folding.

One potential alternative to this patch that I think is worth exploring is before deciding to tail-fold for low trip count loops we can query the target's support for masked loads and stores (see isLegalMaskedStore and isLegalMaskedLoad in LoopVectorize.cpp). If the target says they are not legal just don't bother tail-folding as it's almost certainly not worth it. Perhaps the end result may be even faster, since we may create an unpredicated vector body for the first X iterations? What do you think? @fhahn any thoughts?

Collaborator Author:

Also, another root cause here is that we shouldn't even be attempting to tail-fold loops with loads and stores in them if the target does not support masked loads and stores. It's a poor upfront decision made by the vectoriser with no analysis of the trade-off between tail-folding and not tail-folding.

It appears it used to be the case that not having masked loads and stores disabled tail folding, but 9d1b2acaaa224 changed it to using a cost model instead.

One potential alternative to this patch that I think is worth exploring is before deciding to tail-fold for low trip count loops we can query the target's support for masked loads and stores (see isLegalMaskedStore and isLegalMaskedLoad in LoopVectorize.cpp). If the target says they are not legal just don't bother tail-folding as it's almost certainly not worth it. Perhaps the end result may be even faster, since we may create an unpredicated vector body for the first X iterations? What do you think? @fhahn any thoughts?

TargetTransformInfo::preferPredicateOverEpilogue already defaults to preferring a scalar epilogue, and the two targets that implement their own version (AArch64 and ARM) prefer a scalar epilogue when we don't have masked loads and stores. So the only situation where we even consider tail predication without masked loads/stores is when using a scalar epilogue is impossible, and the choice is then between vectorizing using tail folding or not vectorizing at all.

Contributor:

Hi, I don't actually mean changing preferPredicateOverEpilogue, because this is a top-level decision that's used in general for all loops, regardless of the trip count or code-size concerns due to -Os/-Oz. I meant the code where we decide to vectorise due to the low trip count, which can still happen even when preferPredicateOverEpilogue returns false. For example, if you look in the function LoopVectorizationCostModel::computeMaxVF at this code here:

  // If we don't know the precise trip count, or if the trip count that we
  // found modulo the vectorization factor is not zero, try to fold the tail
  // by masking.
  // FIXME: look for a smaller MaxVF that does divide TC rather than masking.
  setTailFoldingStyles(MaxFactors.ScalableVF.isScalable(), UserIC);
  if (foldTailByMasking()) {

I added this extra code just before:

  if (ScalarEpilogueStatus == CM_ScalarEpilogueNotAllowedLowTripLoop) {
    for (BasicBlock *BB : TheLoop->getBlocks()) {
      for (Instruction &I : *BB) {
        if (isa<LoadInst>(&I) || isa<StoreInst>(&I)) {
          auto *Ptr = getLoadStorePointerOperand(&I);
          auto *Ty = getLoadStoreType(&I);
          const Align Alignment = getLoadStoreAlignment(&I);
          if ((isa<LoadInst>(&I) && !isLegalMaskedLoad(Ty, Ptr, Alignment)) ||
              (isa<StoreInst>(&I) && !isLegalMaskedStore(Ty, Ptr, Alignment))) {
            LLVM_DEBUG(dbgs() << "LV: Not tail-folding due to lack of masked load/store support\n");
            // We could also just return FixedScalableVFPair::getNone() here.
            ScalarEpilogueStatus = CM_ScalarEpilogueAllowed;
            return MaxFactors;
          }
        }
      }
    }
  }

And now the code for @store_const_fixed_trip_count looks something like this:

define void @store_const_fixed_trip_count(ptr %dst) {
entry:
  br i1 false, label %scalar.ph, label %vector.ph

vector.ph:                                        ; preds = %entry
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %0 = add i64 %index, 0
  %1 = getelementptr i8, ptr %dst, i64 %0
  %2 = getelementptr i8, ptr %1, i32 0
  store <4 x i8> <i8 1, i8 1, i8 1, i8 1>, ptr %2, align 1
  %index.next = add nuw i64 %index, 4
  %3 = icmp eq i64 %index.next, 4
  br i1 %3, label %middle.block, label %vector.body, !llvm.loop !0

middle.block:                                     ; preds = %vector.body
  br i1 false, label %exit, label %scalar.ph

scalar.ph:                                        ; preds = %middle.block, %entry
  %bc.resume.val = phi i64 [ 4, %middle.block ], [ 0, %entry ]
  br label %loop

loop:                                             ; preds = %loop, %scalar.ph
  %iv = phi i64 [ %bc.resume.val, %scalar.ph ], [ %iv.next, %loop ]
  %gep = getelementptr i8, ptr %dst, i64 %iv
  store i8 1, ptr %gep, align 1
  %iv.next = add i64 %iv, 1
  %ec = icmp eq i64 %iv.next, 7
  br i1 %ec, label %exit, label %loop, !llvm.loop !3

exit:                                             ; preds = %middle.block, %loop
  ret void
}

The fix you have for computePredInstDiscount could be used as a workaround for now, but ultimately I think the best solution would be to avoid tail-folding altogether.

Contributor:

IIRC the reason to force tail-folding with Oz/Os is to limit code size by avoiding having a vector loop and a scalar remainder loop.

This of course goes completely wrong in this case, as the code size with tail folding is much higher than with a vector and scalar loop.

Perhaps we should directly check for that, i.e. by avoiding vectorization that requires (predicated) replication with Oz/Os?

Collaborator Author:

The fix you have for computePredInstDiscount could be used as a workaround for now, but ultimately I think the best solution would be to avoid tail-folding altogether.

There are situations where tail-folding is faster than not vectorizing, e.g. the following example:

__attribute__((noinline))
void fn(char *p, char a, char b, char c) {
  for (char i = 0; i < 7; ++i) {
    p[i] = i*a + (i>>1)*b + (i>>2)*c;
  }
}

int main() {
  char arr[256];
  fn(arr, 3, 5, 7);
  return arr[0];
}

On cortex-a53 compiling with -O3 -fno-loop-unroll, fn takes 65 cycles if we vectorize using tail folding and 74 cycles if we don't vectorize. I don't know how realistic this is as an example though, as when loop unrolling isn't disabled the loop gets unrolled and that takes 43 cycles, but it does show that there are situations where just disabling tail folding outright leads to a worse result.

Also, in general I think it's better to make decisions based on calculating the cost of things if we can, unless we can guarantee that a certain decision is definitely going to be worse, which we can't do here. I also don't think the change I'm making to computePredInstDiscount is a workaround, as what it's doing is making the cost we calculate closer to the actual cost, and more accurate costs should lead to better decisions.

Collaborator Author:

IIRC the reason to force tail-folding with Oz/Os is to limit code size by avoiding having a vector loop and a scalar remainder loop.

This of course goes completely wrong in this case, as the code size with tail folding is much higher than with a vector and scalar loop.

Perhaps we should directly check for that, i.e. by avoiding vectorization that requires (predicated) replication with Oz/Os?

Perhaps, but there's a general problem that the current cost calculations for Os are just wrong: we use TCK_RecipThroughput everywhere, and for predicated instructions we scale the cost by the block probability. I'll be working on a patch after this to try and fix at least some of this.
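
As a rough illustration of that mismatch (illustrative numbers and variable names only; not the actual code): the reciprocal-throughput cost of a predicated scalarized store is halved because the predicated block is assumed to execute on about half the iterations (getReciprocalPredBlockProb() returns 2), while the emitted code still contains the full per-lane compare/branch/store chain:

#include <iostream>

int main() {
  // Throughput view (what the cost model computes today).
  int VF = 8;
  int PerLaneCost = 1;                       // illustrative per-lane store cost
  int ThroughputCost = VF * PerLaneCost / 2; // halved by the block probability

  // Code-size view (what -Os actually cares about): each lane still emits an
  // extractelement, a conditional branch, the store and a continue block,
  // however rarely those blocks execute.
  int PerLaneInstructions = 4;               // illustrative
  int EmittedInstructions = VF * PerLaneInstructions;

  std::cout << "throughput cost " << ThroughputCost << " vs roughly "
            << EmittedInstructions << " emitted instructions\n";
  return 0;
}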

david-arm (Contributor) commented Oct 18, 2024

I compiled these tests that all end up using predicated blocks, using the command "clang -O3 -mcpu=generic -fno-unroll-loops -S ./pr109289.c":

void lowtrip(short *dest) {
  for (int i = 0; i < 15; i++) {
    dest[i] = 1;
  }
}

void conditional(short *dest, int n) {
  for (int i = 0; i < 16; i++) {
    if (i & 0x1)
      dest[i] = 1;  
  }
}

Compile this test with "clang -O3 -mcpu=generic -fno-unroll-loops -S ./pr109289_optsize.c":

void optsize(short *dest, short val, int n) {
  for (int i = 0; i < n; i++) { 
    dest[i] = val;
  }
} 

The IR before the loop vectoriser looks like this:

== pr109289.c ==

target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128-Fn32"
target triple = "aarch64-unknown-linux-gnu"

; Function Attrs: nofree norecurse nosync nounwind memory(argmem: write) uwtable
define dso_local void @lowtrip(ptr nocapture noundef writeonly %dest) local_unnamed_addr #0 {
entry:
  br label %for.body
 
for.cond.cleanup:                                 ; preds = %for.body
  ret void

for.body:                                         ; preds = %entry, %for.body
  %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
  %arrayidx = getelementptr inbounds i16, ptr %dest, i64 %indvars.iv
  store i16 1, ptr %arrayidx, align 2
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond.not = icmp eq i64 %indvars.iv.next, 15
  br i1 %exitcond.not, label %for.cond.cleanup, label %for.body
}

; Function Attrs: nofree norecurse nosync nounwind memory(argmem: readwrite) uwtable
define dso_local void @conditional(ptr nocapture noundef writeonly %dest, i32 noundef %n) local_unnamed_addr #0 {
entry:
  br label %for.body

for.cond.cleanup:                                 ; preds = %for.inc
  ret void
  
for.body:                                         ; preds = %entry, %for.inc
  %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.inc ]
  %and6 = and i64 %indvars.iv, 1
  %tobool.not = icmp eq i64 %and6, 0
  br i1 %tobool.not, label %for.inc, label %if.then
  
if.then:                                          ; preds = %for.body
  %arrayidx = getelementptr inbounds i16, ptr %dest, i64 %indvars.iv
  store i16 1, ptr %arrayidx, align 2
  br label %for.inc
  
for.inc:                                          ; preds = %for.body, %if.then
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond.not = icmp eq i64 %indvars.iv.next, 16
  br i1 %exitcond.not, label %for.cond.cleanup, label %for.body
}
  
attributes #0 = { "target-cpu"="generic" }

== pr109289_optsize.c ==

target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128-Fn32"
target triple = "aarch64-unknown-linux-gnu"
  
; Function Attrs: nofree norecurse nosync nounwind memory(argmem: write) uwtable
define dso_local void @optsize(ptr nocapture noundef writeonly %dest, i16 noundef %val, i32 noundef %n) local_unnamed_addr #0 {
entry:
  %cmp3 = icmp sgt i32 %n, 0
  br i1 %cmp3, label %for.body.preheader, label %for.cond.cleanup   
  
for.body.preheader:                               ; preds = %entry
  %wide.trip.count = zext nneg i32 %n to i64
  br label %for.body
  
for.cond.cleanup.loopexit:                        ; preds = %for.body
  br label %for.cond.cleanup
 
for.cond.cleanup:                                 ; preds = %for.cond.cleanup.loopexit, %entry
  ret void

for.body:                                         ; preds = %for.body.preheader, %for.body
  %indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.next, %for.body ]
  %arrayidx = getelementptr inbounds i16, ptr %dest, i64 %indvars.iv
  store i16 %val, ptr %arrayidx, align 2   
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond.not = icmp eq i64 %indvars.iv.next, %wide.trip.count
  br i1 %exitcond.not, label %for.cond.cleanup.loopexit, label %for.body
}

attributes #0 = { optsize "target-cpu"="generic" }

In all 3 tests the vectoriser decides to vectorise as follows:

== lowtrip ==
LV: Not allowing scalar epilogue due to low trip count.
LV: Found an estimated cost of 2 for VF 1 For instruction: store i16 1
LV: Scalar loop costs: 4.
LV: Found an estimated cost of 2 for VF 2 For instruction: store i16 1
LV: Vector loop of width 2 costs: 2.
LV: Found an estimated cost of 4 for VF 4 For instruction: store i16 1
LV: Vector loop of width 4 costs: 1.
LV: Found an estimated cost of 8 for VF 8 For instruction: store i16 1
LV: Vector loop of width 8 costs: 1.
LV: Selecting VF: 8.
Executing best plan with VF=8, UF=1

== conditional ==
LV: Found an estimated cost of 0 for VF 1 For instruction: br i1
LV: Found an estimated cost of 2 for VF 1 For instruction: store i16 1
LV: Scalar loop costs: 4.
LV: Found an estimated cost of 4 for VF 2 For instruction: br i1
LV: Found an estimated cost of 2 for VF 2 For instruction: store i16 1
LV: Vector loop of width 2 costs: 5.
LV: Found an estimated cost of 8 for VF 4 For instruction: br i1
LV: Found an estimated cost of 4 for VF 4 For instruction: store i16 1
LV: Vector loop of width 4 costs: 4.
LV: Found an estimated cost of 16 for VF 8 For instruction: br i1
LV: Found an estimated cost of 8 for VF 8 For instruction: store i16 1
LV: Vector loop of width 8 costs: 4.
LV: Selecting VF: 1.

== optsize ==
LV: Not allowing scalar epilogue due to -Os/-Oz.
LV: Found an estimated cost of 2 for VF 1 For instruction: store i16
LV: Scalar loop costs: 4.
LV: Found an estimated cost of 2 for VF 2 For instruction: store i16
LV: Vector loop of width 2 costs: 2.
LV: Found an estimated cost of 4 for VF 4 For instruction: store i16 %val
LV: Vector loop of width 4 costs: 1.
LV: Found an estimated cost of 8 for VF 8 For instruction: store i16 %val
LV: Vector loop of width 8 costs: 1.
LV: Selecting VF: 8.
Executing best plan with VF=8, UF=1

In all three cases we have made the wrong decision based on the currently generated code. These are some example runtimes with vectorisation of all 3 (I had to force vectorisation of conditional) on neoverse-v1 (I ran each function in a loop many times and timed them):

lowtrip: 50 ms
conditional: 31 ms
optsize: 47 ms

Without vectorisation on neoverse-v1:

lowtrip: 26 ms
conditional: 45 ms
optsize: 26 ms

On the surface it appears for lowtrip and optsize we shouldn't be vectorising, but for conditional we should. When I dug into this it seems that although conditional should suffer the same problem as lowtrip and optsize, the codegen is actually much better. This is the final IR for lowtrip and conditional:

define dso_local void @lowtrip(ptr nocapture noundef writeonly %dest) local_unnamed_addr #0 {
entry:
  br label %vector.body

vector.body:                                      ; preds = %pred.store.continue18, %entry
  %index = phi i64 [ 0, %entry ], [ %index.next, %pred.store.continue18 ]
  %vec.ind = phi <8 x i64> [ <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7>, %entry ], [ %vec.ind.next, %pred.store.continue18 ]
  %0 = icmp ult <8 x i64> %vec.ind, <i64 15, i64 15, i64 15, i64 15, i64 15, i64 15, i64 15, i64 15>
  %1 = extractelement <8 x i1> %0, i64 0
  br i1 %1, label %pred.store.if, label %pred.store.continue

...

define dso_local void @conditional(ptr nocapture noundef writeonly %dest, i32 noundef %n) local_unnamed_addr #0 {
entry:
  br label %vector.body

vector.body:                                      ; preds = %pred.store.continue20, %entry
  %index = phi i64 [ 0, %entry ], [ %index.next, %pred.store.continue20 ]
  %vec.ind = phi <8 x i64> [ <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7>, %entry ], [ %vec.ind.next, %pred.store.continue20 ]
  %0 = trunc <8 x i64> %vec.ind to <8 x i1>
  %1 = extractelement <8 x i1> %0, i64 0
  br i1 %1, label %pred.store.if, label %pred.store.continue
  
...

You can see what's happened after the CodeGenPrepare pass for each test:

define dso_local void @lowtrip(ptr nocapture noundef writeonly %dest) local_unnamed_addr #0 {
entry:
  %scevgep = getelementptr i8, ptr %dest, i64 8
  br label %vector.body

vector.body:                                      ; preds = %pred.store.continue18, %entry
  %lsr.iv1 = phi ptr [ %scevgep2, %pred.store.continue18 ], [ %scevgep, %entry ]
  %lsr.iv = phi i64 [ %lsr.iv.next, %pred.store.continue18 ], [ 16, %entry ]
  %vec.ind = phi <8 x i64> [ <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7>, %entry ], [ %vec.ind.next, %pred.store.continue18 ]
  %0 = icmp ult <8 x i64> %vec.ind, <i64 15, i64 15, i64 15, i64 15, i64 15, i64 15, i64 15, i64 15>
  %1 = extractelement <8 x i1> %0, i64 0   
  br i1 %1, label %pred.store.if, label %pred.store.continue
  
pred.store.if:                                    ; preds = %vector.body
  %scevgep3 = getelementptr i8, ptr %lsr.iv1, i64 -8
  store i16 1, ptr %scevgep3, align 2, !tbaa !6
  br label %pred.store.continue

pred.store.continue:                              ; preds = %pred.store.if, %vector.body
  %2 = icmp ult <8 x i64> %vec.ind, <i64 15, i64 15, i64 15, i64 15, i64 15, i64 15, i64 15, i64 15>
  %3 = extractelement <8 x i1> %2, i64 1
  br i1 %3, label %pred.store.if5, label %pred.store.continue6

pred.store.if5:                                   ; preds = %pred.store.continue
  %scevgep5 = getelementptr i8, ptr %lsr.iv1, i64 -6
  store i16 1, ptr %scevgep5, align 2, !tbaa !6
  br label %pred.store.continue6

pred.store.continue6:                             ; preds = %pred.store.if5, %pred.store.continue
  %4 = icmp ult <8 x i64> %vec.ind, <i64 15, i64 15, i64 15, i64 15, i64 15, i64 15, i64 15, i64 15>
  %5 = extractelement <8 x i1> %4, i64 2
  br i1 %5, label %pred.store.if7, label %pred.store.continue8
  
...

define dso_local void @conditional(ptr nocapture noundef writeonly %dest, i32 noundef %n) local_unnamed_addr #0 {
entry:
  %scevgep = getelementptr i8, ptr %dest, i64 8
  br label %vector.body

vector.body:                                      ; preds = %pred.store.continue20, %entry
  %lsr.iv1 = phi ptr [ %scevgep2, %pred.store.continue20 ], [ %scevgep, %entry ]
  %lsr.iv = phi i64 [ %lsr.iv.next, %pred.store.continue20 ], [ 16, %entry ]
  %vec.ind = phi <8 x i64> [ <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7>, %entry ], [ %vec.ind.next, %pred.store.continue20 ]
  %0 = trunc <8 x i64> %vec.ind to <8 x i1>
  %1 = extractelement <8 x i1> %0, i64 0
  br i1 %1, label %pred.store.if, label %pred.store.continue

pred.store.if:                                    ; preds = %vector.body
  %scevgep3 = getelementptr i8, ptr %lsr.iv1, i64 -8
  store i16 1, ptr %scevgep3, align 2, !tbaa !6
  br label %pred.store.continue
  
pred.store.continue:                              ; preds = %pred.store.if, %vector.body
  %2 = extractelement <8 x i1> %0, i64 1
  br i1 %2, label %pred.store.if7, label %pred.store.continue8

pred.store.if7:                                   ; preds = %pred.store.continue
  %scevgep5 = getelementptr i8, ptr %lsr.iv1, i64 -6
  store i16 1, ptr %scevgep5, align 2, !tbaa !6
  br label %pred.store.continue8

pred.store.continue8:                             ; preds = %pred.store.if7, %pred.store.continue
  %3 = extractelement <8 x i1> %0, i64 2
  br i1 %3, label %pred.store.if9, label %pred.store.continue10

...

For the lowtrip and optsize cases the "icmp ult" is being sunk by the CodeGenPrepare pass into the blocks below, which is why the resulting assembly ends up being terrible. However, if we calculated the icmp just once in the loop header and then did an extractelement for each predicated block we'd end up with the same efficient code as the conditional function. In summary, I think we should instead fix the codegen for the lowtrip case. If we do that, then I expect the performance to be better than a scalar loop.

In this case I agree with Florian that the best solution to the bug you're trying to fix is to limit the scope of the patch to just -Os and fix this before going through the cost model. The change in this PR applies to more than just programs compiled with -Os; it also affects low-trip-count loops and targets that prefer tail-folding. It's not obvious to me that always preferring a scalar loop is the right way to go without further analysis of the impact.

Instead, I think that if you're building with -Os and know that the blocks are going to be predicated and replicated due to tail-folding, then we should just bail out immediately. For example, I think that collectInstsToScalarize sets up a list of PredicatedBBsAfterVectorization which you might be able to check in LoopVectorizationPlanner::plan.
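
As a minimal sketch of that idea (assuming the planner can see the cost model's PredicatedBBsAfterVectorization set; the helper name and parameters are illustrative, not the actual LLVM interface):

```cpp
#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Function.h"

using namespace llvm;

// Sketch only: bail out of vectorization early when optimizing for size and
// tail folding would leave predicated, replicated blocks behind.
static bool shouldAbandonVectorization(
    const Function &F, const SmallPtrSetImpl<BasicBlock *> &PredicatedBBs) {
  // Under -Os/-Oz every replicated predicated block is pure code-size cost,
  // so don't bother costing a vector plan at all.
  return F.hasOptSize() && !PredicatedBBs.empty();
}
```

Exactly where such a check lives is a detail; the point is to give up before going through the cost model at all.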

It looks like there is also follow-on work:

  1. Improve the lowering of lowtrip so that it generates more efficient code; I hope/expect that vectorising will then be better than scalar.
  2. Teach the cost model that, for the conditional loop, vectorising is actually more beneficial than a scalar loop, at least on neoverse-v1.

@david-arm
Contributor

Sorry, please ignore my comment about icmp ult <8 x i64> %vec.ind, <i64 15, i64 15, i64 15, i64 15, i64 15, i64 15, i64 15, i64 15> being folded away - that's bogus because the last lane will be false.

david-arm added a commit to david-arm/llvm-project that referenced this pull request Oct 21, 2024
Whilst reviewing PR llvm#109289 and doing some analysis with various
tests involving predicated blocks I noticed that we're making
codegen and performance worse by sinking vector comparisons
multiple times into blocks. It looks like the sinkCmpExpression
in CodeGenPrepare was written for scalar comparisons where there
is only a single condition register, whereas vector comparisons
typically produce a vector result and register pressure is much
lower. Given they are likely to be more expensive than scalar
comparisons it makes sense to avoid sinking too many. The
CodeGen/SystemZ/vec-perm-14.ll test does rely upon sinking a
vector comparison so I've kept that behaviour by permitting one
sink.

Alternatively, I could also introduce a TLI hook to query the
target if this is a preferred solution?
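
Roughly, the kind of check described above looks something like this sketch (the names are illustrative; this is not the actual CodeGenPrepare change):

```cpp
#include "llvm/IR/InstrTypes.h"

using namespace llvm;

// Sketch only: when deciding whether to sink a comparison into the blocks of
// its users, treat vector compares differently from scalar ones. A scalar
// compare ties up the single flags register, so sinking it is usually a win;
// a vector compare lives in an ordinary vector register, so sinking it into
// several blocks just duplicates the work.
static bool worthSinkingCmp(const CmpInst *Cmp, unsigned NumSinksSoFar) {
  if (!Cmp->getType()->isVectorTy())
    return true;              // scalar compare: keep the existing behaviour
  return NumSinksSoFar < 1;   // vector compare: permit at most one sink
}
```
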
@john-brawn-arm
Collaborator Author

john-brawn-arm commented Oct 22, 2024

> For the lowtrip and optsize cases the "icmp ult" is being sunk by the CodeGenPrepare pass into the blocks below, which is why the resulting assembly ends up being terrible. However, if we calculated the icmp just once in the loop header and then did an extractelement for each predicated block we'd end up with the same efficient code as the conditional function. In summary, I think we should instead fix the codegen for the lowtrip case. If we do that, then I expect the performance to be better than a scalar loop.

I don't think this is correct. In the function "conditional", without vectorizing, the "i & 0x1" results in 16 comparisons (one for each iteration), and vectorizing by 8 reduces that to two vector comparisons, so vectorizing reduces the number of comparison instructions. In "lowtrip", the tail folding introduces two vector comparisons where previously there were none, so vectorizing increases the number of comparison instructions.

I went and measured this on cortex-a53, and the total time for 128 iterations of the lowtrip function is:
no vectorization: 7531
vectorization by 8: 28750
vectorization by 8 with codegenprepare hacked to not sink the cmp: 9663

> In this case I agree with Florian that the best solution to the bug you're trying to fix is to limit the scope of the patch to just -Os and fix this before going through the cost model. The change in this PR applies not only to programs compiled with -Os, but also to low trip count loops and to targets that prefer tail-folding. It's not obvious to me that always preferring a scalar loop is the right way to go without further analysis of the impact.

This patch doesn't always prefer a scalar loop. It fixes an error in the cost calculation, which means we now choose a scalar loop when it is faster than the vector loop. My example in the comment above (#109289 (comment)) shows a case where, with this patch, we still vectorize, because the calculated cost shows that the vector loop is faster than a scalar loop.

david-arm added a commit to david-arm/llvm-project that referenced this pull request Oct 24, 2024
Whilst reviewing PR llvm#109289 and doing some analysis with various
tests involving predicated blocks I noticed that we're making
codegen and performance worse by sinking vector comparisons
multiple times into blocks. It looks like the sinkCmpExpression
in CodeGenPrepare was written for scalar comparisons where there
is only a single condition register, whereas vector comparisons
typically produce a vector result. For some targets, such as NEON
or SVE, there are multiple allocatable vector registers that can
store the result and so we should avoid sinking in that case.
david-arm added a commit to david-arm/llvm-project that referenced this pull request Oct 25, 2024
@john-brawn-arm
Collaborator Author

Ping

david-arm added a commit to david-arm/llvm-project that referenced this pull request Nov 5, 2024

david-arm added a commit to david-arm/llvm-project that referenced this pull request Nov 5, 2024
@john-brawn-arm
Collaborator Author

Ping.

@david-arm
Contributor

Hi @john-brawn-arm, thanks for taking the time to look into the
example I mentioned in my previous commit. However, this still
isn't my preferred approach to solving issue #66652
because it's using a performance-based cost model (that can be
unreliable) to try and solve what to me looks like a functional
defect. Here are two examples of problems with this:

  1. As you mentioned yourself, there are still loops where we
    vectorise with predicated blocks at -Os, so this patch is
    only a partial fix. A more complete solution would involve
    dealing with the root cause - @fhahn gave a suggestion about
    how to do this. Did you get a chance to look at this and see
    if it's viable or easy to do?
  2. The change in this PR is effectively dealing with two
    aspects of tail-folding simultaneously: a) low trip count loops
    when building with -O2 or above, and b) normal loops when
    building with -Os. The point I was trying to make in my
    previous message (and I realise now I may not have made this
    very clear) is that the two could potentially be in
    conflict with each other. I don't think it's impossible to
    come up with a loop where the vectorised version with
    predicated blocks is faster than the scalar, whilst
    simultaneously being bad for code size. There is a chance
    that someone reverts this PR later on for exactly this
    reason.

Ideally we'd tackle the code size issue presented in #66652
and the low trip count performance issues with separate PRs,
using solutions that aren't tied to each other. I'd be
interested to know what @fhahn thinks as well here.

@john-brawn-arm
Collaborator Author

The general problem with vectorization cost calculations is, as I've mentioned before, that everything uses reciprocal throughput as the cost kind, but we should be using code size when optimizing for code size. I was going to fix that after this, as it's a much larger change, and we'll still need this patch either way, since the cost calculation it fixes is currently wrong no matter which cost kind is used. I'll get to work on that first and then get back to this.
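
As a rough sketch of that direction (a hypothetical helper, not the actual LoopVectorize code), the cost kind would be picked from the function's size attributes when querying TTI:

```cpp
#include "llvm/Analysis/TargetTransformInfo.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/Instruction.h"

using namespace llvm;

// Hypothetical helper, not the actual LoopVectorize code: query TTI with
// TCK_CodeSize when the function is built for size (-Os/-Oz), and with
// TCK_RecipThroughput otherwise.
static InstructionCost costInstruction(const TargetTransformInfo &TTI,
                                       const Instruction &I) {
  TargetTransformInfo::TargetCostKind Kind =
      I.getFunction()->hasOptSize() ? TargetTransformInfo::TCK_CodeSize
                                    : TargetTransformInfo::TCK_RecipThroughput;
  return TTI.getInstructionCost(&I, Kind);
}
```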
