[LoopVectorize] Don't discount instructions scalarized due to tail folding #109289

Open · wants to merge 6 commits into main

Conversation

john-brawn-arm (Collaborator)

When an instruction is scalarized due to tail folding, the cost we calculate (which is then returned by getWideningCost and thus getInstructionCost) is already the scalarized cost, so computePredInstDiscount shouldn't apply a scalarization discount to these instructions.

Fixes #66652

llvmbot (Member) commented Sep 19, 2024

@llvm/pr-subscribers-llvm-analysis
@llvm/pr-subscribers-backend-powerpc

@llvm/pr-subscribers-llvm-transforms

Author: John Brawn (john-brawn-arm)

Changes

When an instruction is scalarized due to tail folding, the cost we calculate (which is then returned by getWideningCost and thus getInstructionCost) is already the scalarized cost, so computePredInstDiscount shouldn't apply a scalarization discount to these instructions.

Fixes #66652


Patch is 87.68 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/109289.diff

5 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+8-4)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll (+43-302)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/induction-costs-sve.ll (+12-304)
  • (modified) llvm/test/Transforms/LoopVectorize/PowerPC/vplan-force-tail-with-evl.ll (+24-6)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/small-size.ll (+182-119)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 107fb38be31969..c64f8ccb3a84f7 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -5501,10 +5501,14 @@ InstructionCost LoopVectorizationCostModel::computePredInstDiscount(
     // Scale the total scalar cost by block probability.
     ScalarCost /= getReciprocalPredBlockProb();
 
-    // Compute the discount. A non-negative discount means the vector version
-    // of the instruction costs more, and scalarizing would be beneficial.
-    Discount += VectorCost - ScalarCost;
-    ScalarCosts[I] = ScalarCost;
+    // Compute the discount, unless this instruction must be scalarized due to
+    // tail folding, as then the vector cost is already the scalar cost. A
+    // non-negative discount means the vector version of the instruction costs
+    // more, and scalarizing would be beneficial.
+    if (!foldTailByMasking() || getWideningDecision(I, VF) != CM_Scalarize) {
+      Discount += VectorCost - ScalarCost;
+      ScalarCosts[I] = ScalarCost;
+    }
   }
 
   return Discount;
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll b/llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll
index 9910be7224674c..de1cae7383f863 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll
@@ -529,93 +529,14 @@ define void @latch_branch_cost(ptr %dst) {
 ; PRED-LABEL: define void @latch_branch_cost(
 ; PRED-SAME: ptr [[DST:%.*]]) {
 ; PRED-NEXT:  entry:
-; PRED-NEXT:    br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
-; PRED:       vector.ph:
-; PRED-NEXT:    br label [[VECTOR_BODY:%.*]]
-; PRED:       vector.body:
-; PRED-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[PRED_STORE_CONTINUE6:%.*]] ]
-; PRED-NEXT:    [[VEC_IND:%.*]] = phi <8 x i64> [ <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7>, [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[PRED_STORE_CONTINUE6]] ]
-; PRED-NEXT:    [[TMP0:%.*]] = icmp ule <8 x i64> [[VEC_IND]], <i64 99, i64 99, i64 99, i64 99, i64 99, i64 99, i64 99, i64 99>
-; PRED-NEXT:    [[TMP1:%.*]] = extractelement <8 x i1> [[TMP0]], i32 0
-; PRED-NEXT:    br i1 [[TMP1]], label [[PRED_STORE_IF:%.*]], label [[PRED_STORE_CONTINUE:%.*]]
-; PRED:       pred.store.if:
-; PRED-NEXT:    [[TMP2:%.*]] = add i64 [[INDEX]], 0
-; PRED-NEXT:    [[TMP3:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP2]]
-; PRED-NEXT:    store i8 0, ptr [[TMP3]], align 1
-; PRED-NEXT:    br label [[PRED_STORE_CONTINUE]]
-; PRED:       pred.store.continue:
-; PRED-NEXT:    [[TMP4:%.*]] = extractelement <8 x i1> [[TMP0]], i32 1
-; PRED-NEXT:    br i1 [[TMP4]], label [[PRED_STORE_IF1:%.*]], label [[PRED_STORE_CONTINUE2:%.*]]
-; PRED:       pred.store.if1:
-; PRED-NEXT:    [[TMP5:%.*]] = add i64 [[INDEX]], 1
-; PRED-NEXT:    [[TMP6:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP5]]
-; PRED-NEXT:    store i8 0, ptr [[TMP6]], align 1
-; PRED-NEXT:    br label [[PRED_STORE_CONTINUE2]]
-; PRED:       pred.store.continue2:
-; PRED-NEXT:    [[TMP7:%.*]] = extractelement <8 x i1> [[TMP0]], i32 2
-; PRED-NEXT:    br i1 [[TMP7]], label [[PRED_STORE_IF3:%.*]], label [[PRED_STORE_CONTINUE4:%.*]]
-; PRED:       pred.store.if3:
-; PRED-NEXT:    [[TMP8:%.*]] = add i64 [[INDEX]], 2
-; PRED-NEXT:    [[TMP9:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP8]]
-; PRED-NEXT:    store i8 0, ptr [[TMP9]], align 1
-; PRED-NEXT:    br label [[PRED_STORE_CONTINUE4]]
-; PRED:       pred.store.continue4:
-; PRED-NEXT:    [[TMP10:%.*]] = extractelement <8 x i1> [[TMP0]], i32 3
-; PRED-NEXT:    br i1 [[TMP10]], label [[PRED_STORE_IF5:%.*]], label [[PRED_STORE_CONTINUE7:%.*]]
-; PRED:       pred.store.if5:
-; PRED-NEXT:    [[TMP11:%.*]] = add i64 [[INDEX]], 3
-; PRED-NEXT:    [[TMP12:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP11]]
-; PRED-NEXT:    store i8 0, ptr [[TMP12]], align 1
-; PRED-NEXT:    br label [[PRED_STORE_CONTINUE7]]
-; PRED:       pred.store.continue6:
-; PRED-NEXT:    [[TMP13:%.*]] = extractelement <8 x i1> [[TMP0]], i32 4
-; PRED-NEXT:    br i1 [[TMP13]], label [[PRED_STORE_IF7:%.*]], label [[PRED_STORE_CONTINUE8:%.*]]
-; PRED:       pred.store.if7:
-; PRED-NEXT:    [[TMP14:%.*]] = add i64 [[INDEX]], 4
-; PRED-NEXT:    [[TMP15:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP14]]
-; PRED-NEXT:    store i8 0, ptr [[TMP15]], align 1
-; PRED-NEXT:    br label [[PRED_STORE_CONTINUE8]]
-; PRED:       pred.store.continue8:
-; PRED-NEXT:    [[TMP16:%.*]] = extractelement <8 x i1> [[TMP0]], i32 5
-; PRED-NEXT:    br i1 [[TMP16]], label [[PRED_STORE_IF9:%.*]], label [[PRED_STORE_CONTINUE10:%.*]]
-; PRED:       pred.store.if9:
-; PRED-NEXT:    [[TMP17:%.*]] = add i64 [[INDEX]], 5
-; PRED-NEXT:    [[TMP18:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP17]]
-; PRED-NEXT:    store i8 0, ptr [[TMP18]], align 1
-; PRED-NEXT:    br label [[PRED_STORE_CONTINUE10]]
-; PRED:       pred.store.continue10:
-; PRED-NEXT:    [[TMP19:%.*]] = extractelement <8 x i1> [[TMP0]], i32 6
-; PRED-NEXT:    br i1 [[TMP19]], label [[PRED_STORE_IF11:%.*]], label [[PRED_STORE_CONTINUE12:%.*]]
-; PRED:       pred.store.if11:
-; PRED-NEXT:    [[TMP20:%.*]] = add i64 [[INDEX]], 6
-; PRED-NEXT:    [[TMP21:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP20]]
-; PRED-NEXT:    store i8 0, ptr [[TMP21]], align 1
-; PRED-NEXT:    br label [[PRED_STORE_CONTINUE12]]
-; PRED:       pred.store.continue12:
-; PRED-NEXT:    [[TMP22:%.*]] = extractelement <8 x i1> [[TMP0]], i32 7
-; PRED-NEXT:    br i1 [[TMP22]], label [[PRED_STORE_IF13:%.*]], label [[PRED_STORE_CONTINUE6]]
-; PRED:       pred.store.if13:
-; PRED-NEXT:    [[TMP23:%.*]] = add i64 [[INDEX]], 7
-; PRED-NEXT:    [[TMP24:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP23]]
-; PRED-NEXT:    store i8 0, ptr [[TMP24]], align 1
-; PRED-NEXT:    br label [[PRED_STORE_CONTINUE6]]
-; PRED:       pred.store.continue14:
-; PRED-NEXT:    [[INDEX_NEXT]] = add i64 [[INDEX]], 8
-; PRED-NEXT:    [[VEC_IND_NEXT]] = add <8 x i64> [[VEC_IND]], <i64 8, i64 8, i64 8, i64 8, i64 8, i64 8, i64 8, i64 8>
-; PRED-NEXT:    [[TMP25:%.*]] = icmp eq i64 [[INDEX_NEXT]], 104
-; PRED-NEXT:    br i1 [[TMP25]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
-; PRED:       middle.block:
-; PRED-NEXT:    br i1 true, label [[FOR_END:%.*]], label [[SCALAR_PH]]
-; PRED:       scalar.ph:
-; PRED-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i64 [ 104, [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
 ; PRED-NEXT:    br label [[FOR_BODY:%.*]]
 ; PRED:       loop:
-; PRED-NEXT:    [[IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ]
+; PRED-NEXT:    [[IV:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ]
 ; PRED-NEXT:    [[GEP:%.*]] = getelementptr i8, ptr [[DST]], i64 [[IV]]
 ; PRED-NEXT:    store i8 0, ptr [[GEP]], align 1
 ; PRED-NEXT:    [[INDVARS_IV_NEXT]] = add i64 [[IV]], 1
 ; PRED-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], 100
-; PRED-NEXT:    br i1 [[EXITCOND_NOT]], label [[FOR_END]], label [[FOR_BODY]], !llvm.loop [[LOOP5:![0-9]+]]
+; PRED-NEXT:    br i1 [[EXITCOND_NOT]], label [[EXIT:%.*]], label [[FOR_BODY]]
 ; PRED:       exit:
 ; PRED-NEXT:    ret void
 ;
@@ -800,27 +721,27 @@ define i32 @header_mask_and_invariant_compare(ptr %A, ptr %B, ptr %C, ptr %D, pt
 ; PRED-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
 ; PRED-NEXT:    [[ACTIVE_LANE_MASK:%.*]] = phi <vscale x 4 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], [[VECTOR_PH]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], [[VECTOR_BODY]] ]
 ; PRED-NEXT:    [[TMP15:%.*]] = add i64 [[INDEX]], 0
-; PRED-NEXT:    [[TMP16:%.*]] = load i32, ptr [[A]], align 4, !alias.scope [[META6:![0-9]+]]
+; PRED-NEXT:    [[TMP16:%.*]] = load i32, ptr [[A]], align 4, !alias.scope [[META4:![0-9]+]]
 ; PRED-NEXT:    [[BROADCAST_SPLATINSERT28:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[TMP16]], i64 0
 ; PRED-NEXT:    [[BROADCAST_SPLAT29:%.*]] = shufflevector <vscale x 4 x i32> [[BROADCAST_SPLATINSERT28]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
-; PRED-NEXT:    [[TMP17:%.*]] = load i32, ptr [[B]], align 4, !alias.scope [[META9:![0-9]+]]
+; PRED-NEXT:    [[TMP17:%.*]] = load i32, ptr [[B]], align 4, !alias.scope [[META7:![0-9]+]]
 ; PRED-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[TMP17]], i64 0
 ; PRED-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 4 x i32> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
 ; PRED-NEXT:    [[TMP18:%.*]] = or <vscale x 4 x i32> [[BROADCAST_SPLAT]], [[BROADCAST_SPLAT29]]
-; PRED-NEXT:    [[TMP19:%.*]] = load i32, ptr [[C]], align 4, !alias.scope [[META11:![0-9]+]]
+; PRED-NEXT:    [[TMP19:%.*]] = load i32, ptr [[C]], align 4, !alias.scope [[META9:![0-9]+]]
 ; PRED-NEXT:    [[BROADCAST_SPLATINSERT30:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[TMP19]], i64 0
 ; PRED-NEXT:    [[BROADCAST_SPLAT31:%.*]] = shufflevector <vscale x 4 x i32> [[BROADCAST_SPLATINSERT30]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
 ; PRED-NEXT:    [[TMP20:%.*]] = icmp ugt <vscale x 4 x i32> [[BROADCAST_SPLAT31]], [[TMP18]]
 ; PRED-NEXT:    [[TMP21:%.*]] = select <vscale x 4 x i1> [[ACTIVE_LANE_MASK]], <vscale x 4 x i1> [[TMP20]], <vscale x 4 x i1> zeroinitializer
 ; PRED-NEXT:    [[TMP22:%.*]] = getelementptr i32, ptr [[D]], i64 [[TMP15]]
-; PRED-NEXT:    call void @llvm.masked.scatter.nxv4i32.nxv4p0(<vscale x 4 x i32> [[TMP18]], <vscale x 4 x ptr> [[BROADCAST_SPLAT33]], i32 4, <vscale x 4 x i1> [[TMP21]]), !alias.scope [[META13:![0-9]+]], !noalias [[META15:![0-9]+]]
+; PRED-NEXT:    call void @llvm.masked.scatter.nxv4i32.nxv4p0(<vscale x 4 x i32> [[TMP18]], <vscale x 4 x ptr> [[BROADCAST_SPLAT33]], i32 4, <vscale x 4 x i1> [[TMP21]]), !alias.scope [[META11:![0-9]+]], !noalias [[META13:![0-9]+]]
 ; PRED-NEXT:    [[TMP23:%.*]] = getelementptr i32, ptr [[TMP22]], i32 0
-; PRED-NEXT:    call void @llvm.masked.store.nxv4i32.p0(<vscale x 4 x i32> zeroinitializer, ptr [[TMP23]], i32 4, <vscale x 4 x i1> [[TMP21]]), !alias.scope [[META17:![0-9]+]], !noalias [[META18:![0-9]+]]
+; PRED-NEXT:    call void @llvm.masked.store.nxv4i32.p0(<vscale x 4 x i32> zeroinitializer, ptr [[TMP23]], i32 4, <vscale x 4 x i1> [[TMP21]]), !alias.scope [[META15:![0-9]+]], !noalias [[META16:![0-9]+]]
 ; PRED-NEXT:    [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP9]]
 ; PRED-NEXT:    [[ACTIVE_LANE_MASK_NEXT]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[INDEX]], i64 [[TMP14]])
 ; PRED-NEXT:    [[TMP24:%.*]] = xor <vscale x 4 x i1> [[ACTIVE_LANE_MASK_NEXT]], shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i64 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer)
 ; PRED-NEXT:    [[TMP25:%.*]] = extractelement <vscale x 4 x i1> [[TMP24]], i32 0
-; PRED-NEXT:    br i1 [[TMP25]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP19:![0-9]+]]
+; PRED-NEXT:    br i1 [[TMP25]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP17:![0-9]+]]
 ; PRED:       middle.block:
 ; PRED-NEXT:    br i1 true, label [[EXIT:%.*]], label [[SCALAR_PH]]
 ; PRED:       scalar.ph:
@@ -842,7 +763,7 @@ define i32 @header_mask_and_invariant_compare(ptr %A, ptr %B, ptr %C, ptr %D, pt
 ; PRED:       loop.latch:
 ; PRED-NEXT:    [[IV_NEXT]] = add i64 [[IV]], 1
 ; PRED-NEXT:    [[C_1:%.*]] = icmp eq i64 [[IV]], [[N]]
-; PRED-NEXT:    br i1 [[C_1]], label [[EXIT]], label [[LOOP_HEADER]], !llvm.loop [[LOOP20:![0-9]+]]
+; PRED-NEXT:    br i1 [[C_1]], label [[EXIT]], label [[LOOP_HEADER]], !llvm.loop [[LOOP18:![0-9]+]]
 ; PRED:       exit:
 ; PRED-NEXT:    ret i32 0
 ;
@@ -959,7 +880,7 @@ define void @multiple_exit_conditions(ptr %src, ptr noalias %dst) #1 {
 ; PRED-NEXT:    [[ACTIVE_LANE_MASK_NEXT]] = call <vscale x 2 x i1> @llvm.get.active.lane.mask.nxv2i1.i64(i64 [[INDEX]], i64 [[TMP10]])
 ; PRED-NEXT:    [[TMP16:%.*]] = xor <vscale x 2 x i1> [[ACTIVE_LANE_MASK_NEXT]], shufflevector (<vscale x 2 x i1> insertelement (<vscale x 2 x i1> poison, i1 true, i64 0), <vscale x 2 x i1> poison, <vscale x 2 x i32> zeroinitializer)
 ; PRED-NEXT:    [[TMP17:%.*]] = extractelement <vscale x 2 x i1> [[TMP16]], i32 0
-; PRED-NEXT:    br i1 [[TMP17]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP21:![0-9]+]]
+; PRED-NEXT:    br i1 [[TMP17]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP19:![0-9]+]]
 ; PRED:       middle.block:
 ; PRED-NEXT:    br i1 true, label [[EXIT:%.*]], label [[SCALAR_PH]]
 ; PRED:       scalar.ph:
@@ -979,7 +900,7 @@ define void @multiple_exit_conditions(ptr %src, ptr noalias %dst) #1 {
 ; PRED-NEXT:    [[PTR_IV_NEXT]] = getelementptr i8, ptr [[PTR_IV]], i64 8
 ; PRED-NEXT:    [[IV_CLAMP:%.*]] = and i64 [[IV]], 4294967294
 ; PRED-NEXT:    [[EC:%.*]] = icmp eq i64 [[IV_CLAMP]], 512
-; PRED-NEXT:    br i1 [[EC]], label [[EXIT]], label [[LOOP]], !llvm.loop [[LOOP22:![0-9]+]]
+; PRED-NEXT:    br i1 [[EC]], label [[EXIT]], label [[LOOP]], !llvm.loop [[LOOP20:![0-9]+]]
 ; PRED:       exit:
 ; PRED-NEXT:    ret void
 ;
@@ -1003,208 +924,34 @@ exit:
   ret void
 }
 
-define void @low_trip_count_fold_tail_scalarized_store(ptr %dst) {
-; DEFAULT-LABEL: define void @low_trip_count_fold_tail_scalarized_store(
+define void @low_trip_count_store(ptr %dst) {
+; DEFAULT-LABEL: define void @low_trip_count_store(
 ; DEFAULT-SAME: ptr [[DST:%.*]]) {
 ; DEFAULT-NEXT:  entry:
-; DEFAULT-NEXT:    br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
-; DEFAULT:       vector.ph:
-; DEFAULT-NEXT:    br label [[VECTOR_BODY:%.*]]
-; DEFAULT:       vector.body:
-; DEFAULT-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[PRED_STORE_CONTINUE14:%.*]] ]
-; DEFAULT-NEXT:    [[VEC_IND:%.*]] = phi <8 x i64> [ <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7>, [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[PRED_STORE_CONTINUE14]] ]
-; DEFAULT-NEXT:    [[TMP0:%.*]] = trunc i64 [[INDEX]] to i8
-; DEFAULT-NEXT:    [[TMP1:%.*]] = icmp ule <8 x i64> [[VEC_IND]], <i64 6, i64 6, i64 6, i64 6, i64 6, i64 6, i64 6, i64 6>
-; DEFAULT-NEXT:    [[TMP2:%.*]] = extractelement <8 x i1> [[TMP1]], i32 0
-; DEFAULT-NEXT:    br i1 [[TMP2]], label [[PRED_STORE_IF:%.*]], label [[PRED_STORE_CONTINUE:%.*]]
-; DEFAULT:       pred.store.if:
-; DEFAULT-NEXT:    [[TMP3:%.*]] = add i64 [[INDEX]], 0
-; DEFAULT-NEXT:    [[TMP4:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP3]]
-; DEFAULT-NEXT:    [[TMP5:%.*]] = add i8 [[TMP0]], 0
-; DEFAULT-NEXT:    store i8 [[TMP5]], ptr [[TMP4]], align 1
-; DEFAULT-NEXT:    br label [[PRED_STORE_CONTINUE]]
-; DEFAULT:       pred.store.continue:
-; DEFAULT-NEXT:    [[TMP6:%.*]] = extractelement <8 x i1> [[TMP1]], i32 1
-; DEFAULT-NEXT:    br i1 [[TMP6]], label [[PRED_STORE_IF1:%.*]], label [[PRED_STORE_CONTINUE2:%.*]]
-; DEFAULT:       pred.store.if1:
-; DEFAULT-NEXT:    [[TMP7:%.*]] = add i64 [[INDEX]], 1
-; DEFAULT-NEXT:    [[TMP8:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP7]]
-; DEFAULT-NEXT:    [[TMP9:%.*]] = add i8 [[TMP0]], 1
-; DEFAULT-NEXT:    store i8 [[TMP9]], ptr [[TMP8]], align 1
-; DEFAULT-NEXT:    br label [[PRED_STORE_CONTINUE2]]
-; DEFAULT:       pred.store.continue2:
-; DEFAULT-NEXT:    [[TMP10:%.*]] = extractelement <8 x i1> [[TMP1]], i32 2
-; DEFAULT-NEXT:    br i1 [[TMP10]], label [[PRED_STORE_IF3:%.*]], label [[PRED_STORE_CONTINUE4:%.*]]
-; DEFAULT:       pred.store.if3:
-; DEFAULT-NEXT:    [[TMP11:%.*]] = add i64 [[INDEX]], 2
-; DEFAULT-NEXT:    [[TMP12:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP11]]
-; DEFAULT-NEXT:    [[TMP13:%.*]] = add i8 [[TMP0]], 2
-; DEFAULT-NEXT:    store i8 [[TMP13]], ptr [[TMP12]], align 1
-; DEFAULT-NEXT:    br label [[PRED_STORE_CONTINUE4]]
-; DEFAULT:       pred.store.continue4:
-; DEFAULT-NEXT:    [[TMP14:%.*]] = extractelement <8 x i1> [[TMP1]], i32 3
-; DEFAULT-NEXT:    br i1 [[TMP14]], label [[PRED_STORE_IF5:%.*]], label [[PRED_STORE_CONTINUE6:%.*]]
-; DEFAULT:       pred.store.if5:
-; DEFAULT-NEXT:    [[TMP15:%.*]] = add i64 [[INDEX]], 3
-; DEFAULT-NEXT:    [[TMP16:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP15]]
-; DEFAULT-NEXT:    [[TMP17:%.*]] = add i8 [[TMP0]], 3
-; DEFAULT-NEXT:    store i8 [[TMP17]], ptr [[TMP16]], align 1
-; DEFAULT-NEXT:    br label [[PRED_STORE_CONTINUE6]]
-; DEFAULT:       pred.store.continue6:
-; DEFAULT-NEXT:    [[TMP18:%.*]] = extractelement <8 x i1> [[TMP1]], i32 4
-; DEFAULT-NEXT:    br i1 [[TMP18]], label [[PRED_STORE_IF7:%.*]], label [[PRED_STORE_CONTINUE8:%.*]]
-; DEFAULT:       pred.store.if7:
-; DEFAULT-NEXT:    [[TMP19:%.*]] = add i64 [[INDEX]], 4
-; DEFAULT-NEXT:    [[TMP20:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP19]]
-; DEFAULT-NEXT:    [[TMP21:%.*]] = add i8 [[TMP0]], 4
-; DEFAULT-NEXT:    store i8 [[TMP21]], ptr [[TMP20]], align 1
-; DEFAULT-NEXT:    br label [[PRED_STORE_CONTINUE8]]
-; DEFAULT:       pred.store.continue8:
-; DEFAULT-NEXT:    [[TMP22:%.*]] = extractelement <8 x i1> [[TMP1]], i32 5
-; DEFAULT-NEXT:    br i1 [[TMP22]], label [[PRED_STORE_IF9:%.*]], label [[PRED_STORE_CONTINUE10:%.*]]
-; DEFAULT:       pred.store.if9:
-; DEFAULT-NEXT:    [[TMP23:%.*]] = add i64 [[INDEX]], 5
-; DEFAULT-NEXT:    [[TMP24:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP23]]
-; DEFAULT-NEXT:    [[TMP25:%.*]] = add i8 [[TMP0]], 5
-; DEFAULT-NEXT:    store i8 [[TMP25]], ptr [[TMP24]], align 1
-; DEFAULT-NEXT:    br label [[PRED_STORE_CONTINUE10]]
-; DEFAULT:       pred.store.continue10:
-; DEFAULT-NEXT:    [[TMP26:%.*]] = extractelement <8 x i1> [[TMP1]], i32 6
-; DEFAULT-NEXT:    br i1 [[TMP26]], label [[PRED_STORE_IF11:%.*]], label [[PRED_STORE_CONTINUE12:%.*]]
-; DEFAULT:       pred.store.if11:
-; DEFAULT-NEXT:    [[TMP27:%.*]] = add i64 [[INDEX]], 6
-; DEFAULT-NEXT:    [[TMP28:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP27]]
-; DEFAULT-NEXT:    [[TMP29:%.*]] = add i8 [[TMP0]], 6
-; DEFAULT-NEXT:    store i8 [[TMP29]], ptr [[TMP28]], align 1
-; DEFAULT-NEXT:    br label [[PRED_STORE_CONTINUE12]]
-; DEFAULT:       pred.store.continue12:
-; DEFAULT-NEXT:    [[TMP30:%.*]] = extractelement <8 x i1> [[TMP1]], i32 7
-; DEFAULT-NEXT:    br i1 [[TMP30]], label [[PRED_STORE_IF13:%.*]], label [[PRED_STORE_CONTINUE14]]
-; DEFAULT:       pred.store.if13:
-; DEFAULT-NEXT:    [[TMP31:%.*]] = add i64 [[INDEX]], 7
-; DEFAULT-NEXT:    [[TMP32:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP31]]
-; DEFAULT-NEXT:    [[TMP33:%.*]] = add i8 [[TMP0]], 7
-; DEFAULT-NEXT:    store i8 [[TMP33]], ptr [[TMP32]], align 1
-; DEFAULT-NEXT:    br label [[PRED_STORE_CONTINUE14]]
-; DEFAULT:       pred.store.continue14:
-; DEFAULT-NEXT:    [[VEC_IND_NEXT]] = add <8 x i64> [[VEC_IND]], <i64 8, i64 8, i64 8, i64 8, i64 8, i64 8, i64 8, i64 8>
-; DEFAULT-NEXT:    [[INDEX_NEXT]] = add i64 [[INDEX]], 8
-; DEFAULT-NEXT:    br i1 true, label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP26:![0-9]+]]
-; DEFAULT:       middle.block:
-; DEFAULT-NEXT:    br i1 true, label [[EXIT:%.*]], label [[SCALAR_PH]]
-; DEFAULT:       scalar.ph:
-; DEFAULT-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i64 [ 8, [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
 ; DEFAULT-NEXT:    br label [[LOOP:%.*]]
 ; DEFAULT:       loop:
-; DEFAULT-NEXT:    [[IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], [[LOOP]] ]
+; DEFAULT-NEXT:    [[IV:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[IV_NEXT:%.*]], [[LOOP]] ]
 ; DEFAULT-NEXT:    [[IV_TRUNC:%.*]] = trunc i64 [[IV]] to i8
 ; DEFAULT-NEXT:    [[GEP:%.*]] = getelementptr i8, ptr [[DST]], i64 [[IV]]
 ; DEFAULT-NEXT:    store i8 [[IV_TRUNC]], ptr [[GEP]], align 1
 ; DEFAULT-NEXT:    [[IV_NEXT]] = add i64 [[IV]], 1
 ; DEFAULT-NEXT:    [[EC:%.*]] = icmp eq i64 [[IV_NEXT]], 7
-; DEFAULT-NEXT:    br i1 [[EC]], label [[EXIT]], label [[LOOP]], !llvm.loop [[LOOP27:![0-9]+]]
+; DEFAULT-NEXT:    br i1 [[EC]], label [[EXIT:%.*]], label [[LOOP]]
 ; DEFAULT:       exit:
 ; DEFAULT-NEXT:    ret void
 ;
-; PRED-LABEL: define void @low_trip_count_fold_tail_scalarized_store(
+; PRED-LABEL: define void @low_trip_count_store(
 ; PRED-SAME: ptr [[DST:%.*]]) {
 ; PRED-NEXT:  entry:
-; PRED-NEXT:    br i1 false, label [[SCALAR_PH...
[truncated]

; CHECK-NEXT: EMIT vp<[[CMP:%.+]]> = icmp ule ir<%iv>, vp<[[BTC]]>
; CHECK-NEXT: Successor(s): pred.load
Contributor:

At first glance this looks worse than before, unless I'm missing something. It looks like previously we were reusing the same predicated blocks to perform both the load and store, i.e. something like

  %cmp = icmp ... <4 x i32>
  %lane0 = extractelement <4 x i32> %cmp, i32 0
  br i1 %lane0, label %block1.if, label %block1.continue

%block1.if:
  .. do load ..
  .. do store ..
...

whereas now we've essentially split out the loads and stores with duplicate control flow, i.e.

  %cmp = icmp ... <4 x i32>
  %lane0 = extractelement <4 x i32> %cmp, i32 0
  br i1 %lane0, label %block1.load.if, label %block1.load.continue

%block1.load.if:
  .. do load ..
...

%stores:
  %lane0.1 = extractelement <4 x i32> %cmp, i32 0
  br i1 %lane0.1, label %block1.store.if, label %block1.store.continue
...

I'd expect the extra control flow to hurt performance.

Collaborator Author:

This test is a bit strange, as both before and after this patch this VPlan isn't used: the cost model decides the scalar version costs less. As to why it decides this VPlan is the best for this vector length, computePredInstDiscount doesn't currently account for any control-flow instructions it could remove by scalarizing everything. I don't know how it would do that, though, as it's making a decision about whether to scalarize the chain of instructions feeding into a store, and the choice to scalarize the loads happens elsewhere.

Collaborator Author:

Perhaps the best thing to do here is to change the test to check that it's making the correct decision, which is not to vectorize.

// tail folding, as then the vector cost is already the scalar cost. A
// non-negative discount means the vector version of the instruction costs
// more, and scalarizing would be beneficial.
if (!foldTailByMasking() || getWideningDecision(I, VF) != CM_Scalarize) {
Contributor:

I must admit I'm not too familiar with this code and I need some time to understand what effect this change has on the costs, but I'll take a deeper look next week!

Contributor:

OK so this patch is saying that if a block needs predication due to tail-folding and we've already decided to scalarise the instruction for the vector VF, then we shouldn't apply a discount. However, it feels like there are two problems with this:

  1. I don't see why we should restrict this to tail-folding only. If we've also made the widening decision to scalarise for non-tail-folded loops then surely we'd also not want to calculate the discount?
  2. In theory the discount should really be close to zero if we've made the decision to scalarise. Unless I've misunderstood something, it seems like a more fundamental problem here is why VectorCost is larger than ScalarCost for the scenario you're interested in? Perhaps ScalarCost is too low?

I think it would be helpful to add some cost model tests to this patch that have some debug output showing how the costs for each VF change. What seems to be happening with this change is that we're now reporting higher loop costs for VF > 1, and this is leading to the decision not to vectorise at all.

Collaborator Author:

  1. I don't see why we should restrict this to tail-folding only. If we've also made the widening decision to scalarise for non-tail-folded loops then surely we'd also not want to calculate the discount?

This is possibly true, but I tried that and there are a lot more test changes as a result, and briefly looking at them it wasn't immediately obvious whether they were better or worse.

  2. In theory the discount should really be close to zero if we've made the decision to scalarise. Unless I've misunderstood something, it seems like a more fundamental problem here is why VectorCost is larger than ScalarCost for the scenario you're interested in? Perhaps ScalarCost is too low?

Looking at low_trip_count_store in llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll, the current sequence of events (for a vector length of 4) is:

  • getMemInstScalarizationCost is called on the store, which needs to be scalarized and predicated because of tail masking, and calculates a cost of 20, which includes the cost of doing the compare and branch into the predicated block.
  • computePredInstDiscount uses this as the VectorCost, as it uses getInstructionCost, which for scalarized instructions returns the cost value in InstsToScalarize.
  • The calculation of ScalarCost assumes that the predicated block already exists, so it just calculates the cost of moving the instruction into the predicated block and scalarizing it there, which it calculates as 4 (for 4 copies of the store).
  • This results in a discount of 16, causing 20-16=4 to be used as the cost of the scalarized store.

ScalarCost is too low for the scalarized store, because it's assuming that the predicated block already exists, but the cost of the predicated block is exactly what we need to take into account to avoid pointless tail folding by masking.
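
To make that arithmetic concrete, here is a small standalone sketch of the numbers described above (plain integers and illustrative names only; this is not the actual LLVM code):

#include <cassert>

int main() {
  // Numbers from the VF=4 walkthrough above.
  int VectorCost = 20; // getMemInstScalarizationCost: scalarized store plus the
                       // compare/branch needed to form each predicated block
  int ScalarCost = 4;  // computePredInstDiscount's estimate, which assumes the
                       // predicated blocks already exist
  int Discount = VectorCost - ScalarCost; // 16
  int CostUsed = VectorCost - Discount;   // 20 - 16 = 4, far below the real cost
  assert(Discount == 16 && CostUsed == 4);
  return 0;
}

With the patch, the discount is simply not applied to instructions scalarized due to tail folding, so the full cost of 20 is kept.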

I think it would be helpful to add some cost model tests to this patch that have some debug output showing how the costs for each VF change. What seems to be happening with this change is that we're now reporting higher loop costs for VF > 1, and this is leading to the decision not to vectorise at all.

I'll add these tests.

; CHECK: LV: Vector loop of width 4 costs: 4.
; CHECK: LV: Found an estimated cost of 0 for VF 8 For instruction: %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
; CHECK: LV: Found an estimated cost of 0 for VF 8 For instruction: %gep = getelementptr i8, ptr %dst, i64 %iv
; CHECK: LV: Found an estimated cost of 32 for VF 8 For instruction: store i8 1, ptr %gep, align 1
Contributor:

Thanks for adding the test! However, I still don't know how your patch affects the cost model. For example, what were the costs before your change since that helps to understand why this patch now chooses VF=1. I suspect what your change does is increase the effective cost of the store by saying that we shouldn't scalarise it. However, my concern here is that we might be fixing this the wrong way. For example, it sounds like we could achieve the same effect if in computePredInstDiscount we increased the scalarisation cost to match the vector cost, so that the discount disappears?

Let's suppose we force the vectoriser to choose VF=8 and in one scenario we choose to scalarise, and in the other we vectorise what would the end result look like? If the generated code looks equally bad in both cases then I'd expect the vector cost and scalar cost to be the same. I worry that we're effectively hiding a bug in the cost model by simply bypassing it for tail-folded loops, however the problem may still remain for normal loops with control flow.

Collaborator Author:

Thanks for adding the test! However, I still don't know how your patch affects the cost model. For example, what were the costs before your change since that helps to understand why this patch now chooses VF=1

Current costs, without this patch:

LV: Found an estimated cost of 0 for VF 1 For instruction:   %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
LV: Found an estimated cost of 0 for VF 1 For instruction:   %iv.trunc = trunc i64 %iv to i8
LV: Found an estimated cost of 0 for VF 1 For instruction:   %gep = getelementptr i8, ptr %dst, i64 %iv
LV: Found an estimated cost of 2 for VF 1 For instruction:   store i8 %iv.trunc, ptr %gep, align 1
LV: Found an estimated cost of 1 for VF 1 For instruction:   %iv.next = add i64 %iv, 1
LV: Found an estimated cost of 1 for VF 1 For instruction:   %ec = icmp eq i64 %iv.next, 7
LV: Found an estimated cost of 0 for VF 1 For instruction:   br i1 %ec, label %exit, label %loop
LV: Scalar loop costs: 4.
LV: Found an estimated cost of 0 for VF 2 For instruction:   %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
LV: Found an estimated cost of 2 for VF 2 For instruction:   %iv.trunc = trunc i64 %iv to i8
LV: Found an estimated cost of 0 for VF 2 For instruction:   %gep = getelementptr i8, ptr %dst, i64 %iv
LV: Found an estimated cost of 2 for VF 2 For instruction:   store i8 %iv.trunc, ptr %gep, align 1
LV: Found an estimated cost of 1 for VF 2 For instruction:   %iv.next = add i64 %iv, 1
LV: Found an estimated cost of 1 for VF 2 For instruction:   %ec = icmp eq i64 %iv.next, 7
LV: Found an estimated cost of 0 for VF 2 For instruction:   br i1 %ec, label %exit, label %loop
LV: Vector loop of width 2 costs: 3.
LV: Found an estimated cost of 0 for VF 4 For instruction:   %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
LV: Found an estimated cost of 4 for VF 4 For instruction:   %iv.trunc = trunc i64 %iv to i8
LV: Found an estimated cost of 0 for VF 4 For instruction:   %gep = getelementptr i8, ptr %dst, i64 %iv
LV: Found an estimated cost of 4 for VF 4 For instruction:   store i8 %iv.trunc, ptr %gep, align 1
LV: Found an estimated cost of 2 for VF 4 For instruction:   %iv.next = add i64 %iv, 1
LV: Found an estimated cost of 1 for VF 4 For instruction:   %ec = icmp eq i64 %iv.next, 7
LV: Found an estimated cost of 0 for VF 4 For instruction:   br i1 %ec, label %exit, label %loop
LV: Vector loop of width 4 costs: 2.
LV: Found an estimated cost of 0 for VF 8 For instruction:   %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
LV: Found an estimated cost of 8 for VF 8 For instruction:   %iv.trunc = trunc i64 %iv to i8
LV: Found an estimated cost of 0 for VF 8 For instruction:   %gep = getelementptr i8, ptr %dst, i64 %iv
LV: Found an estimated cost of 8 for VF 8 For instruction:   store i8 %iv.trunc, ptr %gep, align 1
LV: Found an estimated cost of 4 for VF 8 For instruction:   %iv.next = add i64 %iv, 1
LV: Found an estimated cost of 1 for VF 8 For instruction:   %ec = icmp eq i64 %iv.next, 7
LV: Found an estimated cost of 0 for VF 8 For instruction:   br i1 %ec, label %exit, label %loop
LV: Vector loop of width 8 costs: 2.
LV: Found an estimated cost of 0 for VF 16 For instruction:   %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
LV: Found an estimated cost of 16 for VF 16 For instruction:   %iv.trunc = trunc i64 %iv to i8
LV: Found an estimated cost of 0 for VF 16 For instruction:   %gep = getelementptr i8, ptr %dst, i64 %iv
LV: Found an estimated cost of 16 for VF 16 For instruction:   store i8 %iv.trunc, ptr %gep, align 1
LV: Found an estimated cost of 8 for VF 16 For instruction:   %iv.next = add i64 %iv, 1
LV: Found an estimated cost of 1 for VF 16 For instruction:   %ec = icmp eq i64 %iv.next, 7
LV: Found an estimated cost of 0 for VF 16 For instruction:   br i1 %ec, label %exit, label %loop
LV: Vector loop of width 16 costs: 2.
LV: Selecting VF: 8.

However, my concern here is that we might be fixing this the wrong way. For example, it sounds like we could achieve the same effect if in computePredInstDiscount we increased the scalarisation cost to match the vector cost, so that the discount disappears?

That sounds like doing more work for the same end result? We already know that there can be no discount, because the "vector cost" for an instruction we've already decided to scalarize is already the scalarized cost, so there's no point in calculating the same thing again.

Let's suppose we force the vectoriser to choose VF=8 and in one scenario we choose to scalarise, and in the other we vectorise what would the end result look like? If the generated code looks equally bad in both cases then I'd expect the vector cost and scalar cost to be the same. I worry that we're effectively hiding a bug in the cost model by simply bypassing it for tail-folded loops, however the problem may still remain for normal loops with control flow.

If we've already decided to scalarize by the time we've called computePredInstDiscount (i.e. getWideningDecision returns CM_Scalarize) then using a vector instruction is impossible, because no vector instruction exists for the operation we're trying to do (in this case we're not using SVE, so we don't have a masked vector store instruction).

Contributor:

So the end result may be the same, but by fixing the cost model we hopefully fix one of the root causes and potentially fix other hidden problems that haven't yet come to light.

Also, another root cause here is that we shouldn't even be attempting to tail-fold loops with loads and stores in them if the target does not support masked loads and stores. It's a poor upfront decision made by the vectoriser with no analysis of the trade-off between tail-folding and not tail-folding.

One potential alternative to this patch that I think is worth exploring is before deciding to tail-fold for low trip count loops we can query the target's support for masked loads and stores (see isLegalMaskedStore and isLegalMaskedLoad in LoopVectorize.cpp). If the target says they are not legal just don't bother tail-folding as it's almost certainly not worth it. Perhaps the end result may be even faster, since we may create an unpredicated vector body for the first X iterations? What do you think? @fhahn any thoughts?

Collaborator Author:

Also, another root cause here is that we shouldn't even be attempting to tail-fold loops with loads and stores in them if the target does not support masked loads and stores. It's a poor upfront decision made by the vectoriser with no analysis of the trade-off between tail-folding and not tail-folding.

It appears it used to be the case that not having masked loads and stores disabled tail folding, but 9d1b2acaaa224 changed it to using a cost model instead.

One potential alternative to this patch that I think is worth exploring is before deciding to tail-fold for low trip count loops we can query the target's support for masked loads and stores (see isLegalMaskedStore and isLegalMaskedLoad in LoopVectorize.cpp). If the target says they are not legal just don't bother tail-folding as it's almost certainly not worth it. Perhaps the end result may be even faster, since we may create an unpredicated vector body for the first X iterations? What do you think? @fhahn any thoughts?

TargetTransformInfo::preferPredicateOverEpilogue already defaults to preferring a scalar epilogue, and the two targets that implement their own version (AArch64 and ARM) prefer a scalar epilogue when we don't have masked loads and stores. So the only situation where we even consider tail predication without masked loads/stores is when using a scalar epilogue is impossible, and the choice is then between vectorizing using tail folding or not vectorizing at all.

Contributor:

Hi, I don't actually mean changing preferPredicateOverEpilogue, because this is a top-level decision that's used in general for all loops, regardless of the trip count or code-size concerns due to -Os/-Oz. I meant the code where we decide to vectorise due to the low trip count, which can still happen even when preferPredicateOverEpilogue returns false. For example, if you look in the function LoopVectorizationCostModel::computeMaxVF at this code here:

  // If we don't know the precise trip count, or if the trip count that we
  // found modulo the vectorization factor is not zero, try to fold the tail
  // by masking.
  // FIXME: look for a smaller MaxVF that does divide TC rather than masking.
  setTailFoldingStyles(MaxFactors.ScalableVF.isScalable(), UserIC);
  if (foldTailByMasking()) {

I added this extra code just before:

  if (ScalarEpilogueStatus == CM_ScalarEpilogueNotAllowedLowTripLoop) {
    for (BasicBlock *BB : TheLoop->getBlocks()) {
      for (Instruction &I : *BB) {
        if (isa<LoadInst>(&I) || isa<StoreInst>(&I)) {
          auto *Ptr = getLoadStorePointerOperand(&I);
          auto *Ty = getLoadStoreType(&I);
          const Align Alignment = getLoadStoreAlignment(&I);
          if ((isa<LoadInst>(&I) && !isLegalMaskedLoad(Ty, Ptr, Alignment)) ||
              (isa<StoreInst>(&I) && !isLegalMaskedStore(Ty, Ptr, Alignment))) {
            LLVM_DEBUG(dbgs() << "LV: Not tail-folding due to lack of masked load/store support\n");
            // We could also just return FixedScalableVFPair::getNone() here.
            ScalarEpilogueStatus = CM_ScalarEpilogueAllowed;
            return MaxFactors;
          }
        }
      }
    }
  }

And now the code for @store_const_fixed_trip_count looks something like this:

define void @store_const_fixed_trip_count(ptr %dst) {
entry:
  br i1 false, label %scalar.ph, label %vector.ph

vector.ph:                                        ; preds = %entry
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %0 = add i64 %index, 0
  %1 = getelementptr i8, ptr %dst, i64 %0
  %2 = getelementptr i8, ptr %1, i32 0
  store <4 x i8> <i8 1, i8 1, i8 1, i8 1>, ptr %2, align 1
  %index.next = add nuw i64 %index, 4
  %3 = icmp eq i64 %index.next, 4
  br i1 %3, label %middle.block, label %vector.body, !llvm.loop !0

middle.block:                                     ; preds = %vector.body
  br i1 false, label %exit, label %scalar.ph

scalar.ph:                                        ; preds = %middle.block, %entry
  %bc.resume.val = phi i64 [ 4, %middle.block ], [ 0, %entry ]
  br label %loop

loop:                                             ; preds = %loop, %scalar.ph
  %iv = phi i64 [ %bc.resume.val, %scalar.ph ], [ %iv.next, %loop ]
  %gep = getelementptr i8, ptr %dst, i64 %iv
  store i8 1, ptr %gep, align 1
  %iv.next = add i64 %iv, 1
  %ec = icmp eq i64 %iv.next, 7
  br i1 %ec, label %exit, label %loop, !llvm.loop !3

exit:                                             ; preds = %middle.block, %loop
  ret void
}

The fix you have for computePredInstDiscount could be used as a workaround for now, but ultimately I think the best solution would be to avoid tail-folding altogether.

Contributor:

IIRC the reason to force tail-folding with Oz/Os is to limit code size by avoiding having a vector loop and a scalar remainder loop.

This of course goes completely wrong in this case, as the code size with tail folding is much higher than with a vector and scalar loop.

Perhaps we should directly check for that, i.e. by avoiding vectorization that requires (predicated) replication with Oz/Os?

Collaborator Author:

The fix you have for computePredInstDiscount could be used as a workaround for now, but ultimately I think the best solution would be to avoid tail-folding altogether.

There are situations where tail-folding is faster than not vectorizing, e.g. the following example:

__attribute__((noinline))
void fn(char *p, char a, char b, char c) {
  for (char i = 0; i < 7; ++i) {
    p[i] = i*a + (i>>1)*b + (i>>2)*c;
  }
}

int main() {
  char arr[256];
  fn(arr, 3, 5, 7);
  return arr[0];
}

On cortex-a53 compiling with -O3 -fno-loop-unroll, fn takes 65 cycles if we vectorize using tail folding and 74 cycles if we don't vectorize. I don't know how realistic this is as an example though, as when loop unrolling isn't disabled the loop gets unrolled and that takes 43 cycles, but it does show that there are situations where just disabling tail folding outright leads to a worse result.

Also, in general I think it's better to make decisions based on calculating the cost of things if we can, unless we can guarantee that a certain decision is definitely going to be worse, which we can't do here. I also don't think the change I'm making to computePredInstDiscount is a workaround, as what it's doing is making the cost we calculate closer to the actual cost, and more accurate costs should lead to better decisions.

Collaborator Author:

IIRC the reason to force tail-folding with Oz/Os is to limit code size by avoiding having a vector loop and a scalar remainder loop.

This of course goes completely wrong in this case, as the code size with tail folding is much higher than with a vector and scalar loop.

Perhaps we should directly check for that, i.e. by avoiding vectorization that requires (predicated) replication with Oz/Os?

Perhaps, but there's a general problem that the current cost calculations for Os are just wrong: we use TCK_RecipThroughput everywhere, and for predicated instructions we scale the cost by the block probability. I'll be working on a patch after this to try and fix at least some of this.
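
As a rough illustration of that mismatch (illustrative numbers and variable names only; not the actual code): the reciprocal-throughput cost of a predicated scalarized store is halved because the predicated block is assumed to execute on about half the iterations (getReciprocalPredBlockProb() returns 2), while the emitted code still contains the full per-lane compare/branch/store chain:

#include <iostream>

int main() {
  // Throughput view (what the cost model computes today).
  int VF = 8;
  int PerLaneCost = 1;                       // illustrative per-lane store cost
  int ThroughputCost = VF * PerLaneCost / 2; // halved by the block probability

  // Code-size view (what -Os actually cares about): each lane still emits an
  // extractelement, a conditional branch, the store and a continue block,
  // however rarely those blocks execute.
  int PerLaneInstructions = 4;               // illustrative
  int EmittedInstructions = VF * PerLaneInstructions;

  std::cout << "throughput cost " << ThroughputCost << " vs roughly "
            << EmittedInstructions << " emitted instructions\n";
  return 0;
}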

david-arm (Contributor) commented Oct 18, 2024

I compiled these tests that all end up using predicated blocks, using the command "clang -O3 -mcpu=generic -fno-unroll-loops -S ./pr109289.c":

void lowtrip(short *dest) {
  for (int i = 0; i < 15; i++) {
    dest[i] = 1;
  }
}

void conditional(short *dest, int n) {
  for (int i = 0; i < 16; i++) {
    if (i & 0x1)
      dest[i] = 1;  
  }
}

Compile this test with "clang -O3 -mcpu=generic -fno-unroll-loops -S ./pr109289_optsize.c":

void optsize(short *dest, short val, int n) {
  for (int i = 0; i < n; i++) { 
    dest[i] = val;
  }
} 

The IR before the loop vectoriser looks like this:

== pr109289.c ==

target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128-Fn32"
target triple = "aarch64-unknown-linux-gnu"

; Function Attrs: nofree norecurse nosync nounwind memory(argmem: write) uwtable
define dso_local void @lowtrip(ptr nocapture noundef writeonly %dest) local_unnamed_addr #0 {
entry:
  br label %for.body
 
for.cond.cleanup:                                 ; preds = %for.body
  ret void

for.body:                                         ; preds = %entry, %for.body
  %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
  %arrayidx = getelementptr inbounds i16, ptr %dest, i64 %indvars.iv
  store i16 1, ptr %arrayidx, align 2
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond.not = icmp eq i64 %indvars.iv.next, 15
  br i1 %exitcond.not, label %for.cond.cleanup, label %for.body
}

; Function Attrs: nofree norecurse nosync nounwind memory(argmem: readwrite) uwtable
define dso_local void @conditional(ptr nocapture noundef writeonly %dest, i32 noundef %n) local_unnamed_addr #0 {
entry:
  br label %for.body

for.cond.cleanup:                                 ; preds = %for.inc
  ret void
  
for.body:                                         ; preds = %entry, %for.inc
  %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.inc ]
  %and6 = and i64 %indvars.iv, 1
  %tobool.not = icmp eq i64 %and6, 0
  br i1 %tobool.not, label %for.inc, label %if.then
  
if.then:                                          ; preds = %for.body
  %arrayidx = getelementptr inbounds i16, ptr %dest, i64 %indvars.iv
  store i16 1, ptr %arrayidx, align 2
  br label %for.inc
  
for.inc:                                          ; preds = %for.body, %if.then
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond.not = icmp eq i64 %indvars.iv.next, 16
  br i1 %exitcond.not, label %for.cond.cleanup, label %for.body
}
  
attributes #0 = { "target-cpu"="generic" }

== pr109289_optsize.c ==

target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128-Fn32"
target triple = "aarch64-unknown-linux-gnu"
  
; Function Attrs: nofree norecurse nosync nounwind memory(argmem: write) uwtable
define dso_local void @optsize(ptr nocapture noundef writeonly %dest, i16 noundef %val, i32 noundef %n) local_unnamed_addr #0 {
entry:
  %cmp3 = icmp sgt i32 %n, 0
  br i1 %cmp3, label %for.body.preheader, label %for.cond.cleanup   
  
for.body.preheader:                               ; preds = %entry
  %wide.trip.count = zext nneg i32 %n to i64
  br label %for.body
  
for.cond.cleanup.loopexit:                        ; preds = %for.body
  br label %for.cond.cleanup
 
for.cond.cleanup:                                 ; preds = %for.cond.cleanup.loopexit, %entry
  ret void

for.body:                                         ; preds = %for.body.preheader, %for.body
  %indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.next, %for.body ]
  %arrayidx = getelementptr inbounds i16, ptr %dest, i64 %indvars.iv
  store i16 %val, ptr %arrayidx, align 2   
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond.not = icmp eq i64 %indvars.iv.next, %wide.trip.count
  br i1 %exitcond.not, label %for.cond.cleanup.loopexit, label %for.body
}

attributes #0 = { optsize "target-cpu"="generic" }

In all 3 tests the vectoriser decides to vectorise as follows:

== lowtrip ==
LV: Not allowing scalar epilogue due to low trip count.
LV: Found an estimated cost of 2 for VF 1 For instruction: store i16 1
LV: Scalar loop costs: 4.
LV: Found an estimated cost of 2 for VF 2 For instruction: store i16 1
LV: Vector loop of width 2 costs: 2.
LV: Found an estimated cost of 4 for VF 4 For instruction: store i16 1
LV: Vector loop of width 4 costs: 1.
LV: Found an estimated cost of 8 for VF 8 For instruction: store i16 1
LV: Vector loop of width 8 costs: 1.
LV: Selecting VF: 8.
Executing best plan with VF=8, UF=1

== conditional ==
LV: Found an estimated cost of 0 for VF 1 For instruction: br i1
LV: Found an estimated cost of 2 for VF 1 For instruction: store i16 1
LV: Scalar loop costs: 4.
LV: Found an estimated cost of 4 for VF 2 For instruction: br i1
LV: Found an estimated cost of 2 for VF 2 For instruction: store i16 1
LV: Vector loop of width 2 costs: 5.
LV: Found an estimated cost of 8 for VF 4 For instruction: br i1
LV: Found an estimated cost of 4 for VF 4 For instruction: store i16 1
LV: Vector loop of width 4 costs: 4.
LV: Found an estimated cost of 16 for VF 8 For instruction: br i1
LV: Found an estimated cost of 8 for VF 8 For instruction: store i16 1
LV: Vector loop of width 8 costs: 4.
LV: Selecting VF: 1.

== optsize ==
LV: Not allowing scalar epilogue due to -Os/-Oz.
LV: Found an estimated cost of 2 for VF 1 For instruction: store i16
LV: Scalar loop costs: 4.
LV: Found an estimated cost of 2 for VF 2 For instruction: store i16
LV: Vector loop of width 2 costs: 2.
LV: Found an estimated cost of 4 for VF 4 For instruction: store i16 %val
LV: Vector loop of width 4 costs: 1.
LV: Found an estimated cost of 8 for VF 8 For instruction: store i16 %val
LV: Vector loop of width 8 costs: 1.
LV: Selecting VF: 8.
Executing best plan with VF=8, UF=1

In all three cases we have made the wrong decision based on the currently generated code. These are some example runtimes with vectorisation of all 3 (I had to force vectorisation of conditional) on neoverse-v1 (I ran each function in a loop many times and timed them):

lowtrip: 50 ms
conditional: 31 ms
optsize: 47 ms

Without vectorisation on neoverse-v1:

lowtrip: 26 ms
conditional: 45 ms
optsize: 26 ms

On the surface it appears for lowtrip and optsize we shouldn't be vectorising, but for conditional we should. When I dug into this it seems that although conditional should suffer the same problem as lowtrip and optsize, the codegen is actually much better. This is the final IR for lowtrip and conditional:

define dso_local void @lowtrip(ptr nocapture noundef writeonly %dest) local_unnamed_addr #0 {
entry:
  br label %vector.body

vector.body:                                      ; preds = %pred.store.continue18, %entry
  %index = phi i64 [ 0, %entry ], [ %index.next, %pred.store.continue18 ]
  %vec.ind = phi <8 x i64> [ <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7>, %entry ], [ %vec.ind.next, %pred.store.continue18 ]
  %0 = icmp ult <8 x i64> %vec.ind, <i64 15, i64 15, i64 15, i64 15, i64 15, i64 15, i64 15, i64 15>
  %1 = extractelement <8 x i1> %0, i64 0
  br i1 %1, label %pred.store.if, label %pred.store.continue

...

define dso_local void @conditional(ptr nocapture noundef writeonly %dest, i32 noundef %n) local_unnamed_addr #0 {
entry:
  br label %vector.body

vector.body:                                      ; preds = %pred.store.continue20, %entry
  %index = phi i64 [ 0, %entry ], [ %index.next, %pred.store.continue20 ]
  %vec.ind = phi <8 x i64> [ <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7>, %entry ], [ %vec.ind.next, %pred.store.continue20 ]
  %0 = trunc <8 x i64> %vec.ind to <8 x i1>
  %1 = extractelement <8 x i1> %0, i64 0
  br i1 %1, label %pred.store.if, label %pred.store.continue
  
...

You can see what's happened after the CodeGenPrepare pass for each test:

define dso_local void @lowtrip(ptr nocapture noundef writeonly %dest) local_unnamed_addr #0 {
entry:
  %scevgep = getelementptr i8, ptr %dest, i64 8
  br label %vector.body

vector.body:                                      ; preds = %pred.store.continue18, %entry
  %lsr.iv1 = phi ptr [ %scevgep2, %pred.store.continue18 ], [ %scevgep, %entry ]
  %lsr.iv = phi i64 [ %lsr.iv.next, %pred.store.continue18 ], [ 16, %entry ]
  %vec.ind = phi <8 x i64> [ <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7>, %entry ], [ %vec.ind.next, %pred.store.continue18 ]
  %0 = icmp ult <8 x i64> %vec.ind, <i64 15, i64 15, i64 15, i64 15, i64 15, i64 15, i64 15, i64 15>
  %1 = extractelement <8 x i1> %0, i64 0   
  br i1 %1, label %pred.store.if, label %pred.store.continue
  
pred.store.if:                                    ; preds = %vector.body
  %scevgep3 = getelementptr i8, ptr %lsr.iv1, i64 -8
  store i16 1, ptr %scevgep3, align 2, !tbaa !6
  br label %pred.store.continue

pred.store.continue:                              ; preds = %pred.store.if, %vector.body
  %2 = icmp ult <8 x i64> %vec.ind, <i64 15, i64 15, i64 15, i64 15, i64 15, i64 15, i64 15, i64 15>
  %3 = extractelement <8 x i1> %2, i64 1
  br i1 %3, label %pred.store.if5, label %pred.store.continue6

pred.store.if5:                                   ; preds = %pred.store.continue
  %scevgep5 = getelementptr i8, ptr %lsr.iv1, i64 -6
  store i16 1, ptr %scevgep5, align 2, !tbaa !6
  br label %pred.store.continue6

pred.store.continue6:                             ; preds = %pred.store.if5, %pred.store.continue
  %4 = icmp ult <8 x i64> %vec.ind, <i64 15, i64 15, i64 15, i64 15, i64 15, i64 15, i64 15, i64 15>
  %5 = extractelement <8 x i1> %4, i64 2
  br i1 %5, label %pred.store.if7, label %pred.store.continue8
  
...

define dso_local void @conditional(ptr nocapture noundef writeonly %dest, i32 noundef %n) local_unnamed_addr #0 {
entry:
  %scevgep = getelementptr i8, ptr %dest, i64 8
  br label %vector.body

vector.body:                                      ; preds = %pred.store.continue20, %entry
  %lsr.iv1 = phi ptr [ %scevgep2, %pred.store.continue20 ], [ %scevgep, %entry ]
  %lsr.iv = phi i64 [ %lsr.iv.next, %pred.store.continue20 ], [ 16, %entry ]
  %vec.ind = phi <8 x i64> [ <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7>, %entry ], [ %vec.ind.next, %pred.store.continue20 ]
  %0 = trunc <8 x i64> %vec.ind to <8 x i1>
  %1 = extractelement <8 x i1> %0, i64 0
  br i1 %1, label %pred.store.if, label %pred.store.continue

pred.store.if:                                    ; preds = %vector.body
  %scevgep3 = getelementptr i8, ptr %lsr.iv1, i64 -8
  store i16 1, ptr %scevgep3, align 2, !tbaa !6
  br label %pred.store.continue
  
pred.store.continue:                              ; preds = %pred.store.if, %vector.body
  %2 = extractelement <8 x i1> %0, i64 1
  br i1 %2, label %pred.store.if7, label %pred.store.continue8

pred.store.if7:                                   ; preds = %pred.store.continue
  %scevgep5 = getelementptr i8, ptr %lsr.iv1, i64 -6
  store i16 1, ptr %scevgep5, align 2, !tbaa !6
  br label %pred.store.continue8

pred.store.continue8:                             ; preds = %pred.store.if7, %pred.store.continue
  %3 = extractelement <8 x i1> %0, i64 2
  br i1 %3, label %pred.store.if9, label %pred.store.continue10

...

For the lowtrip and optsize cases the "icmp ult" is being sunk by the CodeGenPrepare pass into the blocks below, which is why the resulting assembly ends up being terrible. However, if we calculated the icmp just once in the loop header and then did an extractelement for each predicated block we'd end up with the same efficient code as the conditional function. In summary, I think we should instead fix the codegen for the lowtrip case. If we do that, then I expect the performance to be better than a scalar loop.

In this case I agree with Florian that the best solution to the bug you're trying to fix is to limit the scope of the patch to just -Os and fix this before going through the cost model. The change in this PR applies to more than just programs compiled with -Os; it also affects low-trip-count loops and targets that prefer tail-folding. It's not obvious to me that always preferring a scalar loop is the right way to go without further analysis of the impact.

Instead, I think that if you're building with -Os and know that the blocks are going to be predicated and replicated due to tail-folding, then we should just bail out immediately. For example, I think that collectInstsToScalarize sets up a list of PredicatedBBsAfterVectorization which you might be able to check in LoopVectorizationPlanner::plan.
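
As a minimal sketch of that idea (assuming the planner can see the cost model's PredicatedBBsAfterVectorization set; the helper name and parameters are illustrative, not the actual LLVM interface):

```cpp
#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Function.h"

using namespace llvm;

// Sketch only: bail out of vectorization early when optimizing for size and
// tail folding would leave predicated, replicated blocks behind.
static bool shouldAbandonVectorization(
    const Function &F, const SmallPtrSetImpl<BasicBlock *> &PredicatedBBs) {
  // Under -Os/-Oz every replicated predicated block is pure code-size cost,
  // so don't bother costing a vector plan at all.
  return F.hasOptSize() && !PredicatedBBs.empty();
}
```

Exactly where such a check lives is a detail; the point is to give up before going through the cost model at all.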

It looks like there is also follow-on work:

  1. Improve the lowering of lowtrip so that it generates more efficient code; I hope/expect that vectorising will then be better than scalar.
  2. Teach the cost model that, for the conditional loop, vectorising is actually more beneficial than a scalar loop, at least on neoverse-v1.

@david-arm
Contributor

Sorry, please ignore my comment about icmp ult <8 x i64> %vec.ind, <i64 15, i64 15, i64 15, i64 15, i64 15, i64 15, i64 15, i64 15> being folded away - that's bogus because the last lane will be false.

david-arm added a commit to david-arm/llvm-project that referenced this pull request Oct 21, 2024
Whilst reviewing PR llvm#109289 and doing some analysis with various
tests involving predicated blocks I noticed that we're making
codegen and performance worse by sinking vector comparisons
multiple times into blocks. It looks like the sinkCmpExpression
in CodeGenPrepare was written for scalar comparisons where there
is only a single condition register, whereas vector comparisons
typically produce a vector result and register pressure is much
lower. Given they are likely to be more expensive than scalar
comparisons it makes sense to avoid sinking too many. The
CodeGen/SystemZ/vec-perm-14.ll test does rely upon sinking a
vector comparison so I've kept that behaviour by permitting one
sink.

Alternatively, I could also introduce a TLI hook to query the
target if this is a preferred solution?
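
Roughly, the kind of check described above looks something like this sketch (the names are illustrative; this is not the actual CodeGenPrepare change):

```cpp
#include "llvm/IR/InstrTypes.h"

using namespace llvm;

// Sketch only: when deciding whether to sink a comparison into the blocks of
// its users, treat vector compares differently from scalar ones. A scalar
// compare ties up the single flags register, so sinking it is usually a win;
// a vector compare lives in an ordinary vector register, so sinking it into
// several blocks just duplicates the work.
static bool worthSinkingCmp(const CmpInst *Cmp, unsigned NumSinksSoFar) {
  if (!Cmp->getType()->isVectorTy())
    return true;              // scalar compare: keep the existing behaviour
  return NumSinksSoFar < 1;   // vector compare: permit at most one sink
}
```
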
@john-brawn-arm
Collaborator Author

john-brawn-arm commented Oct 22, 2024

> For the lowtrip and optsize cases the "icmp ult" is being sunk by the CodeGenPrepare pass into the blocks below, which is why the resulting assembly ends up being terrible. However, if we calculated the icmp just once in the loop header and then did an extractelement for each predicated block we'd end up with the same efficient code as the conditional function. In summary, I think we should instead fix the codegen for the lowtrip case. If we do that, then I expect the performance to be better than a scalar loop.

I don't think this is correct. In the function "conditional", without vectorizing, the "i & 0x1" results in 16 comparisons (one for each iteration), and vectorizing by 8 reduces that to two vector comparisons, so vectorizing reduces the number of comparison instructions. In "lowtrip", the tail folding introduces two vector comparisons where previously there were none, so vectorizing increases the number of comparison instructions.

I went and measured this on cortex-a53, and the total time for 128 iterations of the lowtrip function is:
no vectorization: 7531
vectorization by 8: 28750
vectorization by 8 with codegenprepare hacked to not sink the cmp: 9663

> In this case I agree with Florian that the best solution to the bug you're trying to fix is to limit the scope of the patch to just -Os and fix this before going through the cost model. The change in this PR applies not only to programs compiled with -Os, but also to low trip count loops and to targets that prefer tail-folding. It's not obvious to me that always preferring a scalar loop is the right way to go without further analysis of the impact.

This patch doesn't always prefer a scalar loop. It fixes an error in the cost calculation, which means we now choose a scalar loop when it is faster than the vector loop. My example in the comment above (#109289 (comment)) shows a case where, with this patch, we still vectorize, because the calculated cost shows that the vector loop is faster than a scalar loop.

david-arm added a commit to david-arm/llvm-project that referenced this pull request Oct 24, 2024
Whilst reviewing PR llvm#109289 and doing some analysis with various
tests involving predicated blocks I noticed that we're making
codegen and performance worse by sinking vector comparisons
multiple times into blocks. It looks like the sinkCmpExpression
in CodeGenPrepare was written for scalar comparisons where there
is only a single condition register, whereas vector comparisons
typically produce a vector result. For some targets, such as NEON
or SVE, there are multiple allocatable vector registers that can
store the result and so we should avoid sinking in that case.
david-arm added a commit to david-arm/llvm-project that referenced this pull request Oct 25, 2024
@john-brawn-arm
Collaborator Author

Ping

david-arm added a commit to david-arm/llvm-project that referenced this pull request Nov 5, 2024

david-arm added a commit to david-arm/llvm-project that referenced this pull request Nov 5, 2024
@john-brawn-arm
Collaborator Author

Ping.

@david-arm
Contributor

Hi @john-brawn-arm, thanks for taking the time to look into the
example I mentioned in my previous commit. However, this still
isn't my preferred approach to solving issue #66652
because it's using a performance-based cost model (that can be
unreliable) to try and solve what to me looks like a functional
defect. Here are two examples of problems with this:

  1. As you mentioned yourself, there are still loops where we
    vectorise with predicated blocks at -Os, so this patch is
    only a partial fix. A more complete solution would involve
    dealing with the root cause - @fhahn gave a suggestion about
    how to do this. Did you get a chance to look at this and see
    if it's viable or easy to do?
  2. The change in this PR is effectively dealing with two
    aspects of tail-folding simultaneously: a) low trip count loops
    when building with -O2 or above, and b) normal loops when
    building with -Os. The point I was trying to make in my
    previous message (and I realise now I may not have made this
    very clear) is that the two could potentially be in
    conflict with each other. I don't think it's impossible to
    come up with a loop where the vectorised version with
    predicated blocks is faster than the scalar, whilst
    simultaneously being bad for code size. There is a chance
    that someone reverts this PR later on for exactly this
    reason.

Ideally we'd tackle the code size issue presented in #66652
and the low trip count performance issues with separate PRs,
using solutions that aren't tied to each other. I'd be
interested to know what @fhahn thinks as well here.

@john-brawn-arm
Collaborator Author

The general problem with vectorization cost calculations is, as I've mentioned before, that everything uses reciprocal throughput as the cost kind, but we should be using code size when optimizing for code size. I was going to fix that after this, as it's a much larger change, and we'll still need this patch either way, since the cost calculation it fixes is currently wrong no matter which cost kind is used. I'll get to work on that first and then get back to this.
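
As a rough sketch of that direction (a hypothetical helper, not the actual LoopVectorize code), the cost kind would be picked from the function's size attributes when querying TTI:

```cpp
#include "llvm/Analysis/TargetTransformInfo.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/Instruction.h"

using namespace llvm;

// Hypothetical helper, not the actual LoopVectorize code: query TTI with
// TCK_CodeSize when the function is built for size (-Os/-Oz), and with
// TCK_RecipThroughput otherwise.
static InstructionCost costInstruction(const TargetTransformInfo &TTI,
                                       const Instruction &I) {
  TargetTransformInfo::TargetCostKind Kind =
      I.getFunction()->hasOptSize() ? TargetTransformInfo::TCK_CodeSize
                                    : TargetTransformInfo::TCK_RecipThroughput;
  return TTI.getInstructionCost(&I, Kind);
}
```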
