[LoopVectorize] Don't discount instructions scalarized due to tail folding #109289
Conversation
When an instruction is scalarized due to tail folding, the cost that we calculate (which is then returned by getWideningCost and thus getInstructionCost) is already the scalarized cost, so computePredInstDiscount shouldn't apply a scalarization discount to these instructions.

Fixes llvm#66652
@llvm/pr-subscribers-llvm-analysis @llvm/pr-subscribers-llvm-transforms

Author: John Brawn (john-brawn-arm)

Changes

When an instruction is scalarized due to tail folding, the cost that we calculate (which is then returned by getWideningCost and thus getInstructionCost) is already the scalarized cost, so computePredInstDiscount shouldn't apply a scalarization discount to these instructions.

Fixes #66652

Patch is 87.68 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/109289.diff

5 Files Affected:
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 107fb38be31969..c64f8ccb3a84f7 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -5501,10 +5501,14 @@ InstructionCost LoopVectorizationCostModel::computePredInstDiscount(
// Scale the total scalar cost by block probability.
ScalarCost /= getReciprocalPredBlockProb();
- // Compute the discount. A non-negative discount means the vector version
- // of the instruction costs more, and scalarizing would be beneficial.
- Discount += VectorCost - ScalarCost;
- ScalarCosts[I] = ScalarCost;
+ // Compute the discount, unless this instruction must be scalarized due to
+ // tail folding, as then the vector cost is already the scalar cost. A
+ // non-negative discount means the vector version of the instruction costs
+ // more, and scalarizing would be beneficial.
+ if (!foldTailByMasking() || getWideningDecision(I, VF) != CM_Scalarize) {
+ Discount += VectorCost - ScalarCost;
+ ScalarCosts[I] = ScalarCost;
+ }
}
return Discount;
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll b/llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll
index 9910be7224674c..de1cae7383f863 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll
@@ -529,93 +529,14 @@ define void @latch_branch_cost(ptr %dst) {
; PRED-LABEL: define void @latch_branch_cost(
; PRED-SAME: ptr [[DST:%.*]]) {
; PRED-NEXT: entry:
-; PRED-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
-; PRED: vector.ph:
-; PRED-NEXT: br label [[VECTOR_BODY:%.*]]
-; PRED: vector.body:
-; PRED-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[PRED_STORE_CONTINUE6:%.*]] ]
-; PRED-NEXT: [[VEC_IND:%.*]] = phi <8 x i64> [ <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7>, [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[PRED_STORE_CONTINUE6]] ]
-; PRED-NEXT: [[TMP0:%.*]] = icmp ule <8 x i64> [[VEC_IND]], <i64 99, i64 99, i64 99, i64 99, i64 99, i64 99, i64 99, i64 99>
-; PRED-NEXT: [[TMP1:%.*]] = extractelement <8 x i1> [[TMP0]], i32 0
-; PRED-NEXT: br i1 [[TMP1]], label [[PRED_STORE_IF:%.*]], label [[PRED_STORE_CONTINUE:%.*]]
-; PRED: pred.store.if:
-; PRED-NEXT: [[TMP2:%.*]] = add i64 [[INDEX]], 0
-; PRED-NEXT: [[TMP3:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP2]]
-; PRED-NEXT: store i8 0, ptr [[TMP3]], align 1
-; PRED-NEXT: br label [[PRED_STORE_CONTINUE]]
-; PRED: pred.store.continue:
-; PRED-NEXT: [[TMP4:%.*]] = extractelement <8 x i1> [[TMP0]], i32 1
-; PRED-NEXT: br i1 [[TMP4]], label [[PRED_STORE_IF1:%.*]], label [[PRED_STORE_CONTINUE2:%.*]]
-; PRED: pred.store.if1:
-; PRED-NEXT: [[TMP5:%.*]] = add i64 [[INDEX]], 1
-; PRED-NEXT: [[TMP6:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP5]]
-; PRED-NEXT: store i8 0, ptr [[TMP6]], align 1
-; PRED-NEXT: br label [[PRED_STORE_CONTINUE2]]
-; PRED: pred.store.continue2:
-; PRED-NEXT: [[TMP7:%.*]] = extractelement <8 x i1> [[TMP0]], i32 2
-; PRED-NEXT: br i1 [[TMP7]], label [[PRED_STORE_IF3:%.*]], label [[PRED_STORE_CONTINUE4:%.*]]
-; PRED: pred.store.if3:
-; PRED-NEXT: [[TMP8:%.*]] = add i64 [[INDEX]], 2
-; PRED-NEXT: [[TMP9:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP8]]
-; PRED-NEXT: store i8 0, ptr [[TMP9]], align 1
-; PRED-NEXT: br label [[PRED_STORE_CONTINUE4]]
-; PRED: pred.store.continue4:
-; PRED-NEXT: [[TMP10:%.*]] = extractelement <8 x i1> [[TMP0]], i32 3
-; PRED-NEXT: br i1 [[TMP10]], label [[PRED_STORE_IF5:%.*]], label [[PRED_STORE_CONTINUE7:%.*]]
-; PRED: pred.store.if5:
-; PRED-NEXT: [[TMP11:%.*]] = add i64 [[INDEX]], 3
-; PRED-NEXT: [[TMP12:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP11]]
-; PRED-NEXT: store i8 0, ptr [[TMP12]], align 1
-; PRED-NEXT: br label [[PRED_STORE_CONTINUE7]]
-; PRED: pred.store.continue6:
-; PRED-NEXT: [[TMP13:%.*]] = extractelement <8 x i1> [[TMP0]], i32 4
-; PRED-NEXT: br i1 [[TMP13]], label [[PRED_STORE_IF7:%.*]], label [[PRED_STORE_CONTINUE8:%.*]]
-; PRED: pred.store.if7:
-; PRED-NEXT: [[TMP14:%.*]] = add i64 [[INDEX]], 4
-; PRED-NEXT: [[TMP15:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP14]]
-; PRED-NEXT: store i8 0, ptr [[TMP15]], align 1
-; PRED-NEXT: br label [[PRED_STORE_CONTINUE8]]
-; PRED: pred.store.continue8:
-; PRED-NEXT: [[TMP16:%.*]] = extractelement <8 x i1> [[TMP0]], i32 5
-; PRED-NEXT: br i1 [[TMP16]], label [[PRED_STORE_IF9:%.*]], label [[PRED_STORE_CONTINUE10:%.*]]
-; PRED: pred.store.if9:
-; PRED-NEXT: [[TMP17:%.*]] = add i64 [[INDEX]], 5
-; PRED-NEXT: [[TMP18:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP17]]
-; PRED-NEXT: store i8 0, ptr [[TMP18]], align 1
-; PRED-NEXT: br label [[PRED_STORE_CONTINUE10]]
-; PRED: pred.store.continue10:
-; PRED-NEXT: [[TMP19:%.*]] = extractelement <8 x i1> [[TMP0]], i32 6
-; PRED-NEXT: br i1 [[TMP19]], label [[PRED_STORE_IF11:%.*]], label [[PRED_STORE_CONTINUE12:%.*]]
-; PRED: pred.store.if11:
-; PRED-NEXT: [[TMP20:%.*]] = add i64 [[INDEX]], 6
-; PRED-NEXT: [[TMP21:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP20]]
-; PRED-NEXT: store i8 0, ptr [[TMP21]], align 1
-; PRED-NEXT: br label [[PRED_STORE_CONTINUE12]]
-; PRED: pred.store.continue12:
-; PRED-NEXT: [[TMP22:%.*]] = extractelement <8 x i1> [[TMP0]], i32 7
-; PRED-NEXT: br i1 [[TMP22]], label [[PRED_STORE_IF13:%.*]], label [[PRED_STORE_CONTINUE6]]
-; PRED: pred.store.if13:
-; PRED-NEXT: [[TMP23:%.*]] = add i64 [[INDEX]], 7
-; PRED-NEXT: [[TMP24:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP23]]
-; PRED-NEXT: store i8 0, ptr [[TMP24]], align 1
-; PRED-NEXT: br label [[PRED_STORE_CONTINUE6]]
-; PRED: pred.store.continue14:
-; PRED-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], 8
-; PRED-NEXT: [[VEC_IND_NEXT]] = add <8 x i64> [[VEC_IND]], <i64 8, i64 8, i64 8, i64 8, i64 8, i64 8, i64 8, i64 8>
-; PRED-NEXT: [[TMP25:%.*]] = icmp eq i64 [[INDEX_NEXT]], 104
-; PRED-NEXT: br i1 [[TMP25]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
-; PRED: middle.block:
-; PRED-NEXT: br i1 true, label [[FOR_END:%.*]], label [[SCALAR_PH]]
-; PRED: scalar.ph:
-; PRED-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ 104, [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
; PRED-NEXT: br label [[FOR_BODY:%.*]]
; PRED: loop:
-; PRED-NEXT: [[IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ]
+; PRED-NEXT: [[IV:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ]
; PRED-NEXT: [[GEP:%.*]] = getelementptr i8, ptr [[DST]], i64 [[IV]]
; PRED-NEXT: store i8 0, ptr [[GEP]], align 1
; PRED-NEXT: [[INDVARS_IV_NEXT]] = add i64 [[IV]], 1
; PRED-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], 100
-; PRED-NEXT: br i1 [[EXITCOND_NOT]], label [[FOR_END]], label [[FOR_BODY]], !llvm.loop [[LOOP5:![0-9]+]]
+; PRED-NEXT: br i1 [[EXITCOND_NOT]], label [[EXIT:%.*]], label [[FOR_BODY]]
; PRED: exit:
; PRED-NEXT: ret void
;
@@ -800,27 +721,27 @@ define i32 @header_mask_and_invariant_compare(ptr %A, ptr %B, ptr %C, ptr %D, pt
; PRED-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
; PRED-NEXT: [[ACTIVE_LANE_MASK:%.*]] = phi <vscale x 4 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], [[VECTOR_PH]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], [[VECTOR_BODY]] ]
; PRED-NEXT: [[TMP15:%.*]] = add i64 [[INDEX]], 0
-; PRED-NEXT: [[TMP16:%.*]] = load i32, ptr [[A]], align 4, !alias.scope [[META6:![0-9]+]]
+; PRED-NEXT: [[TMP16:%.*]] = load i32, ptr [[A]], align 4, !alias.scope [[META4:![0-9]+]]
; PRED-NEXT: [[BROADCAST_SPLATINSERT28:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[TMP16]], i64 0
; PRED-NEXT: [[BROADCAST_SPLAT29:%.*]] = shufflevector <vscale x 4 x i32> [[BROADCAST_SPLATINSERT28]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
-; PRED-NEXT: [[TMP17:%.*]] = load i32, ptr [[B]], align 4, !alias.scope [[META9:![0-9]+]]
+; PRED-NEXT: [[TMP17:%.*]] = load i32, ptr [[B]], align 4, !alias.scope [[META7:![0-9]+]]
; PRED-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[TMP17]], i64 0
; PRED-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 4 x i32> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
; PRED-NEXT: [[TMP18:%.*]] = or <vscale x 4 x i32> [[BROADCAST_SPLAT]], [[BROADCAST_SPLAT29]]
-; PRED-NEXT: [[TMP19:%.*]] = load i32, ptr [[C]], align 4, !alias.scope [[META11:![0-9]+]]
+; PRED-NEXT: [[TMP19:%.*]] = load i32, ptr [[C]], align 4, !alias.scope [[META9:![0-9]+]]
; PRED-NEXT: [[BROADCAST_SPLATINSERT30:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[TMP19]], i64 0
; PRED-NEXT: [[BROADCAST_SPLAT31:%.*]] = shufflevector <vscale x 4 x i32> [[BROADCAST_SPLATINSERT30]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
; PRED-NEXT: [[TMP20:%.*]] = icmp ugt <vscale x 4 x i32> [[BROADCAST_SPLAT31]], [[TMP18]]
; PRED-NEXT: [[TMP21:%.*]] = select <vscale x 4 x i1> [[ACTIVE_LANE_MASK]], <vscale x 4 x i1> [[TMP20]], <vscale x 4 x i1> zeroinitializer
; PRED-NEXT: [[TMP22:%.*]] = getelementptr i32, ptr [[D]], i64 [[TMP15]]
-; PRED-NEXT: call void @llvm.masked.scatter.nxv4i32.nxv4p0(<vscale x 4 x i32> [[TMP18]], <vscale x 4 x ptr> [[BROADCAST_SPLAT33]], i32 4, <vscale x 4 x i1> [[TMP21]]), !alias.scope [[META13:![0-9]+]], !noalias [[META15:![0-9]+]]
+; PRED-NEXT: call void @llvm.masked.scatter.nxv4i32.nxv4p0(<vscale x 4 x i32> [[TMP18]], <vscale x 4 x ptr> [[BROADCAST_SPLAT33]], i32 4, <vscale x 4 x i1> [[TMP21]]), !alias.scope [[META11:![0-9]+]], !noalias [[META13:![0-9]+]]
; PRED-NEXT: [[TMP23:%.*]] = getelementptr i32, ptr [[TMP22]], i32 0
-; PRED-NEXT: call void @llvm.masked.store.nxv4i32.p0(<vscale x 4 x i32> zeroinitializer, ptr [[TMP23]], i32 4, <vscale x 4 x i1> [[TMP21]]), !alias.scope [[META17:![0-9]+]], !noalias [[META18:![0-9]+]]
+; PRED-NEXT: call void @llvm.masked.store.nxv4i32.p0(<vscale x 4 x i32> zeroinitializer, ptr [[TMP23]], i32 4, <vscale x 4 x i1> [[TMP21]]), !alias.scope [[META15:![0-9]+]], !noalias [[META16:![0-9]+]]
; PRED-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP9]]
; PRED-NEXT: [[ACTIVE_LANE_MASK_NEXT]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[INDEX]], i64 [[TMP14]])
; PRED-NEXT: [[TMP24:%.*]] = xor <vscale x 4 x i1> [[ACTIVE_LANE_MASK_NEXT]], shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i64 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer)
; PRED-NEXT: [[TMP25:%.*]] = extractelement <vscale x 4 x i1> [[TMP24]], i32 0
-; PRED-NEXT: br i1 [[TMP25]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP19:![0-9]+]]
+; PRED-NEXT: br i1 [[TMP25]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP17:![0-9]+]]
; PRED: middle.block:
; PRED-NEXT: br i1 true, label [[EXIT:%.*]], label [[SCALAR_PH]]
; PRED: scalar.ph:
@@ -842,7 +763,7 @@ define i32 @header_mask_and_invariant_compare(ptr %A, ptr %B, ptr %C, ptr %D, pt
; PRED: loop.latch:
; PRED-NEXT: [[IV_NEXT]] = add i64 [[IV]], 1
; PRED-NEXT: [[C_1:%.*]] = icmp eq i64 [[IV]], [[N]]
-; PRED-NEXT: br i1 [[C_1]], label [[EXIT]], label [[LOOP_HEADER]], !llvm.loop [[LOOP20:![0-9]+]]
+; PRED-NEXT: br i1 [[C_1]], label [[EXIT]], label [[LOOP_HEADER]], !llvm.loop [[LOOP18:![0-9]+]]
; PRED: exit:
; PRED-NEXT: ret i32 0
;
@@ -959,7 +880,7 @@ define void @multiple_exit_conditions(ptr %src, ptr noalias %dst) #1 {
; PRED-NEXT: [[ACTIVE_LANE_MASK_NEXT]] = call <vscale x 2 x i1> @llvm.get.active.lane.mask.nxv2i1.i64(i64 [[INDEX]], i64 [[TMP10]])
; PRED-NEXT: [[TMP16:%.*]] = xor <vscale x 2 x i1> [[ACTIVE_LANE_MASK_NEXT]], shufflevector (<vscale x 2 x i1> insertelement (<vscale x 2 x i1> poison, i1 true, i64 0), <vscale x 2 x i1> poison, <vscale x 2 x i32> zeroinitializer)
; PRED-NEXT: [[TMP17:%.*]] = extractelement <vscale x 2 x i1> [[TMP16]], i32 0
-; PRED-NEXT: br i1 [[TMP17]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP21:![0-9]+]]
+; PRED-NEXT: br i1 [[TMP17]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP19:![0-9]+]]
; PRED: middle.block:
; PRED-NEXT: br i1 true, label [[EXIT:%.*]], label [[SCALAR_PH]]
; PRED: scalar.ph:
@@ -979,7 +900,7 @@ define void @multiple_exit_conditions(ptr %src, ptr noalias %dst) #1 {
; PRED-NEXT: [[PTR_IV_NEXT]] = getelementptr i8, ptr [[PTR_IV]], i64 8
; PRED-NEXT: [[IV_CLAMP:%.*]] = and i64 [[IV]], 4294967294
; PRED-NEXT: [[EC:%.*]] = icmp eq i64 [[IV_CLAMP]], 512
-; PRED-NEXT: br i1 [[EC]], label [[EXIT]], label [[LOOP]], !llvm.loop [[LOOP22:![0-9]+]]
+; PRED-NEXT: br i1 [[EC]], label [[EXIT]], label [[LOOP]], !llvm.loop [[LOOP20:![0-9]+]]
; PRED: exit:
; PRED-NEXT: ret void
;
@@ -1003,208 +924,34 @@ exit:
ret void
}
-define void @low_trip_count_fold_tail_scalarized_store(ptr %dst) {
-; DEFAULT-LABEL: define void @low_trip_count_fold_tail_scalarized_store(
+define void @low_trip_count_store(ptr %dst) {
+; DEFAULT-LABEL: define void @low_trip_count_store(
; DEFAULT-SAME: ptr [[DST:%.*]]) {
; DEFAULT-NEXT: entry:
-; DEFAULT-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
-; DEFAULT: vector.ph:
-; DEFAULT-NEXT: br label [[VECTOR_BODY:%.*]]
-; DEFAULT: vector.body:
-; DEFAULT-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[PRED_STORE_CONTINUE14:%.*]] ]
-; DEFAULT-NEXT: [[VEC_IND:%.*]] = phi <8 x i64> [ <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7>, [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[PRED_STORE_CONTINUE14]] ]
-; DEFAULT-NEXT: [[TMP0:%.*]] = trunc i64 [[INDEX]] to i8
-; DEFAULT-NEXT: [[TMP1:%.*]] = icmp ule <8 x i64> [[VEC_IND]], <i64 6, i64 6, i64 6, i64 6, i64 6, i64 6, i64 6, i64 6>
-; DEFAULT-NEXT: [[TMP2:%.*]] = extractelement <8 x i1> [[TMP1]], i32 0
-; DEFAULT-NEXT: br i1 [[TMP2]], label [[PRED_STORE_IF:%.*]], label [[PRED_STORE_CONTINUE:%.*]]
-; DEFAULT: pred.store.if:
-; DEFAULT-NEXT: [[TMP3:%.*]] = add i64 [[INDEX]], 0
-; DEFAULT-NEXT: [[TMP4:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP3]]
-; DEFAULT-NEXT: [[TMP5:%.*]] = add i8 [[TMP0]], 0
-; DEFAULT-NEXT: store i8 [[TMP5]], ptr [[TMP4]], align 1
-; DEFAULT-NEXT: br label [[PRED_STORE_CONTINUE]]
-; DEFAULT: pred.store.continue:
-; DEFAULT-NEXT: [[TMP6:%.*]] = extractelement <8 x i1> [[TMP1]], i32 1
-; DEFAULT-NEXT: br i1 [[TMP6]], label [[PRED_STORE_IF1:%.*]], label [[PRED_STORE_CONTINUE2:%.*]]
-; DEFAULT: pred.store.if1:
-; DEFAULT-NEXT: [[TMP7:%.*]] = add i64 [[INDEX]], 1
-; DEFAULT-NEXT: [[TMP8:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP7]]
-; DEFAULT-NEXT: [[TMP9:%.*]] = add i8 [[TMP0]], 1
-; DEFAULT-NEXT: store i8 [[TMP9]], ptr [[TMP8]], align 1
-; DEFAULT-NEXT: br label [[PRED_STORE_CONTINUE2]]
-; DEFAULT: pred.store.continue2:
-; DEFAULT-NEXT: [[TMP10:%.*]] = extractelement <8 x i1> [[TMP1]], i32 2
-; DEFAULT-NEXT: br i1 [[TMP10]], label [[PRED_STORE_IF3:%.*]], label [[PRED_STORE_CONTINUE4:%.*]]
-; DEFAULT: pred.store.if3:
-; DEFAULT-NEXT: [[TMP11:%.*]] = add i64 [[INDEX]], 2
-; DEFAULT-NEXT: [[TMP12:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP11]]
-; DEFAULT-NEXT: [[TMP13:%.*]] = add i8 [[TMP0]], 2
-; DEFAULT-NEXT: store i8 [[TMP13]], ptr [[TMP12]], align 1
-; DEFAULT-NEXT: br label [[PRED_STORE_CONTINUE4]]
-; DEFAULT: pred.store.continue4:
-; DEFAULT-NEXT: [[TMP14:%.*]] = extractelement <8 x i1> [[TMP1]], i32 3
-; DEFAULT-NEXT: br i1 [[TMP14]], label [[PRED_STORE_IF5:%.*]], label [[PRED_STORE_CONTINUE6:%.*]]
-; DEFAULT: pred.store.if5:
-; DEFAULT-NEXT: [[TMP15:%.*]] = add i64 [[INDEX]], 3
-; DEFAULT-NEXT: [[TMP16:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP15]]
-; DEFAULT-NEXT: [[TMP17:%.*]] = add i8 [[TMP0]], 3
-; DEFAULT-NEXT: store i8 [[TMP17]], ptr [[TMP16]], align 1
-; DEFAULT-NEXT: br label [[PRED_STORE_CONTINUE6]]
-; DEFAULT: pred.store.continue6:
-; DEFAULT-NEXT: [[TMP18:%.*]] = extractelement <8 x i1> [[TMP1]], i32 4
-; DEFAULT-NEXT: br i1 [[TMP18]], label [[PRED_STORE_IF7:%.*]], label [[PRED_STORE_CONTINUE8:%.*]]
-; DEFAULT: pred.store.if7:
-; DEFAULT-NEXT: [[TMP19:%.*]] = add i64 [[INDEX]], 4
-; DEFAULT-NEXT: [[TMP20:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP19]]
-; DEFAULT-NEXT: [[TMP21:%.*]] = add i8 [[TMP0]], 4
-; DEFAULT-NEXT: store i8 [[TMP21]], ptr [[TMP20]], align 1
-; DEFAULT-NEXT: br label [[PRED_STORE_CONTINUE8]]
-; DEFAULT: pred.store.continue8:
-; DEFAULT-NEXT: [[TMP22:%.*]] = extractelement <8 x i1> [[TMP1]], i32 5
-; DEFAULT-NEXT: br i1 [[TMP22]], label [[PRED_STORE_IF9:%.*]], label [[PRED_STORE_CONTINUE10:%.*]]
-; DEFAULT: pred.store.if9:
-; DEFAULT-NEXT: [[TMP23:%.*]] = add i64 [[INDEX]], 5
-; DEFAULT-NEXT: [[TMP24:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP23]]
-; DEFAULT-NEXT: [[TMP25:%.*]] = add i8 [[TMP0]], 5
-; DEFAULT-NEXT: store i8 [[TMP25]], ptr [[TMP24]], align 1
-; DEFAULT-NEXT: br label [[PRED_STORE_CONTINUE10]]
-; DEFAULT: pred.store.continue10:
-; DEFAULT-NEXT: [[TMP26:%.*]] = extractelement <8 x i1> [[TMP1]], i32 6
-; DEFAULT-NEXT: br i1 [[TMP26]], label [[PRED_STORE_IF11:%.*]], label [[PRED_STORE_CONTINUE12:%.*]]
-; DEFAULT: pred.store.if11:
-; DEFAULT-NEXT: [[TMP27:%.*]] = add i64 [[INDEX]], 6
-; DEFAULT-NEXT: [[TMP28:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP27]]
-; DEFAULT-NEXT: [[TMP29:%.*]] = add i8 [[TMP0]], 6
-; DEFAULT-NEXT: store i8 [[TMP29]], ptr [[TMP28]], align 1
-; DEFAULT-NEXT: br label [[PRED_STORE_CONTINUE12]]
-; DEFAULT: pred.store.continue12:
-; DEFAULT-NEXT: [[TMP30:%.*]] = extractelement <8 x i1> [[TMP1]], i32 7
-; DEFAULT-NEXT: br i1 [[TMP30]], label [[PRED_STORE_IF13:%.*]], label [[PRED_STORE_CONTINUE14]]
-; DEFAULT: pred.store.if13:
-; DEFAULT-NEXT: [[TMP31:%.*]] = add i64 [[INDEX]], 7
-; DEFAULT-NEXT: [[TMP32:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP31]]
-; DEFAULT-NEXT: [[TMP33:%.*]] = add i8 [[TMP0]], 7
-; DEFAULT-NEXT: store i8 [[TMP33]], ptr [[TMP32]], align 1
-; DEFAULT-NEXT: br label [[PRED_STORE_CONTINUE14]]
-; DEFAULT: pred.store.continue14:
-; DEFAULT-NEXT: [[VEC_IND_NEXT]] = add <8 x i64> [[VEC_IND]], <i64 8, i64 8, i64 8, i64 8, i64 8, i64 8, i64 8, i64 8>
-; DEFAULT-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], 8
-; DEFAULT-NEXT: br i1 true, label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP26:![0-9]+]]
-; DEFAULT: middle.block:
-; DEFAULT-NEXT: br i1 true, label [[EXIT:%.*]], label [[SCALAR_PH]]
-; DEFAULT: scalar.ph:
-; DEFAULT-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ 8, [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
; DEFAULT-NEXT: br label [[LOOP:%.*]]
; DEFAULT: loop:
-; DEFAULT-NEXT: [[IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], [[LOOP]] ]
+; DEFAULT-NEXT: [[IV:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[IV_NEXT:%.*]], [[LOOP]] ]
; DEFAULT-NEXT: [[IV_TRUNC:%.*]] = trunc i64 [[IV]] to i8
; DEFAULT-NEXT: [[GEP:%.*]] = getelementptr i8, ptr [[DST]], i64 [[IV]]
; DEFAULT-NEXT: store i8 [[IV_TRUNC]], ptr [[GEP]], align 1
; DEFAULT-NEXT: [[IV_NEXT]] = add i64 [[IV]], 1
; DEFAULT-NEXT: [[EC:%.*]] = icmp eq i64 [[IV_NEXT]], 7
-; DEFAULT-NEXT: br i1 [[EC]], label [[EXIT]], label [[LOOP]], !llvm.loop [[LOOP27:![0-9]+]]
+; DEFAULT-NEXT: br i1 [[EC]], label [[EXIT:%.*]], label [[LOOP]]
; DEFAULT: exit:
; DEFAULT-NEXT: ret void
;
-; PRED-LABEL: define void @low_trip_count_fold_tail_scalarized_store(
+; PRED-LABEL: define void @low_trip_count_store(
; PRED-SAME: ptr [[DST:%.*]]) {
; PRED-NEXT: entry:
-; PRED-NEXT: br i1 false, label [[SCALAR_PH...
[truncated]
; CHECK-NEXT: EMIT vp<[[CMP:%.+]]> = icmp ule ir<%iv>, vp<[[BTC]]>
; CHECK-NEXT: Successor(s): pred.load
At first glance this looks worse than before, unless I'm missing something. It looks like previously we were reusing the same predicated blocks to perform both the load and store, i.e. something like
%cmp = icmp ... <4 x i32>
%lane0 = extractelement <4 x i32> %cmp, i32 0
br i1 %lane0, label %block1.if, label %block1.continue
%block1.if:
.. do load ..
.. do store ..
...
whereas now we've essentially split out the loads and stores with duplicate control flow, i.e.
%cmp = icmp ... <4 x i32>
%lane0 = extractelement <4 x i32> %cmp, i32 0
br i1 %lane0, label %block1.load.if, label %block1.load.continue
%block1.if:
.. do load ..
...
%stores:
%lane0.1 = extractelement <4 x i32> %cmp, i32 0
br i1 %lane0.1, label %block1.store.if, label %block1.store.continue
...
I'd expect the extra control flow to hurt performance.
This test is a bit strange, as both before and after this patch this VPlan isn't used: the cost model decides the scalar version costs less. As to why it decides this VPlan is the best for this vector length, computePredInstDiscount doesn't currently account for any control-flow instructions it would be able to remove by scalarizing everything. I don't know how it could do that, though, as it's making a decision about whether to scalarize the chain of instructions feeding into a store, and the choice to scalarize the loads happens elsewhere.
Perhaps the best thing to do here is to change the test to check that it's making the correct decision, which is not to vectorize.
// tail folding, as then the vector cost is already the scalar cost. A
// non-negative discount means the vector version of the instruction costs
// more, and scalarizing would be beneficial.
if (!foldTailByMasking() || getWideningDecision(I, VF) != CM_Scalarize) {
I must admit I'm not too familiar with this code and I need some time to understand what effect this change has on the costs, but I'll take a deeper look next week!
OK, so this patch is saying that if a block needs predication due to tail-folding and we've already decided to scalarise the instruction for the vector VF, then we shouldn't apply a discount. However, it feels like there are two problems with this:
- I don't see why we should restrict this to tail-folding only. If we've also made the widening decision to scalarise for non-tail-folded loops then surely we'd also not want to calculate the discount?
- In theory the discount should really be close to zero if we've made the decision to scalarise. Unless I've misunderstood something, it seems like a more fundamental problem here is why VectorCost is larger than ScalarCost for the scenario you're interested in? Perhaps ScalarCost is too low?
I think it would be helpful to add some cost model tests to this patch that have some debug output showing how the costs for each VF change. What seems to be happening with this change is that we're now reporting higher loop costs for VF > 1, and this is leading to the decision not to vectorise at all.
- I don't see why we should restrict this to tail-folding only. If we've also made the widening decision to scalarise for non-tail-folded loops then surely we'd also not want to calculate the discount?
This is possibly true, but I tried that and there are a lot more test changes as a result, and briefly looking at them it wasn't immediately obvious whether they were better or worse.
- In theory the discount should really be close to zero if we've made the decision to scalarise. Unless I've misunderstood something, it seems like a more fundamental problem here is why VectorCost is larger than ScalarCost for the scenario you're interested in? Perhaps ScalarCost is too low?
Looking at low_trip_count_store in llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll, the current sequence of events (for a vector length of 4) is:
- getMemInstScalarizationCost is called on the store, which needs to be scalarized and predicated because of tail masking, and calculates a cost of 20, which includes the cost of doing the compare and branch into the predicated block.
- computePredInstDiscount uses this as the VectorCost, as it uses getInstructionCost, which for scalarized instructions returns the cost value in InstsToScalarize
- The calculation of ScalarCost assumes that the predicated block already exists, so it just calculates the cost of moving the instruction into the predicated block and scalarizing it in there, which it calculates as 4 (for 4 copies of the store)
- This results in a discount of 16, causing 20-16=4 to be used as the cost of the scalarized store.
ScalarCost is too low for the scalarized store, because it's assuming that the predicated block already exists, but the cost of the predicated block is exactly what we need to take into account to avoid pointless tail folding by masking.
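To make that arithmetic concrete, here is a small illustrative C++ fragment. The numbers are the ones quoted above for VF=4; the FinalCost name is just for illustration and none of this is actual LoopVectorize code.

int VectorCost = 20; // getMemInstScalarizationCost: scalarized + predicated
                     // store, including the compare and branch for each lane
int ScalarCost = 4;  // computePredInstDiscount's scalar cost: 4 scalar stores,
                     // assuming the predicated blocks already exist
int Discount = VectorCost - ScalarCost; // 16
int FinalCost = VectorCost - Discount;  // 20 - 16 = 4: the cost of creating
                                        // the predicated blocks has vanished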
I think it would be helpful to add some cost model tests to this patch that have some debug output showing how the costs for each VF change. What seems to be happening with this change is that we're now reporting higher loop costs for VF > 1, and this is leading to the decision not to vectorise at all.
I'll add these tests.
; CHECK: LV: Vector loop of width 4 costs: 4.
; CHECK: LV: Found an estimated cost of 0 for VF 8 For instruction: %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
; CHECK: LV: Found an estimated cost of 0 for VF 8 For instruction: %gep = getelementptr i8, ptr %dst, i64 %iv
; CHECK: LV: Found an estimated cost of 32 for VF 8 For instruction: store i8 1, ptr %gep, align 1
Thanks for adding the test! However, I still don't know how your patch affects the cost model. For example, what were the costs before your change since that helps to understand why this patch now chooses VF=1. I suspect what your change does is increase the effective cost of the store by saying that we shouldn't scalarise it. However, my concern here is that we might be fixing this the wrong way. For example, it sounds like we could achieve the same effect if in computePredInstDiscount we increased the scalarisation cost to match the vector cost, so that the discount disappears?
Let's suppose we force the vectoriser to choose VF=8 and in one scenario we choose to scalarise, and in the other we vectorise what would the end result look like? If the generated code looks equally bad in both cases then I'd expect the vector cost and scalar cost to be the same. I worry that we're effectively hiding a bug in the cost model by simply bypassing it for tail-folded loops, however the problem may still remain for normal loops with control flow.
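For clarity, a minimal sketch of what that alternative might look like, reusing the names from the existing computePredInstDiscount code above; this is illustrative only, not proposed or tested code.

// Raise ScalarCost to match VectorCost for instructions already scalarized
// due to tail folding, so the discount for them becomes zero.
if (foldTailByMasking() && getWideningDecision(I, VF) == CM_Scalarize)
  ScalarCost = VectorCost; // discount = VectorCost - ScalarCost = 0
Discount += VectorCost - ScalarCost;
ScalarCosts[I] = ScalarCost;

Either way the discount for these instructions ends up at zero; the difference is only whether an entry is still recorded in ScalarCosts for them.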
Thanks for adding the test! However, I still don't know how your patch affects the cost model. For example, what were the costs before your change since that helps to understand why this patch now chooses VF=1
Current costs, without this patch:
LV: Found an estimated cost of 0 for VF 1 For instruction: %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
LV: Found an estimated cost of 0 for VF 1 For instruction: %iv.trunc = trunc i64 %iv to i8
LV: Found an estimated cost of 0 for VF 1 For instruction: %gep = getelementptr i8, ptr %dst, i64 %iv
LV: Found an estimated cost of 2 for VF 1 For instruction: store i8 %iv.trunc, ptr %gep, align 1
LV: Found an estimated cost of 1 for VF 1 For instruction: %iv.next = add i64 %iv, 1
LV: Found an estimated cost of 1 for VF 1 For instruction: %ec = icmp eq i64 %iv.next, 7
LV: Found an estimated cost of 0 for VF 1 For instruction: br i1 %ec, label %exit, label %loop
LV: Scalar loop costs: 4.
LV: Found an estimated cost of 0 for VF 2 For instruction: %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
LV: Found an estimated cost of 2 for VF 2 For instruction: %iv.trunc = trunc i64 %iv to i8
LV: Found an estimated cost of 0 for VF 2 For instruction: %gep = getelementptr i8, ptr %dst, i64 %iv
LV: Found an estimated cost of 2 for VF 2 For instruction: store i8 %iv.trunc, ptr %gep, align 1
LV: Found an estimated cost of 1 for VF 2 For instruction: %iv.next = add i64 %iv, 1
LV: Found an estimated cost of 1 for VF 2 For instruction: %ec = icmp eq i64 %iv.next, 7
LV: Found an estimated cost of 0 for VF 2 For instruction: br i1 %ec, label %exit, label %loop
LV: Vector loop of width 2 costs: 3.
LV: Found an estimated cost of 0 for VF 4 For instruction: %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
LV: Found an estimated cost of 4 for VF 4 For instruction: %iv.trunc = trunc i64 %iv to i8
LV: Found an estimated cost of 0 for VF 4 For instruction: %gep = getelementptr i8, ptr %dst, i64 %iv
LV: Found an estimated cost of 4 for VF 4 For instruction: store i8 %iv.trunc, ptr %gep, align 1
LV: Found an estimated cost of 2 for VF 4 For instruction: %iv.next = add i64 %iv, 1
LV: Found an estimated cost of 1 for VF 4 For instruction: %ec = icmp eq i64 %iv.next, 7
LV: Found an estimated cost of 0 for VF 4 For instruction: br i1 %ec, label %exit, label %loop
LV: Vector loop of width 4 costs: 2.
LV: Found an estimated cost of 0 for VF 8 For instruction: %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
LV: Found an estimated cost of 8 for VF 8 For instruction: %iv.trunc = trunc i64 %iv to i8
LV: Found an estimated cost of 0 for VF 8 For instruction: %gep = getelementptr i8, ptr %dst, i64 %iv
LV: Found an estimated cost of 8 for VF 8 For instruction: store i8 %iv.trunc, ptr %gep, align 1
LV: Found an estimated cost of 4 for VF 8 For instruction: %iv.next = add i64 %iv, 1
LV: Found an estimated cost of 1 for VF 8 For instruction: %ec = icmp eq i64 %iv.next, 7
LV: Found an estimated cost of 0 for VF 8 For instruction: br i1 %ec, label %exit, label %loop
LV: Vector loop of width 8 costs: 2.
LV: Found an estimated cost of 0 for VF 16 For instruction: %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
LV: Found an estimated cost of 16 for VF 16 For instruction: %iv.trunc = trunc i64 %iv to i8
LV: Found an estimated cost of 0 for VF 16 For instruction: %gep = getelementptr i8, ptr %dst, i64 %iv
LV: Found an estimated cost of 16 for VF 16 For instruction: store i8 %iv.trunc, ptr %gep, align 1
LV: Found an estimated cost of 8 for VF 16 For instruction: %iv.next = add i64 %iv, 1
LV: Found an estimated cost of 1 for VF 16 For instruction: %ec = icmp eq i64 %iv.next, 7
LV: Found an estimated cost of 0 for VF 16 For instruction: br i1 %ec, label %exit, label %loop
LV: Vector loop of width 16 costs: 2.
LV: Selecting VF: 8.
However, my concern here is that we might be fixing this the wrong way. For example, it sounds like we could achieve the same effect if in computePredInstDiscount we increased the scalarisation cost to match the vector cost, so that the discount disappears?
That sounds like doing more work for the same end result? We already know that there can be no discount, because the "vector cost" for an instruction we've already decided to scalarize is already the scalarized cost, so there's no point in calculating the same thing again.
Let's suppose we force the vectoriser to choose VF=8 and in one scenario we choose to scalarise, and in the other we vectorise what would the end result look like? If the generated code looks equally bad in both cases then I'd expect the vector cost and scalar cost to be the same. I worry that we're effectively hiding a bug in the cost model by simply bypassing it for tail-folded loops, however the problem may still remain for normal loops with control flow.
If we've already decided to scalarize by the time we've called computePredInstDiscount (i.e. getWideningDecision returns CM_Scalarize) then that means using a vector instruction is impossible (because no vector instruction exists for the operation we're trying to do; in this case we're not using SVE, so we don't have a masked vector store instruction).
So the end result may be the same, but by fixing the cost model we hopefully fix one of the root causes and potentially fix other hidden problems that haven't yet come to light.
Also, another root cause here is that we shouldn't even be attempting to tail-fold loops with loads and stores in them if the target does not support masked loads and stores. It's a poor upfront decision made by the vectoriser with no analysis of the trade-off between tail-folding and not tail-folding.
One potential alternative to this patch that I think is worth exploring: before deciding to tail-fold for low trip count loops, we can query the target's support for masked loads and stores (see isLegalMaskedStore and isLegalMaskedLoad in LoopVectorize.cpp). If the target says they are not legal, just don't bother tail-folding, as it's almost certainly not worth it. Perhaps the end result may be even faster, since we may create an unpredicated vector body for the first X iterations? What do you think? @fhahn any thoughts?
Also, another root cause here is that we shouldn't even be attempting to tail-fold loops with loads and stores in them if the target does not support masked loads and stores. It's a poor upfront decision made by the vectoriser with no analysis of the trade-off between tail-folding and not tail-folding.
It appears it used to be the case that not having masked loads and stores disabled tail folding, but 9d1b2acaaa224 changed it to using a cost model instead.
One potential alternative to this patch that I think is worth exploring: before deciding to tail-fold for low trip count loops, we can query the target's support for masked loads and stores (see isLegalMaskedStore and isLegalMaskedLoad in LoopVectorize.cpp). If the target says they are not legal, just don't bother tail-folding, as it's almost certainly not worth it. Perhaps the end result may be even faster, since we may create an unpredicated vector body for the first X iterations? What do you think? @fhahn any thoughts?
TargetTransformInfo::preferPredicateOverEpilogue already defaults to preferring a scalar epilogue, and the two targets that implement their own version (AArch64 and ARM) prefer a scalar epilogue when we don't have masked loads and stores. So the only situation where we're even considering tail predication without masked loads/stores is when using a scalar epilogue is impossible, so the choice is between vectorizing using tail folding or not vectorizing at all.
Hi, I don't actually mean changing preferPredicateOverEpilogue, because this is a top-level decision that's used in general for all loops, regardless of the trip count or code size concerns due to -Os/-Oz. I meant the code where we decide to vectorise due to the low trip count, which can still happen even when preferPredicateOverEpilogue returns false. For example, if you look in the function LoopVectorizationCostModel::computeMaxVF at this code here:
// If we don't know the precise trip count, or if the trip count that we
// found modulo the vectorization factor is not zero, try to fold the tail
// by masking.
// FIXME: look for a smaller MaxVF that does divide TC rather than masking.
setTailFoldingStyles(MaxFactors.ScalableVF.isScalable(), UserIC);
if (foldTailByMasking()) {
I added this extra code just before:
if (ScalarEpilogueStatus == CM_ScalarEpilogueNotAllowedLowTripLoop) {
  for (BasicBlock *BB : TheLoop->getBlocks()) {
    for (Instruction &I : *BB) {
      if (isa<LoadInst>(&I) || isa<StoreInst>(&I)) {
        auto *Ptr = getLoadStorePointerOperand(&I);
        auto *Ty = getLoadStoreType(&I);
        const Align Alignment = getLoadStoreAlignment(&I);
        if ((isa<LoadInst>(&I) && !isLegalMaskedLoad(Ty, Ptr, Alignment)) ||
            (isa<StoreInst>(&I) && !isLegalMaskedStore(Ty, Ptr, Alignment))) {
          LLVM_DEBUG(dbgs() << "LV: Not tail-folding due to lack of masked load/store support\n");
          // We could also just return FixedScalableVFPair::getNone() here.
          ScalarEpilogueStatus = CM_ScalarEpilogueAllowed;
          return MaxFactors;
        }
      }
    }
  }
}
And now the code for @store_const_fixed_trip_count looks something like this:
define void @store_const_fixed_trip_count(ptr %dst) {
entry:
br i1 false, label %scalar.ph, label %vector.ph
vector.ph: ; preds = %entry
br label %vector.body
vector.body: ; preds = %vector.body, %vector.ph
%index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%0 = add i64 %index, 0
%1 = getelementptr i8, ptr %dst, i64 %0
%2 = getelementptr i8, ptr %1, i32 0
store <4 x i8> <i8 1, i8 1, i8 1, i8 1>, ptr %2, align 1
%index.next = add nuw i64 %index, 4
%3 = icmp eq i64 %index.next, 4
br i1 %3, label %middle.block, label %vector.body, !llvm.loop !0
middle.block: ; preds = %vector.body
br i1 false, label %exit, label %scalar.ph
scalar.ph: ; preds = %middle.block, %entry
%bc.resume.val = phi i64 [ 4, %middle.block ], [ 0, %entry ]
br label %loop
loop: ; preds = %loop, %scalar.ph
%iv = phi i64 [ %bc.resume.val, %scalar.ph ], [ %iv.next, %loop ]
%gep = getelementptr i8, ptr %dst, i64 %iv
store i8 1, ptr %gep, align 1
%iv.next = add i64 %iv, 1
%ec = icmp eq i64 %iv.next, 7
br i1 %ec, label %exit, label %loop, !llvm.loop !3
exit: ; preds = %middle.block, %loop
ret void
}
The fix you have for computePredInstDiscount could be used as a workaround for now, but ultimately I think the best solution would be to avoid tail-folding altogether.
IIRC the reason to force tail-folding with Oz/Os is to limit code size by avoiding having a vector loop and a scalar remainder loop.
This of course goes completely wrong in this case, as the code size with tail folding is much higher than with a vector and scalar loop.
Perhaps we should directly check for that, i.e. by avoiding vectorization that requires (predicated) replication with Oz/Os?
The fix you have for computePredInstDiscount could be used as a workaround for now, but ultimately I think the best solution would be to avoid tail-folding altogether.
There are situations where tail-folding is faster than not vectorizing, e.g. the following example
__attribute__((noinline))
void fn(char *p, char a, char b, char c) {
  for (char i = 0; i < 7; ++i) {
    p[i] = i*a + (i>>1)*b + (i>>2)*c;
  }
}

int main() {
  char arr[256];
  fn(arr, 3, 5, 7);
  return arr[0];
}
On cortex-a53, compiling with -O3 -fno-loop-unroll, fn takes 65 cycles if we vectorize using tail folding and 74 cycles if we don't vectorize. I don't know how realistic this is as an example though, as when loop unrolling isn't disabled the loop gets unrolled and that takes 43 cycles, but it does show that there are situations where just disabling tail folding outright leads to a worse result.
Also in general I think it's better to make decisions based on calculating the cost of things if we can, unless we can guarantee that a certain decision is definitely going to be worse, which we can't do here. I also don't think the change I'm making to computePredInstDiscount is a workaround, as what it's doing is making the cost we calculate more accurate to the actual cost of doing this, and more accurate costs should lead to better decisions.
IIRC the reason to force tail-folding with Oz/Os is to limit code size by avoiding having a vector loop and a scalar remainder loop.
This of course goes completely wrong in this case, as the code size with tail folding is much higher than with a vector and scalar loop.
Perhaps we should directly check for that, i.e. by avoiding vectorization that requires (predicated) replication with Oz/Os?
Perhaps, but there's a general problem that the current cost calculations for Os are just wrong: we use TCK_RecipThroughput everywhere, and for predicated instructions we scale the cost by the block probability. I'll be working on a patch after this to try and fix at least some of this.
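As a rough illustration of that direction, here is a hedged sketch of how the cost kind could be chosen from the optimization mode rather than always using reciprocal throughput. F, TTI and I stand for the enclosing Function, the TargetTransformInfo and the Instruction being costed; this is an assumption about the shape of such a change, not the actual planned patch.

// Use a size-based cost kind when the function is optimizing for size.
TargetTransformInfo::TargetCostKind CostKind =
    F->hasOptSize() ? TargetTransformInfo::TCK_CodeSize
                    : TargetTransformInfo::TCK_RecipThroughput;
InstructionCost Cost = TTI.getInstructionCost(&I, CostKind);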
I compiled these tests that all end up using predicated blocks, using the command "clang -O3 -mcpu=generic -fno-unroll-loops -S ./pr109289.c":
Compile this test with "clang -O3 -mcpu=generic -fno-unroll-loops -S ./pr109289_optsize.c":
The IR before the loop vectoriser looks like this: == pr109289.c ==
== pr109289_optsize.c ==
In all 3 tests the vectoriser decides to vectorise as follows:

== lowtrip ==
== conditional ==
== optsize ==

In all three cases we have made the wrong decision based on the currently generated code. These are some example runtimes with vectorisation of all 3 (I had to force vectorisation of conditional) on neoverse-v1 (I ran each function in a loop many times and timed them):

lowtrip: 50 ms

Without vectorisation on neoverse-v1:

lowtrip: 26 ms

On the surface it appears for lowtrip and optsize we shouldn't be vectorising, but for conditional we should. When I dug into this it seems that although conditional should suffer the same problem as lowtrip and optsize, the codegen is actually much better. This is the final IR for lowtrip and conditional:
You can see what's happened after the CodeGenPrepare pass for each test:
For the lowtrip and optsize cases the "icmp ult" is being sunk by the CodeGenPrepare pass into the blocks below, which is why the resulting assembly ends up being terrible. However, if we calculated the icmp just once in the loop header and then did an extractelement for each predicated block we'd end up with the same efficient code as the conditional function.

In summary, I think we should instead fix the codegen for the lowtrip case. If we do that, then I expect the performance to be better than a scalar loop. In this case I agree with Florian that the best solution to the bug you're trying to fix is to limit the scope of the patch to just -Os and fix this before going through the cost model. The change in this PR is being applied to more than just programs compiled with -Os, but also for lowtrip loops or targets that prefer tail-folding. It's not obvious to me that always preferring a scalar loop is the right way to go without further analysis of the impact. Instead, I think if you're building with -Os and know that the blocks are going to be predicated due to tail-folding and replicated then we should just bail out immediately. For example, I think that

It looks like there is also follow-on work:
Sorry, please ignore my comment about
Whilst reviewing PR llvm#109289 and doing some analysis with various tests involving predicated blocks I noticed that we're making codegen and performance worse by sinking vector comparisons multiple times into blocks. It looks like the sinkCmpExpression in CodeGenPrepare was written for scalar comparisons where there is only a single condition register, whereas vector comparisons typically produce a vector result and register pressure is much lower. Given they are likely to be more expensive than scalar comparisons it makes sense to avoid sinking too many. The CodeGen/SystemZ/vec-perm-14.ll test does rely upon sinking a vector comparison so I've kept that behaviour by permitting one sink. Alternatively, I could also introduce a TLI hook to query the target if this is a preferred solution?
I don't think this is correct. In the function "conditional" without vectorizing the "i & 0x1" results in 16 comparisons (one for each iteration), and vectorizing by 8 reduces it to two vector comparisons, so vectorizing reduces the number of comparison instructions. In "lowtrip" the tail folding introduces two vector comparisons where we had no comparison before, so vectorizing increases the number of comparison instructions. I went and measured this on cortex-a53, and the total time for 128 iterations of the lowtrip function is:
This patch doesn't always prefer a scalar loop. It fixes an error in the cost calculation, which leads to us choosing a scalar loop when it's faster than a vector loop. My example in the comment above (#109289 (comment)) shows a case where with this patch we still vectorize, as the calculated cost shows that it's faster than a scalar loop.
Whilst reviewing PR llvm#109289 and doing some analysis with various tests involving predicated blocks I noticed that we're making codegen and performance worse by sinking vector comparisons multiple times into blocks. It looks like the sinkCmpExpression in CodeGenPrepare was written for scalar comparisons where there is only a single condition register, whereas vector comparisons typically produce a vector result. For some targets, such as NEON or SVE, there are multiple allocatable vector registers that can store the result and so we should avoid sinking in that case.
Ping
Ping.
Hi @john-brawn-arm, thanks for taking the time to look into the
Ideally we'd tackle the code size issue presented in #66652
The general problem with vectorization cost calculations is, as I've mentioned before, that everything uses the reciprocal throughput as the cost kind, but we should be using code size when optimizing for code size. I was going to fix that after this, as it's a much larger change, and we'll still need this patch as the cost calculation that it fixes is currently wrong no matter what cost kind is being used, but I'll get to work on getting it done and then get back to this.
When an instruction is scalarized due to tail folding the cost that we calculate (which is then returned by getWideningCost and thus getInstructionCost) is already the scalarized cost, so computePredInstDiscount shouldn't apply a scalarization discount to these instructions.
Fixes #66652