Bizarre optimization of simple array-filling loop in -Os #66652
Yeah, even with -fno-unroll-loops it is producing a lot of code: https://godbolt.org/z/aGbs7Yos6
The RISC-V code, on the other hand: https://godbolt.org/z/5jf97WM3a
Try SVE:

```asm
// clang++ -Os -march=armv8-a+sve
fill_i16:                               // @fill_i16
        cbz     x2, .LBB0_3
        cnth    x8
        mov     z0.h, w1
        mov     x10, xzr
        subs    x9, x2, x8
        csel    x9, xzr, x9, lo
        whilelo p0.h, xzr, x2
.LBB0_2:                                // =>This Inner Loop Header: Depth=1
        st1h    { z0.h }, p0, [x0, x10, lsl #1]
        whilelo p0.h, x10, x9
        add     x10, x10, x8
        b.mi    .LBB0_2
.LBB0_3:
        ret
```
@ilinpv suggested this could be in the vectorizer, which would explain why -fno-unroll-loops has no effect.
Yeah, my guess that it was unrolling seems to be wrong.
Considering the SVE & RVV behavior (and also AVX-512), it's presumably trying to emit a predicated tail; but since there is no masked store on SSE2/NEON, it falls back to branchy stores without properly accounting for the cost (and then also happens to do the mask extracts very inefficiently). Adding any load to the body makes it go back to a regular scalar loop, with Compiler Explorer's "Opt viewer" reporting "Missed - the cost-model indicates that vectorization is not beneficial". (For fun, here's an even worse case: AVX2 & int8_t elements make it consider 32 elements/iteration, targeting 256-bit vectors.)
There is a hack in the vectorizer cost model that assigns a very high cost (3000000) to predicated vector load/store operations, but it only applies if there are loads or more than one store; see useEmulatedMaskMemRefHack. So this case, with its single store, falls through the cracks and doesn't get as high a cost as expected. The option -vectorize-num-stores-pred=0 would fix the issue by no longer exempting the single-store case in this ticket. In the long run, I presume we would want to remove the hack entirely.
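To make the "branchy stores" failure mode concrete, here is a rough scalar model of what the vectorizer effectively emits on targets without a masked store: a wide body, plus a tail in which every lane's store is individually guarded by a branch on its mask bit. This is an illustration under assumed names and an assumed width of 8, not the actual generated IR:

```c
#include <stddef.h>
#include <stdint.h>

// Scalar model (hypothetical, for illustration) of a predicated tail
// lowered without masked-store support: the "mask" becomes one branch
// per lane. VF = 8 is an assumption.
void fill_i16_branchy(int16_t *p, int16_t v, size_t n) {
    size_t i = 0;
    // Full "vector" iterations: all 8 lanes unconditionally stored.
    for (; i + 8 <= n; i += 8)
        for (size_t lane = 0; lane < 8; ++lane)
            p[i + lane] = v;
    // Predicated tail: each lane's store guarded by its mask bit.
    for (size_t lane = 0; lane < 8; ++lane)
        if (i + lane < n)   // lane-active check, i.e. the mask extract
            p[i + lane] = v;
}
```

The per-lane branches in the tail are exactly what the cost-model hack is meant to penalize; because this loop has only a single store and no loads, the penalty never kicks in.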
On x86 shouldn't this be turned into a |
Yup, and maybe in general.
Looking at this, it seems there are two problems here:
…lding: When an instruction is scalarized due to tail folding, the cost that we calculate (which is then returned by getWideningCost and thus getInstructionCost) is already the scalarized cost, so computePredInstDiscount shouldn't apply a scalarization discount to these instructions. Fixes llvm#66652
This code, compiled with -Os on either x86-64 or AArch64, leads to bizarre output, e.g. on x86-64. Compiler Explorer: https://godbolt.org/z/9jfYh8Wz8; the first version with this behavior appears to be clang 12. Use of SIMD here (the loop is unrolled, not vectorized) is completely unnecessary; the output of -Oz, which is a simple scalar non-unrolled loop, is ~5x faster in a simple test. The "optimized" IR in question:
IR
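The reporter's source snippet is collapsed in this capture. Judging from the `fill_i16` symbol in the SVE assembly earlier in the thread, it presumably resembles the following simple fill loop; the exact signature and types are an assumption:

```c
#include <stddef.h>
#include <stdint.h>

// Presumed shape of the reported loop, reconstructed from the
// fill_i16 symbol in the posted assembly; details are a guess.
void fill_i16(int16_t *p, int16_t v, size_t n) {
    for (size_t i = 0; i < n; ++i)
        p[i] = v;
}
```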