[AArch64] boolean or+sext can be done with addhn #125611
Comments
@llvm/issue-subscribers-backend-aarch64 Author: dzaima (dzaima)
https://godbolt.org/z/n3edx6Ko3
This function:

    #include<stdbool.h>
    #include<arm_neon.h>
    bool foo(float64x2_t x, float64x2_t y) {
      uint32x2_t any_zeroes = vaddhn_u64(
        vceqzq_f64(x),
        vceqzq_f64(y)
      );
      return vget_lane_f64((float64x1_t)any_zeroes, 0) != 0;
    }

compiles as:

    foo:
            fcmeq   v0.2d, v0.2d, #0.0
            fcmeq   v1.2d, v1.2d, #0.0
            orr     v0.16b, v0.16b, v1.16b
            xtn     v0.2s, v0.2d
            fcmp    d0, #0.0
            cset    w0, ne
            ret

although the addhn is better:

    foo:
            fcmeq   v0.2d, v0.2d, 0
            fcmeq   v1.2d, v1.2d, 0
            addhn   v1.2s, v0.2d, v1.2d
            fcmp    d1, #0.0
            cset    w0, ne
            ret

The corresponding LLVM IR:

    define dso_local i1 @foo(<2 x double> noundef %x, <2 x double> noundef %y) local_unnamed_addr {
    entry:
      %0 = fcmp oeq <2 x double> %x, zeroinitializer
      %1 = fcmp oeq <2 x double> %y, zeroinitializer
      %2 = or <2 x i1> %0, %1
      %vaddhn2.i = sext <2 x i1> %2 to <2 x i32>
      %3 = bitcast <2 x i32> %vaddhn2.i to <1 x double>
      %vget_lane = extractelement <1 x double> %3, i64 0
      %cmp = fcmp une double %vget_lane, 0.000000e+00
      ret i1 %cmp
    }
It looks like what's happening is:
This looks like a case where a mismatch between how we represent things in IR and how the instruction set does things leads to a target-independent pass doing something suboptimal. IR has vector fcmp instructions returning a vector of i1, but the fcmeq instruction has an output register of the same size as its input register, so this gets represented in IR as fcmp then sext. Similarly, there's no direct equivalent to addhn in IR, so it is represented as an equivalent instruction sequence. The result is that in InstCombine it looks like we're removing instructions, but we actually end up with more instructions. I don't know what the best way to solve this is; perhaps we need to somehow undo this in instruction selection and realize that "or sext" can be generated as "addhn" in some situations.
Similar code without using vector intrinsics that triggers the same InstCombine transformation:
Explanation by @ostannard of why both versions are correct even though one uses 'or' + 'narrow' and the other uses 'add' + 'narrow'
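As a rough scalar sketch of why the two forms can agree (this is an illustration, not the explanation referenced above, and the helper names are made up): a comparison lane is either all zeros or all ones, so the low half of the OR and the high half of the wrapped sum come out the same. The code models a single 64-bit lane in plain C rather than using the actual NEON instructions.

```c
#include <stdint.h>

/* Scalar model of one ADDHN lane: add two 64-bit values (wrapping),
   then keep the HIGH 32 bits of the sum. */
static uint32_t addhn_lane(uint64_t a, uint64_t b) {
    return (uint32_t)((a + b) >> 32);
}

/* Scalar model of one ORR + XTN lane: OR two 64-bit values,
   then keep the LOW 32 bits of the result. */
static uint32_t orr_xtn_lane(uint64_t a, uint64_t b) {
    return (uint32_t)(a | b);
}

/* Returns 1 if both models agree for every combination of comparison
   masks (each input is either all zeros or all ones), 0 otherwise. */
static int masks_agree(void) {
    const uint64_t masks[2] = {0, UINT64_MAX};
    for (int i = 0; i < 2; ++i) {
        for (int j = 0; j < 2; ++j) {
            if (addhn_lane(masks[i], masks[j]) != orr_xtn_lane(masks[i], masks[j])) {
                return 0;
            }
        }
    }
    return 1;
}
```

The only case that differs from a plain OR is all-ones plus all-ones, which wraps to `0xFFFFFFFFFFFFFFFE`; its high half is still all ones. For inputs that are not masks (for example `1` and `0`), the two models give different results.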
Don't know what standards LLVM has for stuff like this, but I'd imagine that having an architecture-specific LLVM intrinsics for With that, I can't imagine much intrinsic-less code actually doing what my original code does; if anything, the useful thing would be generally using (on an unrelated note of the reverse, seems
I would expect we probably just need to undo this in the backend; it will just need to use the number of sign bits to be sure it is valid to transform back. Using an fcmp unfortunately isn't valid with denormal flushing; see #115713 for where we tried that a little while ago. It makes it more difficult to use in general in the compiler.
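The sign-bit condition can be sketched in scalar form. This is a toy model under stated assumptions, not how the backend implements it: it uses 8-bit lanes narrowed to 4 bits so the check can be exhaustive, and all helper names are hypothetical. The point is that "or then narrow" and "add then take the high half" agree whenever every bit of both inputs is a copy of the sign bit, and can disagree otherwise.

```c
#include <stdint.h>

/* Number of leading bits equal to the sign bit, counting the sign bit
   itself. A lane that is all zeros or all ones yields 8. */
static int num_sign_bits8(uint8_t v) {
    int sign = (v >> 7) & 1;
    int n = 0;
    for (int bit = 7; bit >= 0; --bit) {
        if (((v >> bit) & 1) != sign) {
            break;
        }
        ++n;
    }
    return n;
}

/* OR, then keep the low 4 bits (toy stand-in for orr + xtn). */
static unsigned or_then_narrow(uint8_t a, uint8_t b) {
    return (unsigned)(a | b) & 0x0Fu;
}

/* Add with 8-bit wraparound, then keep the high 4 bits (toy stand-in for addhn). */
static unsigned add_high_narrow(uint8_t a, uint8_t b) {
    return (((unsigned)a + (unsigned)b) >> 4) & 0x0Fu;
}

/* Count input pairs where the two forms differ. When require_all_sign_bits
   is nonzero, only pairs whose lanes are entirely sign bits are considered. */
static int count_mismatches(int require_all_sign_bits) {
    int mismatches = 0;
    for (unsigned a = 0; a < 256; ++a) {
        for (unsigned b = 0; b < 256; ++b) {
            if (require_all_sign_bits &&
                (num_sign_bits8((uint8_t)a) != 8 || num_sign_bits8((uint8_t)b) != 8)) {
                continue;
            }
            if (or_then_narrow((uint8_t)a, (uint8_t)b) !=
                add_high_narrow((uint8_t)a, (uint8_t)b)) {
                ++mismatches;
            }
        }
    }
    return mismatches;
}
```

With the guard enabled there are no mismatches; without it, pairs such as `a = 1, b = 0` differ, which is why some check on the inputs would be needed before transforming back.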
Ah yeah, having to consider denormal flushing messes with that in general; x86 had this attempt similarly. :/
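To make the denormal concern concrete, here is a small sketch that classifies the 64-bit pattern formed by the two narrowed 32-bit lanes when it is reinterpreted as a double. It assumes IEEE-754 binary64 doubles and that lane 0 occupies the low 32 bits; the helper names are made up for illustration.

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret 64 bits as a double, similar in spirit to the
   (float64x1_t) cast in the original function. */
static double bits_to_double(uint64_t bits) {
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Classify the double formed by two 32-bit mask lanes.
   Assumption: lane 0 is the low half, lane 1 is the high half. */
static int classify_lanes(uint32_t lane0, uint32_t lane1) {
    uint64_t bits = ((uint64_t)lane1 << 32) | (uint64_t)lane0;
    return fpclassify(bits_to_double(bits));
}
```

When only lane 0 is set, the pattern `0x00000000FFFFFFFF` is a subnormal: nonzero under normal IEEE comparison, but it would compare equal to zero if denormal inputs are flushed. When lane 1 is set, the pattern is a NaN, which compares unequal to zero either way.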
but clang undoes this sometimes :/ llvm/llvm-project#125611