
[AArch64] boolean or+sext can be done with addhn #125611

Open

dzaima opened this issue Feb 4, 2025 · 8 comments

Comments

dzaima commented Feb 4, 2025

https://godbolt.org/z/n3edx6Ko3

This function:

#include <stdbool.h>
#include <arm_neon.h>
bool foo(float64x2_t x, float64x2_t y) {
  uint32x2_t any_zeroes = vaddhn_u64(
    vceqzq_f64(x),
    vceqzq_f64(y)
  );
  return vget_lane_f64((float64x1_t)any_zeroes, 0) != 0;
}

compiles as:

foo:
        fcmeq   v0.2d, v0.2d, #0.0
        fcmeq   v1.2d, v1.2d, #0.0
        orr     v0.16b, v0.16b, v1.16b
        xtn     v0.2s, v0.2d
        fcmp    d0, #0.0
        cset    w0, ne
        ret

although using addhn is better, since it folds the orr and xtn into a single instruction:

foo:
        fcmeq   v0.2d, v0.2d, #0.0
        fcmeq   v1.2d, v1.2d, #0.0
        addhn   v1.2s, v0.2d, v1.2d
        fcmp    d1, #0.0
        cset    w0, ne
        ret

The LLVM IR after optimization:

define dso_local i1 @foo(<2 x double> noundef %x, <2 x double> noundef %y) local_unnamed_addr {
entry:
  %0 = fcmp oeq <2 x double> %x, zeroinitializer
  %1 = fcmp oeq <2 x double> %y, zeroinitializer
  %2 = or <2 x i1> %0, %1
  %vaddhn2.i = sext <2 x i1> %2 to <2 x i32>
  %3 = bitcast <2 x i32> %vaddhn2.i to <1 x double>
  %vget_lane = extractelement <1 x double> %3, i64 0
  %cmp = fcmp une double %vget_lane, 0.000000e+00
  ret i1 %cmp
}
llvmbot (Member) commented Feb 4, 2025

@llvm/issue-subscribers-backend-aarch64

Author: dzaima (dzaima)

john-brawn-arm (Collaborator) commented

It looks like what's happening is:

  • In clang/lib/CodeGen/CGBuiltin.cpp, the vaddhn is generated as "add, lshr, trunc", and the vceqz is generated as "fcmp, sext"
  • InstCombine then combines the sext with the "add, lshr, trunc" sequence to form "or, sext" (see the scalar sketch below)
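
A scalar model of that expansion for one 64-bit lane may make the fold easier to see (illustrative only; the names are made up and this is not the actual clang output):

static int lane_model(double x, double y) {
  long long sx = -(long long)(x == 0.0);  /* vceqz: fcmp, then sext i1 -> i64 (0 or -1) */
  long long sy = -(long long)(y == 0.0);
  long long sum = sx + sy;                /* vaddhn: add ... */
  int narrow = (int)((unsigned long long)sum >> 32);  /* ... lshr 32, trunc to i32 */
  /* Since sx and sy are each 0 or -1, the high 32 bits of sx + sy equal
     the high 32 bits of sx | sy, so InstCombine folds the whole sequence
     to an "or" of the two compares followed by a single sext. */
  return narrow;
}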

john-brawn-arm (Collaborator) commented

This looks like a case where a mismatch between how we represent things in IR and how the instruction set does things leads a target-independent pass to do something suboptimal. IR has vector fcmp instructions returning a vector of i1, but the fcmeq instruction's output register is the same size as its input registers, so this gets represented in IR as fcmp then sext. Similarly, there's no direct equivalent of addhn in IR, so it is represented as an equivalent instruction sequence.

The result is that InstCombine appears to be removing instructions, but after instruction selection we end up with more of them. I don't know what the best way to solve this is; perhaps we need to somehow undo the transformation during instruction selection and recognize that "or, sext" can be generated as addhn in some situations.

john-brawn-arm (Collaborator) commented

Similar code, without using vector intrinsics, that triggers the same InstCombine transformation:

int bar(double x, double y) {
  /* compare, then sign-extend to a 0 or -1 mask via the shift pair */
  signed long xcmp = ((signed long)(x==0) << 63) >> 63;
  signed long ycmp = ((signed long)(y==0) << 63) >> 63;
  /* add the masks and test the high half, as vaddhn would */
  return ((xcmp + ycmp) >> 32) != 0;
}

sivan-shani (Contributor) commented

Explanation by @ostannard of why both sequences are correct, even though one uses 'or'+'narrow' and the other 'add'+'narrow':

  • Those fcmeq instructions do not have the same operands for both inputs; they compare the second assembly operand against the third (the immediate 0.0) and write the result to the first
  • The fcmeq instruction writes a vector of two 64-bit lanes, each of which is either all-zero or all-one
  • The vaddhn adds those all-zero/all-one lanes, so each intermediate lane is all-zero if both operand lanes were zero, and either all-one or 111...110 otherwise. The instruction then takes the top half of each intermediate lane and places it in a 32-bit lane of the result register, so we're back to all-zero or all-one
  • The xtn instruction does the narrowing step in the orr version, giving the same 32-bit lanes as the vaddhn (see the worked example below)
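
A worked example of that lane arithmetic, as a standalone check (the helper name is made up):

#include <stdint.h>
#include <assert.h>

/* Models one lane: after fcmeq, each 64-bit lane is 0 or ~0. */
static uint32_t high_half_of_sum(uint64_t a, uint64_t b) {
  return (uint32_t)((a + b) >> 32);  /* the part addhn keeps */
}

int main(void) {
  uint64_t Z = 0, M = ~0ULL;                     /* all-zero / all-one masks */
  assert(high_half_of_sum(Z, Z) == 0);           /* 0 + 0  -> 0              */
  assert(high_half_of_sum(Z, M) == 0xFFFFFFFFu); /* 0 + ~0 -> ~0             */
  assert(high_half_of_sum(M, M) == 0xFFFFFFFFu); /* ~0 + ~0 = 111...110, top
                                                    half still all-ones      */
  /* In every case the high half of a + b matches the high half of a | b,
     which is why the or+xtn and addhn sequences agree on these inputs. */
  return 0;
}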

dzaima (Author) commented Feb 7, 2025

I don't know what standards LLVM has for stuff like this, but I'd imagine that having architecture-specific LLVM intrinsics for addhn/subhn, instead of trying to expand them to basic LLVM IR, would be fine; I can't imagine much benefit from general optimizations for people using the intrinsic.

With that, I can't imagine much intrinsic-less code actually doing what my original code does; if anything, the useful thing would be generally using addhn + fcmp #0.0 as a check for "is any element of two vectors true" (the prerequisite being vectors with ≥16-bit bit-homogeneous elements; my original example uses 64-bit elements); see the sketch below. (comparison vs autovectorized; alive2; the Lemire blog post that started me looking into this)
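
For example, a minimal sketch of that idiom for 32-bit lane masks (the helper name is made up; the final check here uses an integer compare rather than fcmp):

#include <arm_neon.h>
#include <stdbool.h>

/* "Is any lane of either mask set?", assuming each lane of a and b is
   all-zero or all-one (e.g. the result of a vceqq / vcgtq compare). */
static bool any_lane_set(uint32x4_t a, uint32x4_t b) {
  uint16x4_t hn = vaddhn_u32(a, b);  /* per lane: high half of the sum, 0 or ~0 */
  return vget_lane_u64(vreinterpret_u64_u16(hn), 0) != 0;
}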

(On an unrelated note in the reverse direction: it seems vqaddq_s8 & co. map to @llvm.aarch64.neon.sqadd.v16i8, whereas LLVM has a native @llvm.sadd.sat.v16i8, which I'd think should have the same behavior; maybe worth a separate issue?)

davemgreen (Collaborator) commented

We probably just need to undo this in the backend, I would expect; it will need to use the number of sign bits to be sure the transform back is valid.

Using an fcmp unfortunately isn't valid with denormal flushing; see #115713, where we tried that a little while ago. That makes it more difficult for the compiler to use in general.
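
A concrete illustration of the flushing hazard (a standalone sketch; the bit pattern is one the narrowed mask register can actually hold):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
  /* Narrowed mask with only the low 32-bit lane set. */
  uint64_t bits = 0x00000000FFFFFFFFULL;
  double d;
  memcpy(&d, &bits, sizeof d);
  /* The exponent field is all zero, so this is a subnormal double. Under
     denormal flushing, an fcmp against #0.0 treats it as zero, and the
     "any lane set?" check would wrongly report false. */
  printf("%a\n", d);  /* prints a tiny subnormal value */
  return 0;
}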

dzaima (Author) commented Feb 8, 2025

Ah yeah, having to consider denormal flushing messes with that in general; x86 similarly had this attempt. :/

dzaima added a commit to dzaima/CBQN that referenced this issue Feb 8, 2025
but clang undoes this sometimes :/ llvm/llvm-project#125611