Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

missed optimization to simple vptest when using libstdc++ std::experimental::simd to detect all-zero pattern on x86-64 avx #123456

Open
ImpleLee opened this issue Jan 18, 2025 · 4 comments · May be fixed by #123466

Comments

@ImpleLee
Copy link

The following code uses libstdc++ experimental simd, and wants to detect several all-zero patterns that can be easily done with the vptest instructions. All the code is available at https://godbolt.org/z/Kx68E1T6v .

#include <experimental/simd>
#include <cstdint>
namespace stdx = std::experimental;

template <class T, std::size_t N>
using simd_of = stdx::simd<T, stdx::simd_abi::deduce_t<T, N>>;

using data_t = simd_of<std::int32_t, 4>;

bool simple_ptest(data_t x) {
    return all_of(x == 0);
}

bool ptest_and(data_t a, data_t b) {
    return all_of((a & b) == 0);
}

bool ptest_andn(data_t a, data_t b) {
    return all_of((a & ~b) == 0);
}

Equivalent assembly (hand-written):

simple_ptest:
        vptest  %xmm0, %xmm0
        sete    %al
        ret
ptest_and:
        vptest  %xmm0, %xmm1
        sete    %al
        ret
ptest_andn:
        vptest  %xmm0, %xmm1
        setc    %al
        ret

But clang++ generates the following code at -O3 -march=x86-64-v3.

simple_ptest(std::experimental::parallelism_v2::simd<int, std::experimental::parallelism_v2::simd_abi::_VecBuiltin<16>>):
        vpxor   xmm1, xmm1, xmm1
        vpcmpeqd        xmm0, xmm0, xmm1
        vpcmpeqd        xmm1, xmm1, xmm1
        vptest  xmm0, xmm1
        setb    al
        ret

ptest_and(std::experimental::parallelism_v2::simd<int, std::experimental::parallelism_v2::simd_abi::_VecBuiltin<16>>, std::experimental::parallelism_v2::simd<int, std::experimental::parallelism_v2::simd_abi::_VecBuiltin<16>>):
        vpand   xmm0, xmm1, xmm0
        vpxor   xmm1, xmm1, xmm1
        vpcmpeqd        xmm0, xmm0, xmm1
        vpcmpeqd        xmm1, xmm1, xmm1
        vptest  xmm0, xmm1
        setb    al
        ret

ptest_andn(std::experimental::parallelism_v2::simd<int, std::experimental::parallelism_v2::simd_abi::_VecBuiltin<16>>, std::experimental::parallelism_v2::simd<int, std::experimental::parallelism_v2::simd_abi::_VecBuiltin<16>>):
        vpandn  xmm0, xmm1, xmm0
        vpxor   xmm1, xmm1, xmm1
        vpcmpeqd        xmm0, xmm0, xmm1
        vpcmpeqd        xmm1, xmm1, xmm1
        vptest  xmm0, xmm1
        setb    al
        ret

reference: the same issue at gcc bugzilla (identified as a duplicate to another missed optimization).

@llvmbot
Copy link
Member

llvmbot commented Jan 18, 2025

@llvm/issue-subscribers-backend-x86

Author: Imple Lee (ImpleLee)

The following code uses libstdc++ experimental simd, and wants to detect several all-zero patterns that can be easily done with the vptest instructions. All the code is available at https://godbolt.org/z/Kx68E1T6v .
#include &lt;experimental/simd&gt;
#include &lt;cstdint&gt;
namespace stdx = std::experimental;

template &lt;class T, std::size_t N&gt;
using simd_of = stdx::simd&lt;T, stdx::simd_abi::deduce_t&lt;T, N&gt;&gt;;

using data_t = simd_of&lt;std::int32_t, 4&gt;;

bool simple_ptest(data_t x) {
    return all_of(x == 0);
}

bool ptest_and(data_t a, data_t b) {
    return all_of((a &amp; b) == 0);
}

bool ptest_andn(data_t a, data_t b) {
    return all_of((a &amp; ~b) == 0);
}

Equivalent assembly (hand-written):

simple_ptest:
        vptest  %xmm0, %xmm0
        sete    %al
        ret
ptest_and:
        vptest  %xmm0, %xmm1
        sete    %al
        ret
ptest_andn:
        vptest  %xmm0, %xmm1
        setc    %al
        ret

But clang++ generates the following code at -O3 -march=x86-64-v3.

simple_ptest(std::experimental::parallelism_v2::simd&lt;int, std::experimental::parallelism_v2::simd_abi::_VecBuiltin&lt;16&gt;&gt;):
        vpxor   xmm1, xmm1, xmm1
        vpcmpeqd        xmm0, xmm0, xmm1
        vpcmpeqd        xmm1, xmm1, xmm1
        vptest  xmm0, xmm1
        setb    al
        ret

ptest_and(std::experimental::parallelism_v2::simd&lt;int, std::experimental::parallelism_v2::simd_abi::_VecBuiltin&lt;16&gt;&gt;, std::experimental::parallelism_v2::simd&lt;int, std::experimental::parallelism_v2::simd_abi::_VecBuiltin&lt;16&gt;&gt;):
        vpand   xmm0, xmm1, xmm0
        vpxor   xmm1, xmm1, xmm1
        vpcmpeqd        xmm0, xmm0, xmm1
        vpcmpeqd        xmm1, xmm1, xmm1
        vptest  xmm0, xmm1
        setb    al
        ret

ptest_andn(std::experimental::parallelism_v2::simd&lt;int, std::experimental::parallelism_v2::simd_abi::_VecBuiltin&lt;16&gt;&gt;, std::experimental::parallelism_v2::simd&lt;int, std::experimental::parallelism_v2::simd_abi::_VecBuiltin&lt;16&gt;&gt;):
        vpandn  xmm0, xmm1, xmm0
        vpxor   xmm1, xmm1, xmm1
        vpcmpeqd        xmm0, xmm0, xmm1
        vpcmpeqd        xmm1, xmm1, xmm1
        vptest  xmm0, xmm1
        setb    al
        ret

reference: the same issue at gcc bugzilla (identified as a duplicate to another missed optimization).

@RKSimon
Copy link
Collaborator

RKSimon commented Jan 18, 2025

Should be relatively trivial to add any missing patterns to combinePTESTCC

RKSimon added a commit that referenced this issue Jan 18, 2025
@ImpleLee
Copy link
Author

I am not pretty sure the assembly I wrote by hand is 100% correct (the argument order etc). Make sure to double check!

@RKSimon RKSimon self-assigned this Jan 18, 2025
RKSimon added a commit to RKSimon/llvm-project that referenced this issue Jan 18, 2025
Simplifies the hidden "all_of(X == 0)" pattern

Fixes llvm#123456
@RKSimon
Copy link
Collaborator

RKSimon commented Jan 18, 2025

@ImpleLee Yes, I found the commutation typo :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants