missed optimization to simple vptest when using libstdc++ std::experimental::simd to detect all-zero pattern on x86-64 avx #123456

ImpleLee · 2025-01-18T08:43:16Z

The following code uses libstdc++ experimental simd, and wants to detect several all-zero patterns that can be easily done with the vptest instructions. All the code is available at https://godbolt.org/z/Kx68E1T6v .

#include <experimental/simd>
#include <cstdint>
namespace stdx = std::experimental;

template <class T, std::size_t N>
using simd_of = stdx::simd<T, stdx::simd_abi::deduce_t<T, N>>;

using data_t = simd_of<std::int32_t, 4>;

bool simple_ptest(data_t x) {
    return all_of(x == 0);
}

bool ptest_and(data_t a, data_t b) {
    return all_of((a & b) == 0);
}

bool ptest_andn(data_t a, data_t b) {
    return all_of((a & ~b) == 0);
}

Equivalent assembly (hand-written):

simple_ptest:
        vptest  %xmm0, %xmm0
        sete    %al
        ret
ptest_and:
        vptest  %xmm0, %xmm1
        sete    %al
        ret
ptest_andn:
        vptest  %xmm0, %xmm1
        setc    %al
        ret

But clang++ generates the following code at -O3 -march=x86-64-v3.

simple_ptest(std::experimental::parallelism_v2::simd<int, std::experimental::parallelism_v2::simd_abi::_VecBuiltin<16>>):
        vpxor   xmm1, xmm1, xmm1
        vpcmpeqd        xmm0, xmm0, xmm1
        vpcmpeqd        xmm1, xmm1, xmm1
        vptest  xmm0, xmm1
        setb    al
        ret

ptest_and(std::experimental::parallelism_v2::simd<int, std::experimental::parallelism_v2::simd_abi::_VecBuiltin<16>>, std::experimental::parallelism_v2::simd<int, std::experimental::parallelism_v2::simd_abi::_VecBuiltin<16>>):
        vpand   xmm0, xmm1, xmm0
        vpxor   xmm1, xmm1, xmm1
        vpcmpeqd        xmm0, xmm0, xmm1
        vpcmpeqd        xmm1, xmm1, xmm1
        vptest  xmm0, xmm1
        setb    al
        ret

ptest_andn(std::experimental::parallelism_v2::simd<int, std::experimental::parallelism_v2::simd_abi::_VecBuiltin<16>>, std::experimental::parallelism_v2::simd<int, std::experimental::parallelism_v2::simd_abi::_VecBuiltin<16>>):
        vpandn  xmm0, xmm1, xmm0
        vpxor   xmm1, xmm1, xmm1
        vpcmpeqd        xmm0, xmm0, xmm1
        vpcmpeqd        xmm1, xmm1, xmm1
        vptest  xmm0, xmm1
        setb    al
        ret

reference: the same issue at gcc bugzilla (identified as a duplicate to another missed optimization).

The text was updated successfully, but these errors were encountered:

llvmbot · 2025-01-18T11:10:06Z

@llvm/issue-subscribers-backend-x86

Author: Imple Lee (ImpleLee)

The following code uses libstdc++ experimental simd, and wants to detect several all-zero patterns that can be easily done with the vptest instructions. All the code is available at https://godbolt.org/z/Kx68E1T6v .

#include &lt;experimental/simd&gt;
#include &lt;cstdint&gt;
namespace stdx = std::experimental;

template &lt;class T, std::size_t N&gt;
using simd_of = stdx::simd&lt;T, stdx::simd_abi::deduce_t&lt;T, N&gt;&gt;;

using data_t = simd_of&lt;std::int32_t, 4&gt;;

bool simple_ptest(data_t x) {
    return all_of(x == 0);
}

bool ptest_and(data_t a, data_t b) {
    return all_of((a &amp; b) == 0);
}

bool ptest_andn(data_t a, data_t b) {
    return all_of((a &amp; ~b) == 0);
}

Equivalent assembly (hand-written):

simple_ptest:
        vptest  %xmm0, %xmm0
        sete    %al
        ret
ptest_and:
        vptest  %xmm0, %xmm1
        sete    %al
        ret
ptest_andn:
        vptest  %xmm0, %xmm1
        setc    %al
        ret

But clang++ generates the following code at -O3 -march=x86-64-v3.

simple_ptest(std::experimental::parallelism_v2::simd&lt;int, std::experimental::parallelism_v2::simd_abi::_VecBuiltin&lt;16&gt;&gt;):
        vpxor   xmm1, xmm1, xmm1
        vpcmpeqd        xmm0, xmm0, xmm1
        vpcmpeqd        xmm1, xmm1, xmm1
        vptest  xmm0, xmm1
        setb    al
        ret

ptest_and(std::experimental::parallelism_v2::simd&lt;int, std::experimental::parallelism_v2::simd_abi::_VecBuiltin&lt;16&gt;&gt;, std::experimental::parallelism_v2::simd&lt;int, std::experimental::parallelism_v2::simd_abi::_VecBuiltin&lt;16&gt;&gt;):
        vpand   xmm0, xmm1, xmm0
        vpxor   xmm1, xmm1, xmm1
        vpcmpeqd        xmm0, xmm0, xmm1
        vpcmpeqd        xmm1, xmm1, xmm1
        vptest  xmm0, xmm1
        setb    al
        ret

ptest_andn(std::experimental::parallelism_v2::simd&lt;int, std::experimental::parallelism_v2::simd_abi::_VecBuiltin&lt;16&gt;&gt;, std::experimental::parallelism_v2::simd&lt;int, std::experimental::parallelism_v2::simd_abi::_VecBuiltin&lt;16&gt;&gt;):
        vpandn  xmm0, xmm1, xmm0
        vpxor   xmm1, xmm1, xmm1
        vpcmpeqd        xmm0, xmm0, xmm1
        vpcmpeqd        xmm1, xmm1, xmm1
        vptest  xmm0, xmm1
        setb    al
        ret

reference: the same issue at gcc bugzilla (identified as a duplicate to another missed optimization).

RKSimon · 2025-01-18T12:38:28Z

Should be relatively trivial to add any missing patterns to combinePTESTCC

ImpleLee · 2025-01-18T16:14:19Z

I am not pretty sure the assembly I wrote by hand is 100% correct (the argument order etc). Make sure to double check!

Simplifies the hidden "all_of(X == 0)" pattern Fixes llvm#123456

RKSimon · 2025-01-18T16:16:40Z

@ImpleLee Yes, I found the commutation typo :)

llvmbot added the new issue label Jan 18, 2025

abhishek-kaushik22 added backend:X86 missed-optimization and removed new issue labels Jan 18, 2025

RKSimon added a commit that referenced this issue Jan 18, 2025

[X86] Add test coverage for #123456

fcedf98

RKSimon self-assigned this Jan 18, 2025

RKSimon added a commit to RKSimon/llvm-project that referenced this issue Jan 18, 2025

[X86] combinePTESTCC - fold PTESTC(PCMPEQ(X,0),-1) == PTESTZ(X,X)

c5941fe

Simplifies the hidden "all_of(X == 0)" pattern Fixes llvm#123456

RKSimon linked a pull request Jan 18, 2025 that will close this issue

[X86] combinePTESTCC - fold PTESTC(PCMPEQ(X,0),-1) == PTESTZ(X,X) #123466

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

missed optimization to simple vptest when using libstdc++ std::experimental::simd to detect all-zero pattern on x86-64 avx #123456

missed optimization to simple vptest when using libstdc++ std::experimental::simd to detect all-zero pattern on x86-64 avx #123456

ImpleLee commented Jan 18, 2025

llvmbot commented Jan 18, 2025

RKSimon commented Jan 18, 2025

ImpleLee commented Jan 18, 2025

RKSimon commented Jan 18, 2025

missed optimization to simple vptest when using libstdc++ std::experimental::simd to detect all-zero pattern on x86-64 avx #123456

missed optimization to simple vptest when using libstdc++ std::experimental::simd to detect all-zero pattern on x86-64 avx #123456

Comments

ImpleLee commented Jan 18, 2025

llvmbot commented Jan 18, 2025

RKSimon commented Jan 18, 2025

ImpleLee commented Jan 18, 2025

RKSimon commented Jan 18, 2025