
Excessive moving of SIMD registers on x86-64 #81391

Open
dzaima opened this issue Feb 11, 2024 · 1 comment

dzaima commented Feb 11, 2024

Apologies for the unreadable autogenerated C code, but hopefully that doesn't matter much for the issue in question.

The code here generates assembly that contains this excerpt:

        vmovdqu ymmword ptr [rsp + 112], ymm2   # 32-byte Spill
        vmovdqa ymm2, ymm14
        vmovdqa xmm14, xmm1
        vmovdqa ymm1, ymm9
        vmovdqa ymm9, ymm8
        vmovdqa ymm8, ymm6
        vmovdqa ymm6, ymm5
        vmovdqa ymm5, ymm3
        vmovdqa xmm3, xmm12
        vmovq   xmm12, qword ptr [rbx + rax + 8] # xmm12 = mem[0],zero
        vmovdqa ymm4, ymm15
        vmovdqa ymm15, ymm13
        vmovd   xmm13, edx
        vpinsrd xmm13, xmm13, ecx, 1
        vpmaxsd xmm12, xmm12, xmm13
        vmovq   xmm13, qword ptr [rbx + rax]    # xmm13 = mem[0],zero
        vpbroadcastd    xmm0, dword ptr [rip + .LCPI0_17] # xmm0 = [1,1,1,1]
        vpinsrd xmm0, xmm0, ecx, 0
        vpaddd  xmm0, xmm13, xmm0
        vmovdqa ymm13, ymm15
        vmovdqa ymm15, ymm4
        vpunpcklqdq     xmm0, xmm0, xmm12       # xmm0 = xmm0[0],xmm12[0]
        vmovdqa xmm12, xmm3
        vmovdqa ymm3, ymm5
        vmovdqa ymm5, ymm6
        vmovdqa ymm6, ymm8
        vmovdqa ymm8, ymm9
        vmovdqa ymm9, ymm1
        vmovdqa xmm1, xmm14
        vmovdqa ymm14, ymm2
        vmovdqu ymm2, ymmword ptr [rsp + 112]   # 32-byte Reload

That's 9 registers moved forwards and then later back, none of which are used in the code in between (and no, there are no jumps into the middle of this).

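For comparison, a hand-written version of the same computation needs none of that shuffling. Assuming, as the excerpt itself suggests, that ymm4 is a dead scratch register and that the [rsp + 112] spill slot is free, renaming the clobbered temporaries xmm12/xmm13 to xmm2/xmm4 leaves every other register untouched (a sketch, not necessarily what an optimal register allocation would pick):

        vmovdqu ymmword ptr [rsp + 112], ymm2   # 32-byte Spill, as before
        vmovq   xmm2, qword ptr [rbx + rax + 8] # xmm2 = mem[0],zero
        vmovd   xmm4, edx
        vpinsrd xmm4, xmm4, ecx, 1
        vpmaxsd xmm2, xmm2, xmm4
        vmovq   xmm4, qword ptr [rbx + rax]     # xmm4 = mem[0],zero
        vpbroadcastd    xmm0, dword ptr [rip + .LCPI0_17] # xmm0 = [1,1,1,1]
        vpinsrd xmm0, xmm0, ecx, 0
        vpaddd  xmm0, xmm4, xmm0
        vpunpcklqdq     xmm0, xmm0, xmm2        # xmm0 = xmm0[0],xmm2[0]
        vmovdqu ymm2, ymmword ptr [rsp + 112]   # 32-byte Reload

That's the same 11 useful instructions as in the excerpt, minus the 20 vmovdqa round-trip moves.
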
The C code does include an __asm__ statement mutating a __m256i (to prevent simple shuffles from being merged into vpermds, thereby reducing register pressure), but that's in a different place, and its operand is assigned to ymm7, which does not feature in the problematic excerpt (a vmovapd is added within OFENCE_V just to demonstrate this). Nevertheless, replacing OFENCE_V with #define OFENCE_V(X) X gets rid of the problem (perhaps by chance).

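For context, a minimal sketch of what such a fence can look like, assuming a GNU-style statement expression; the actual OFENCE_V in the autogenerated code reportedly also contains the vmovapd marker mentioned above, which is omitted here:

    #include <immintrin.h>

    /* Hypothetical optimization fence over a __m256i: the empty __asm__
     * claims to read and modify X in a vector register ("+x"), so the
     * compiler cannot merge shuffles across it into a vpermd. */
    #define OFENCE_V(X) __extension__({    \
        __m256i ofence_v_ = (X);           \
        __asm__("" : "+x"(ofence_v_));     \
        ofence_v_;                         \
    })

    /* The identity replacement mentioned above, which makes the
     * excessive moves disappear (perhaps by chance): */
    /* #define OFENCE_V(X) (X) */
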
llvmbot commented Feb 11, 2024

@llvm/issue-subscribers-backend-x86

