[X86] movb is extremely expensive on modern processors #62948

Description

@ryao

Lately, I have been trying to micro-optimize a binary search function that operates on 4 KB arrays. Experimenting with loop unrolling yielded three alternative implementations:

https://gcc.godbolt.org/z/79rq7sqcf
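
The exact three variants are in the godbolt link above; for readers skimming, here is a rough C sketch of the general unrolled-switch shape (my reconstruction, not the code from the link; the function name is hypothetical, and it assumes a power-of-two element count such as the 1024-element arrays benchmarked below):

/* Hypothetical reconstruction of the unrolled-switch idea (not the real
 * v1/v2/v3 from the godbolt link). Assumes n is a power of two and n > 1. */
#include <stdint.h>

int32_t binary_search_unroll_switch(const int32_t *a, uint32_t n, int32_t key)
{
    uint32_t low = 0;

    /* Dispatch into the unrolled ladder at the rung matching this size. */
    switch (n) {
    case 1024: if (a[low + 512] <= key) low += 512; /* fall through */
    case  512: if (a[low + 256] <= key) low += 256; /* fall through */
    case  256: if (a[low + 128] <= key) low += 128; /* fall through */
    case  128: if (a[low +  64] <= key) low +=  64; /* fall through */
    case   64: if (a[low +  32] <= key) low +=  32; /* fall through */
    case   32: if (a[low +  16] <= key) low +=  16; /* fall through */
    case   16: if (a[low +   8] <= key) low +=   8; /* fall through */
    case    8: if (a[low +   4] <= key) low +=   4; /* fall through */
    case    4: if (a[low +   2] <= key) low +=   2; /* fall through */
    case    2: if (a[low +   1] <= key) low +=   1; /* fall through */
    default:   break;
    }
    return a[low] == key ? (int32_t)low : -1;
}

Each rung compiles naturally to the branchless lea/cmp/cmov pattern quoted later in this report.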

They should all perform similarly, yet on my Ryzen 7 5800X (Zen 3), they do not:

Benchmark: array size: 1024, runs: 1000, repetitions: 10000, seed: 1685077567, density: 10

Even distribution with 1024 32 bit integers, random access

|                                               Name |      Items |       Hits |     Misses |       Time |
|                                         ---------- | ---------- | ---------- | ---------- | ---------- |
|              custom_binary_search_unroll_switch_v1 |       1000 |        980 |       9020 |   0.000220 |
|              custom_binary_search_unroll_switch_v2 |       1000 |        980 |       9020 |   0.000564 |
|              custom_binary_search_unroll_switch_v3 |       1000 |        980 |       9020 |   0.000567 |

I apply the following patch to the compiler's assembly output, then assemble and link the result:

diff --git a/out.s b/out.s
index f42f44c..24d6940 100644
--- a/out.s
+++ b/out.s
@@ -346,8 +346,8 @@ custom_binary_search_unroll_switch_v2:  # @custom_binary_search_unroll_switch_v2
        bsrl    %eax, %r8d
        movl    %r8d, %eax
        xorl    $31, %eax
-       movb    $30, %cl
-       subb    %al, %cl
+       movl    $30, %ecx
+       subl    %eax, %ecx
        movl    $1, %eax
        shll    %cl, %eax
        leal    -1(%rax), %ecx
@@ -481,8 +481,8 @@ custom_binary_search_unroll_switch_v3:  # @custom_binary_search_unroll_switch_v3
        bsrl    %eax, %eax
        movl    %eax, %r8d
        xorl    $31, %r8d
-       movb    $30, %cl
-       subb    %r8b, %cl
+       movl    $30, %ecx
+       subl    %r8d, %ecx
        movl    $1, %r8d
        shll    %cl, %r8d
        movl    %esi, %r9d
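
For reference, here is my reading of what this prologue computes, as a C sketch (hypothetical helper name; it assumes the register fed to bsrl holds the element count, with n > 1 since bsr and __builtin_clz are undefined for 0). The patched and unpatched forms compute the same value; only the operand width differs:

#include <stdint.h>

/* My reading of the bsrl/xorl/subl/shll sequence above: it derives the
 * initial probe distance for the search (hypothetical helper; n > 1). */
static inline uint32_t initial_step(uint32_t n)
{
    uint32_t bsr = 31u - (uint32_t)__builtin_clz(n); /* bsrl: index of highest set bit */
    /* xorl $31 turns bsr into clz; 30 - clz == bsr - 1; shll builds the power of two */
    return 1u << (bsr - 1);                          /* 512 for n == 1024 */
}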

Now, when I rerun the micro-benchmark, all 3 perform similarly:

Benchmark: array size: 1024, runs: 1000, repetitions: 10000, seed: 1685077784, density: 10

Even distribution with 1024 32 bit integers, random access

|                                               Name |      Items |       Hits |     Misses |       Time |
|                                         ---------- | ---------- | ---------- | ---------- | ---------- |
|              custom_binary_search_unroll_switch_v1 |       1000 |        945 |       9055 |   0.000224 |
|              custom_binary_search_unroll_switch_v2 |       1000 |        945 |       9055 |   0.000206 |
|              custom_binary_search_unroll_switch_v3 |       1000 |        945 |       9055 |   0.000202 |

If I change the patch to replace only subb with subl, leaving the movb in place, performance stays as poor as the unpatched build, which points at movb rather than subb as the culprit.

If I compile with GCC 12.2, I see an even better execution time for the last variant:

Benchmark: array size: 1024, runs: 1000, repetitions: 10000, seed: 1685077837, density: 10

Even distribution with 1024 32 bit integers, random access

|                                               Name |      Items |       Hits |     Misses |       Time |
|                                         ---------- | ---------- | ---------- | ---------- | ---------- |
|              custom_binary_search_unroll_switch_v1 |       1000 |        972 |       9028 |   0.000320 |
|              custom_binary_search_unroll_switch_v2 |       1000 |        972 |       9028 |   0.000340 |
|              custom_binary_search_unroll_switch_v3 |       1000 |        972 |       9028 |   0.000178 |

That is unsurprising, considering that GCC emits fewer instructions for the switch statement body in v3, while it emits many more in v1 and v2. Here is the first case statement GCC emits for v3:

.L54:
        leal    512(%rdx), %ecx             # probe = low + 512
        cmpl    %esi, (%rdi,%rcx,4)         # compare array[probe] with the key
        cmovle  %ecx, %edx                  # low = probe if array[probe] <= key

And the first case statement from LLVM for v3, which needs one more instruction to move the result back into %ecx:

        leal    512(%rcx), %eax             # probe = low + 512
        cmpl    %edx, (%rdi,%rax,4)         # compare array[probe] with the key
        cmovgl  %ecx, %eax                  # keep the old low if array[probe] > key
        movl    %eax, %ecx                  # copy the result back into %ecx

In any case, passing -mtune=znver3 does not stop LLVM from using movb and subb. Whatever optimization pass is opportunistically narrowing these operations to byte width should be made to stop doing that on AMD64: a byte-sized write such as movb $30, %cl replaces only the low 8 bits, so it merges with, and carries a dependency on, the previous contents of %rcx, serializing work that should be independent.

The slightly smaller code size is not worth it when it kills performance on a popular AMD64 processor family. In this case, movb saves 3 bytes over movl, while subb and subl encode to the same length.
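
To isolate the effect, here is a minimal microbenchmark sketch (mine, not the harness used for the numbers above) that times the two sequences back to back in a loop-carried chain. It assumes GCC/Clang extended asm on x86-64; on a Zen 3 part like the 5800X I would expect the byte form to be measurably slower:

/* Sketch: contrast the byte-register and 32-bit forms of the patched
 * sequence in a tight dependency chain. Build with: cc -O2 chain.c */
#include <stdint.h>
#include <stdio.h>
#include <time.h>

static double seconds(struct timespec a, struct timespec b)
{
    return (double)(b.tv_sec - a.tv_sec) + (double)(b.tv_nsec - a.tv_nsec) * 1e-9;
}

int main(void)
{
    const uint64_t iters = 1000000000ULL;
    struct timespec t0, t1;
    uint32_t acc = 1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (uint64_t i = 0; i < iters; i++) {
        /* Byte form: movb writes only %cl, which merges with the stale
         * upper bits of %rcx, so each iteration waits on the last. */
        __asm__ volatile("movb $30, %%cl\n\t"
                         "subb %%al, %%cl\n\t"
                         "shll %%cl, %0"
                         : "+r"(acc) : "a"(5) : "ecx", "cc");
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("movb/subb: %.3f s\n", seconds(t0, t1));

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (uint64_t i = 0; i < iters; i++) {
        /* Dword form: movl overwrites all of %ecx, breaking the chain,
         * so consecutive iterations can overlap. */
        __asm__ volatile("movl $30, %%ecx\n\t"
                         "subl %%eax, %%ecx\n\t"
                         "shll %%cl, %0"
                         : "+r"(acc) : "a"(5) : "ecx", "cc");
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("movl/subl: %.3f s\n", seconds(t0, t1));

    return (int)(acc & 1); /* keep acc live so the loops are not elided */
}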
