perf(simd): avx2 fallback to swar instead of sse4.2 #181

Merged
merged 1 commit into from
Sep 3, 2024

@AaronO AaronO commented Sep 2, 2024

TLDR: 2-line change => ~2x faster req/req (1.94x in the benchmarks below)
(this doesn't raise the performance ceiling substantially, but it fixes a perf issue in the generic x64 build, which uses runtime dispatch)

This has massive implications for the perf of the default simd::runtime::* path (the generic x64 build), improving how the code is lowered/inlined. (Falling back to SSE4.2 for a handful of bytes was wasteful.)

Should supersede #175, #156
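
As a rough illustration of the SWAR ("SIMD within a register") idea that the AVX2 path now falls back to: the sketch below scans 8 bytes at a time through a plain u64, then finishes the last few bytes one at a time. It is not the crate's actual code; `find_byte_swar` and its constants are hypothetical names chosen for this example, and it only searches for a single byte value.

```rust
/// Hypothetical helper (not part of httparse): returns the index of the
/// first byte equal to `needle`, scanning 8 bytes per step via a u64.
fn find_byte_swar(buf: &[u8], needle: u8) -> Option<usize> {
    const LO: u64 = 0x0101_0101_0101_0101; // 0x01 in every byte lane
    const HI: u64 = 0x8080_8080_8080_8080; // 0x80 in every byte lane

    // Broadcast the needle into all 8 lanes (0x01..01 * 255 still fits in a u64).
    let pattern = LO * needle as u64;

    let mut i = 0;
    while i + 8 <= buf.len() {
        let mut word = [0u8; 8];
        word.copy_from_slice(&buf[i..i + 8]);

        // After the XOR, lanes that matched the needle are zero.
        let x = u64::from_le_bytes(word) ^ pattern;

        // Classic "has zero byte" trick: sets the high bit of each zero lane.
        // Borrows only travel upward, so the lowest flagged lane is exact.
        let hit = x.wrapping_sub(LO) & !x & HI;
        if hit != 0 {
            return Some(i + (hit.trailing_zeros() / 8) as usize);
        }
        i += 8;
    }

    // Fewer than 8 bytes left: plain scalar loop.
    buf[i..].iter().position(|&b| b == needle).map(|p| i + p)
}
```

For example, `find_byte_swar(b"GET /index.html HTTP/1.1", b' ')` returns `Some(3)`. The point is that no vector registers or feature detection are involved, so this kind of tail handling is cheap for the compiler to inline next to the wide AVX2 loop.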

Benchmarks on GH CodeSpace (4-core / 16GB)

(4 cores of a 64-core AMD EPYC 7763 host CPU)

> critcmp -t=5 main avx2swar
group                    avx2swar                               main
-----                    --------                               ----
header/count_001         1.00     21.8±3.10ns   349.2 MB/sec    6.14   134.0±11.80ns    56.9 MB/sec
header/count_002         1.00     31.9±4.18ns   418.7 MB/sec    7.95    253.5±8.07ns    52.7 MB/sec
header/count_004         1.00     49.8±4.39ns   498.2 MB/sec    10.08   501.9±4.58ns    49.4 MB/sec
header/count_008         1.00    94.1±15.11ns   506.8 MB/sec    7.04   662.1±15.47ns    72.0 MB/sec
header/count_016         1.00   186.2±21.89ns   501.9 MB/sec    3.95   735.3±12.50ns   127.1 MB/sec
header/count_032         1.00   349.9±37.92ns   528.7 MB/sec    2.58   902.5±28.32ns   205.0 MB/sec
header/count_064         1.00   655.4±29.35ns   561.7 MB/sec    2.02  1322.6±186.03ns   278.3 MB/sec
header/count_128         1.00  1281.6±66.01ns   573.0 MB/sec    1.43  1830.4±43.23ns   401.2 MB/sec
header/name_0001b        1.00     23.0±2.95ns   332.4 MB/sec    6.20   142.3±11.68ns    53.6 MB/sec
header/name_0002b        1.00     23.1±3.12ns   372.2 MB/sec    5.73    132.2±4.84ns    64.9 MB/sec
header/name_0004b        1.00     22.9±2.51ns   457.8 MB/sec    5.80    133.0±2.89ns    78.9 MB/sec
header/name_0008b        1.00     22.9±2.82ns   625.5 MB/sec    5.82    133.1±3.36ns   107.4 MB/sec
header/name_0016b        1.00     23.3±2.47ns   939.6 MB/sec    5.82    135.8±3.53ns   161.5 MB/sec
header/name_0032b        1.00     28.0±2.29ns  1327.5 MB/sec    5.00    140.0±4.22ns   265.6 MB/sec
header/name_0064b        1.00    44.8±11.81ns  1512.9 MB/sec    3.33    149.1±4.62ns   454.2 MB/sec
header/name_0128b        1.00     56.6±3.79ns     2.2 GB/sec    2.96    167.8±5.86ns   767.4 MB/sec
header/name_0256b        1.00     94.8±4.39ns     2.6 GB/sec    2.21   209.2±14.19ns  1198.8 MB/sec
header/name_0512b        1.00   182.7±10.37ns     2.6 GB/sec    1.62   296.5±27.59ns  1669.6 MB/sec
header/name_1024b        1.00   336.9±15.06ns     2.8 GB/sec    1.34   450.5±49.04ns     2.1 GB/sec
header/name_2048b        1.00   648.7±26.80ns     3.0 GB/sec    1.27  822.0±149.32ns     2.3 GB/sec
header/name_4096b        1.00  1280.2±78.67ns     3.0 GB/sec    1.06  1362.2±70.01ns     2.8 GB/sec
header/value_0001b       1.00     23.4±6.47ns   326.1 MB/sec    5.74    134.2±4.17ns    56.8 MB/sec
header/value_0002b       1.00     21.7±2.94ns   396.0 MB/sec    6.12    132.7±3.34ns    64.7 MB/sec
header/value_0004b       1.00     21.8±3.60ns   481.5 MB/sec    6.26    136.4±5.54ns    76.9 MB/sec
header/value_0008b       1.00     21.6±3.09ns   661.7 MB/sec    6.44   139.2±11.89ns   102.8 MB/sec
header/value_0016b       1.00     23.5±5.80ns   932.0 MB/sec    5.74    135.0±1.77ns   162.5 MB/sec
header/value_0064b       1.00     21.9±3.56ns     3.0 GB/sec    1.09     23.9±3.85ns     2.8 GB/sec
header/value_0128b       1.00     23.3±3.55ns     5.4 GB/sec    1.06     24.7±3.46ns     5.1 GB/sec
header/value_0512b       1.00    46.3±12.30ns    10.4 GB/sec    1.13    52.4±20.00ns     9.2 GB/sec
header/value_1024b       1.00     54.5±5.73ns    17.6 GB/sec    1.29    70.5±23.99ns    13.6 GB/sec
header/value_2048b       1.00     96.1±8.39ns    19.9 GB/sec    1.12    107.3±6.35ns    17.8 GB/sec
header/value_4096b       1.00   178.1±40.06ns    21.5 GB/sec    1.18   209.7±50.09ns    18.2 GB/sec
req/req                  1.00    168.5±6.15ns     4.0 GB/sec    1.94   327.5±70.03ns     2.0 GB/sec
req_short/req_short      1.00    63.7±15.68ns  1017.9 MB/sec    2.61    166.3±2.30ns   390.0 MB/sec
resp/resp                1.00    190.2±7.97ns     3.4 GB/sec    1.55    294.1±3.77ns     2.2 GB/sec
resp_short/resp_short    1.00     58.4±8.26ns  1485.9 MB/sec    2.86    167.3±1.67ns   518.7 MB/sec
uri/uri_0001b            1.00      6.3±0.64ns   303.4 MB/sec    18.62   117.1±1.01ns    16.3 MB/sec
uri/uri_0002b            1.00      7.3±1.16ns   391.7 MB/sec    16.19   118.3±1.67ns    24.2 MB/sec
uri/uri_0004b            1.00      9.3±0.40ns   512.3 MB/sec    13.12   122.1±3.56ns    39.0 MB/sec
uri/uri_0008b            1.00      5.9±0.22ns  1454.0 MB/sec    21.43  126.5±13.42ns    67.8 MB/sec
uri/uri_0016b            1.00      6.9±0.40ns     2.3 GB/sec    17.31   119.1±1.40ns   136.2 MB/sec
uri/uri_0032b            1.00      6.6±1.20ns     4.7 GB/sec    18.20   120.0±2.92ns   262.3 MB/sec
uri/uri_0064b            1.00      7.2±0.46ns     8.4 GB/sec    16.86   121.6±2.81ns   509.6 MB/sec
uri/uri_0128b            1.00      9.1±0.57ns    13.2 GB/sec    13.55   123.2±1.56ns   998.8 MB/sec
uri/uri_0256b            1.00     12.8±0.52ns    18.7 GB/sec    10.52   134.5±9.75ns  1821.9 MB/sec
uri/uri_0512b            1.00     20.2±1.11ns    23.6 GB/sec    6.88    139.2±3.70ns     3.4 GB/sec
uri/uri_1024b            1.00     35.4±2.58ns    26.9 GB/sec    4.60   163.0±18.05ns     5.9 GB/sec
uri/uri_2048b            1.00     65.4±4.08ns    29.2 GB/sec    3.15   205.7±18.82ns     9.3 GB/sec
uri/uri_4096b            1.00    132.1±8.96ns    28.9 GB/sec    2.26   299.2±55.29ns    12.8 GB/sec
version/http10           1.07      1.3±0.28ns     7.0 GB/sec    1.00      1.2±0.03ns     7.5 GB/sec
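
For reading the table above: as I understand critcmp, the ratio column is each run's time divided by the fastest run's time for that row. A minimal sketch of that arithmetic for the req/req row, using a hypothetical `slowdown_ratio` helper and the rounded means as printed (so other rows can differ in the last digit from what critcmp computes from unrounded data):

```rust
/// Hypothetical helper: how many times slower `time_ns` is than `fastest_ns`.
fn slowdown_ratio(time_ns: f64, fastest_ns: f64) -> f64 {
    time_ns / fastest_ns
}

/// Formats the req/req comparison using the means printed in the table:
/// main = 327.5 ns, avx2swar = 168.5 ns.
fn req_req_summary() -> String {
    let ratio = slowdown_ratio(327.5, 168.5);
    format!("req/req: main is {:.2}x slower than avx2swar", ratio)
}
```

That ratio is where the "~2x faster req/req" in the TLDR comes from.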

AaronO commented Sep 2, 2024

@seanmonstar @lucab I'll bench this against #175 for completeness' sake, but this is substantially simpler and more focused, should not regress aarch64 etc., and should be faster overall.

AaronO commented Sep 2, 2024

#175 vs #181 (this PR)

TLDR: this demonstrates that #181 provides the bulk of #175's benefits with a minimal, focused change to the core sse42/avx2 interplay issue.

> critcmp -t=5 pr-175 pr-181
group                    pr-175                                  pr-181
-----                    ------                                  ------
header/count_128         1.18  1513.1±353.72ns   485.3 MB/sec    1.00  1277.3±64.41ns   574.9 MB/sec
header/name_1024b        1.00    310.5±7.39ns     3.1 GB/sec     1.08   335.8±10.04ns     2.9 GB/sec
header/name_4096b        1.00  1145.4±22.27ns     3.3 GB/sec     1.11  1266.2±35.40ns     3.0 GB/sec
header/value_1024b       1.12     60.6±7.16ns    15.8 GB/sec     1.00     54.0±6.61ns    17.8 GB/sec
header/value_2048b       1.19   111.3±13.59ns    17.2 GB/sec     1.00     93.9±5.99ns    20.4 GB/sec
header/value_4096b       1.17   228.6±54.54ns    16.7 GB/sec     1.00   194.8±52.47ns    19.6 GB/sec
method/custom            1.07      4.7±0.25ns     3.8 GB/sec     1.00      4.4±0.23ns     4.1 GB/sec
method/delete            1.08      4.7±0.46ns     3.7 GB/sec     1.00      4.4±0.31ns     4.0 GB/sec
method/head              1.13      3.9±0.61ns     4.1 GB/sec     1.00      3.4±0.15ns     4.6 GB/sec
method/patch             1.07      5.1±1.45ns     3.3 GB/sec     1.00      4.8±1.25ns     3.5 GB/sec
req_short/req_short      1.00     49.4±1.69ns  1312.6 MB/sec     1.14     56.2±8.88ns  1154.1 MB/sec
resp/resp                1.00   177.3±10.63ns     3.7 GB/sec     1.11    196.1±4.99ns     3.3 GB/sec
resp_short/resp_short    1.00     56.4±9.72ns  1539.2 MB/sec     1.32    74.2±42.30ns  1169.3 MB/sec
uri/uri_1024b            1.09     38.2±4.97ns    25.0 GB/sec     1.00     35.2±1.76ns    27.1 GB/sec
uri/uri_2048b            1.18     76.4±2.88ns    25.0 GB/sec     1.00     64.8±1.26ns    29.4 GB/sec
uri/uri_4096b            1.08    141.6±8.17ns    26.9 GB/sec     1.00    131.3±4.65ns    29.1 GB/sec

@lucab @seanmonstar I think we should land this: there are no regressions on aarch64, and it is a focused change on the core problem. The other improvements #175 provided can be explored in focused follow-ups.

@AaronO AaronO changed the title from "perf(simd): avx2 fallack to swar instead of sse4.2" to "perf(simd): avx2 fallback to swar instead of sse4.2" Sep 2, 2024
AaronO commented Sep 2, 2024

For good measure, I compared master and #181, both built with target-cpu=native (which basically exercises simd::avx2 directly, bypassing simd::runtime), to test the perf upper bound on x64:

> critcmp -t=5 main-native pr-181-native
group                 main-native                             pr-181-native
-----                 -----------                             -------------
header/count_001      1.00     18.6±2.29ns   410.1 MB/sec     1.20     22.4±3.86ns   341.3 MB/sec
header/count_004      1.00    46.8±14.86ns   529.5 MB/sec     1.20    56.3±22.04ns   440.4 MB/sec
header/count_008      1.74   123.7±24.41ns   385.6 MB/sec     1.00     71.0±5.33ns   671.9 MB/sec
header/count_016      1.24   194.6±54.63ns   480.2 MB/sec     1.00    157.5±6.81ns   593.2 MB/sec
header/count_032      1.21   367.6±78.38ns   503.4 MB/sec     1.00   303.4±10.46ns   609.9 MB/sec
header/count_064      1.13  664.8±107.64ns   553.7 MB/sec     1.00   589.3±15.97ns   624.7 MB/sec
header/count_128      1.17  1388.5±303.20ns   528.9 MB/sec    1.00  1185.1±59.27ns   619.6 MB/sec
header/name_0004b     1.00     19.0±2.72ns   552.0 MB/sec     1.09     20.7±3.19ns   507.3 MB/sec
header/name_0064b     1.00     36.3±2.35ns  1863.8 MB/sec     1.08     39.2±9.02ns  1728.0 MB/sec
header/name_0128b     1.05     57.0±3.22ns     2.2 GB/sec     1.00     54.0±2.86ns     2.3 GB/sec
header/name_0256b     1.00     98.6±5.44ns     2.5 GB/sec     1.06   104.9±21.57ns     2.3 GB/sec
header/name_0512b     1.07   191.4±16.16ns     2.5 GB/sec     1.00    179.5±5.83ns     2.7 GB/sec
header/name_1024b     1.07   358.0±16.83ns     2.7 GB/sec     1.00   333.3±14.38ns     2.9 GB/sec
header/name_4096b     1.15  1450.0±259.57ns     2.6 GB/sec    1.00  1257.4±46.92ns     3.0 GB/sec
header/value_0004b    1.00     19.4±2.06ns   539.9 MB/sec     1.15     22.4±6.94ns   468.5 MB/sec
header/value_0008b    1.00     19.0±2.06ns   753.7 MB/sec     1.37     26.0±7.92ns   550.0 MB/sec
header/value_0256b    1.00     50.9±9.02ns     4.8 GB/sec     1.13    57.6±56.91ns     4.3 GB/sec
method/custom         1.00      4.7±0.27ns     3.7 GB/sec     1.15      5.4±1.09ns     3.2 GB/sec
method/delete         1.00      4.7±0.21ns     3.8 GB/sec     1.11      5.2±1.16ns     3.4 GB/sec
method/patch          1.00      4.6±1.08ns     3.6 GB/sec     1.06      4.9±1.50ns     3.4 GB/sec
method/post           1.33      2.9±0.48ns     5.5 GB/sec     1.00      2.2±0.08ns     7.3 GB/sec
method/trace          1.00      4.5±0.97ns     3.8 GB/sec     1.07      4.8±1.40ns     3.5 GB/sec
resp/resp             1.07   217.5±12.68ns     3.0 GB/sec     1.00    204.0±5.14ns     3.2 GB/sec
uri/uri_0001b         1.80      5.6±0.24ns   340.5 MB/sec     1.00      3.1±0.14ns   612.8 MB/sec
uri/uri_0002b         1.83      6.8±1.30ns   418.4 MB/sec     1.00      3.7±0.18ns   765.2 MB/sec
uri/uri_0004b         1.46      8.4±0.24ns   569.1 MB/sec     1.00      5.7±0.80ns   833.5 MB/sec
uri/uri_0008b         1.81      5.6±0.35ns  1524.5 MB/sec     1.00      3.1±0.06ns     2.7 GB/sec
uri/uri_0016b         1.55      7.8±2.70ns     2.0 GB/sec     1.00      5.1±1.61ns     3.1 GB/sec
uri/uri_0032b         1.71      5.9±0.26ns     5.2 GB/sec     1.00      3.5±0.29ns     8.9 GB/sec
uri/uri_0064b         1.56      6.8±0.34ns     8.9 GB/sec     1.00      4.4±0.30ns    13.9 GB/sec
uri/uri_0128b         1.25     11.9±0.46ns    10.1 GB/sec     1.00      9.5±0.41ns    12.6 GB/sec
uri/uri_0256b         1.27     28.0±1.08ns     8.5 GB/sec     1.00     22.1±1.66ns    10.8 GB/sec
uri/uri_0512b         1.18    64.6±12.17ns     7.4 GB/sec     1.00     54.6±1.71ns     8.7 GB/sec
uri/uri_2048b         1.00    350.4±9.02ns     5.4 GB/sec     1.16    406.1±4.22ns     4.7 GB/sec
version/partial       1.00      3.0±0.10ns     2.2 GB/sec     1.05      3.1±0.07ns     2.1 GB/sec
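
For context on the two build modes compared in these benchmarks, here is a minimal sketch of compile-time selection versus runtime dispatch. The `Backend` enum and `select_backend` function are hypothetical names for illustration only, not the crate's internals; the only real APIs used are `cfg!(target_feature = ...)` and the standard `is_x86_feature_detected!` macro.

```rust
/// Hypothetical label for which scanning routine would be used.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Backend {
    Avx2,
    Sse42,
    Swar,
}

/// Hypothetical selector, assuming a three-tier AVX2 / SSE4.2 / SWAR setup.
fn select_backend() -> Backend {
    // Compile-time path: when built with `-C target-cpu=native` on an AVX2
    // machine (or `-C target-feature=+avx2`), this condition is a constant,
    // so no runtime check is needed.
    if cfg!(target_feature = "avx2") {
        return Backend::Avx2;
    }

    // Runtime path: a generic x86_64 build asks the CPU what it supports.
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            return Backend::Avx2;
        }
        if is_x86_feature_detected!("sse4.2") {
            return Backend::Sse42;
        }
    }

    // Portable fallback that needs no CPU features at all.
    Backend::Swar
}
```

The generic build pays for the runtime branch and for whatever the compiler cannot inline across it, which is why the numbers here are a useful upper bound rather than what default users see.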

@seanmonstar seanmonstar merged commit 47853d7 into seanmonstar:master Sep 3, 2024
41 checks passed
@AaronO AaronO deleted the perf/avx2swar branch September 3, 2024 16:30