Vectorize find_first_not_of/find_last_not_of member functions (multiple characters overloads) #5206

Merged: 16 commits, Mar 24, 2025

Conversation

AlexGuteniev
Contributor

@AlexGuteniev AlexGuteniev commented Dec 25, 2024

These are the two remaining members of the find_meow_of family.
Together with #5102, this should complete basic_string vectorization coverage.

This is a surprisingly non-trivial change. The not flavor has no early return from the inner (needle) loop, which severely affects the paths that have this inner loop.

⚙️ Product code changes

Added the implementation of find_meow_not_of for 8-bit and 16-bit characters.

There is no vectorization for 32-bit and 64-bit characters. We happen to support them in find_first_of because it exists as a free function callable with integers or pointers, but supporting them in find_first_not_of would require severely altering the specific AVX2 algorithm, which otherwise doesn't need to change.

The implementation is added to the existing functions via a template parameter, as in #5102. For the bitmap algorithms and the small needle path, it is only a matter of negating the results or inverting the bit mask, which is done.
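To illustrate the "only a matter of negation" point, here is a minimal scalar sketch of a bitmap lookup where a template parameter selects the flavor. It is not the PR's code: `Predicate` and `find_first_bitmap` are hypothetical names, and the real implementation is vectorized and lives in the STL's internal sources.

```cpp
#include <bitset>
#include <cstddef>
#include <string_view>

// Hypothetical names for illustration; not the STL's internal identifiers.
enum class Predicate { any_of, none_of };

template <Predicate Pred>
std::size_t find_first_bitmap(std::string_view haystack, std::string_view needle) {
    // One bit per possible 8-bit character value.
    std::bitset<256> table;
    for (const char ch : needle) {
        table.set(static_cast<unsigned char>(ch));
    }

    for (std::size_t i = 0; i < haystack.size(); ++i) {
        const bool in_needle = table.test(static_cast<unsigned char>(haystack[i]));
        // The two flavors differ only in whether the lookup result is negated.
        if (in_needle == (Pred == Predicate::any_of)) {
            return i;
        }
    }
    return std::string_view::npos;
}
```

Because the needle is folded into the table up front, there is no inner needle loop here, so the missing early return does not matter on this path.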

The fallback nested loop has a separate compile-time branch without early return.
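A rough sketch of what such a compile-time branch might look like in a scalar nested loop follows. The name `find_first_fallback` and the `bool Not` parameter are assumptions made for this example, not the identifiers used in the PR.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper; illustrates the idea only.
template <bool Not>
std::size_t find_first_fallback(std::string_view haystack, std::string_view needle) {
    for (std::size_t i = 0; i < haystack.size(); ++i) {
        if constexpr (Not) {
            // A position qualifies only after the whole needle has been checked,
            // so the function cannot return from inside the inner loop.
            bool in_needle = false;
            for (const char n : needle) {
                if (haystack[i] == n) {
                    in_needle = true;
                    break; // rejects this position; the outer loop continues
                }
            }
            if (!in_needle) {
                return i;
            }
        } else {
            for (const char n : needle) {
                if (haystack[i] == n) {
                    return i; // early return straight out of the inner loop
                }
            }
        }
    }
    return std::string_view::npos;
}
```

In the positive flavor a single match ends the whole search, while in the not flavor a result is known only once every needle character has failed to match.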

For the SSE4.2 large needle branch, in addition to the negation in the intrinsic parameter, we also need to switch to an inner loop without early return and combine the results. The _Test_whole_needle lambda now has a different loop depending on the template parameter. It was also changed to return the position, and to have an inner lambda _Step instead of them both. The lambda change could affect codegen in the non-not control path, but I don't expect much impact, if any.

🏁 Benchmark code changes

The fill strategy was altered to:

  • Remove limits on the needle length, so that any needle length can be benchmarked
  • Provide different coverage for the not member functions, which makes more sense for them

So the iota was dropped. Incremental values are still used to fill the needle, because a plain memset/std::fill would be boring.

💹 Performance expectations

The not functions are expected to perform almost the same as their positive counterparts. But of course, we can't have supersymmetry here.

The noticeably distinct thing is the SSE4.2 path, which uses different instructions. It has less control flow, but it uses PCMPESTRM instead of PCMPESTRI. Their performance is overall the same, but there are small differences on some CPUs: decent Intels tend to like PCMPESTRI, decent AMDs tend to show no difference, and older AMDs and power-saving Intels tend to like PCMPESTRM.

See the comparison on uops.info.

Apparently we're good at the large scale, and fine-tuning cannot be addressed anyway, so I didn't attempt to find new thresholds for the not functions.

⏱️ Benchmark results

i5 1235U

Benchmark Time (before) Time (after)
bm<AlgType::str_member_first_not, char>/2/3 5.98 ns 5.58 ns
bm<AlgType::str_member_first_not, char>/6/81 36.7 ns 23.8 ns
bm<AlgType::str_member_first_not, char>/7/4 12.5 ns 18.6 ns
bm<AlgType::str_member_first_not, char>/9/3 12.9 ns 16.6 ns
bm<AlgType::str_member_first_not, char>/22/5 16.2 ns 17.8 ns
bm<AlgType::str_member_first_not, char>/58/2 25.8 ns 17.6 ns
bm<AlgType::str_member_first_not, char>/75/85 63.8 ns 45.8 ns
bm<AlgType::str_member_first_not, char>/102/4 43.8 ns 19.5 ns
bm<AlgType::str_member_first_not, char>/200/46 82.3 ns 39.9 ns
bm<AlgType::str_member_first_not, char>/325/1 95.8 ns 38.2 ns
bm<AlgType::str_member_first_not, char>/400/50 131 ns 58.0 ns
bm<AlgType::str_member_first_not, char>/1011/11 262 ns 115 ns
bm<AlgType::str_member_first_not, char>/1280/46 338 ns 122 ns
bm<AlgType::str_member_first_not, char>/1502/23 380 ns 133 ns
bm<AlgType::str_member_first_not, char>/2203/54 563 ns 237 ns
bm<AlgType::str_member_first_not, char>/3056/7 740 ns 268 ns
bm<AlgType::str_member_first_not, wchar_t>/2/3 5.65 ns 6.15 ns
bm<AlgType::str_member_first_not, wchar_t>/6/81 38.3 ns 49.3 ns
bm<AlgType::str_member_first_not, wchar_t>/7/4 11.6 ns 14.5 ns
bm<AlgType::str_member_first_not, wchar_t>/9/3 11.3 ns 14.6 ns
bm<AlgType::str_member_first_not, wchar_t>/22/5 15.8 ns 15.3 ns
bm<AlgType::str_member_first_not, wchar_t>/58/2 29.8 ns 20.7 ns
bm<AlgType::str_member_first_not, wchar_t>/75/85 69.6 ns 52.8 ns
bm<AlgType::str_member_first_not, wchar_t>/102/4 52.7 ns 27.9 ns
bm<AlgType::str_member_first_not, wchar_t>/200/46 106 ns 50.4 ns
bm<AlgType::str_member_first_not, wchar_t>/325/1 132 ns 58.4 ns
bm<AlgType::str_member_first_not, wchar_t>/400/50 180 ns 65.7 ns
bm<AlgType::str_member_first_not, wchar_t>/1011/11 375 ns 139 ns
bm<AlgType::str_member_first_not, wchar_t>/1280/46 488 ns 155 ns
bm<AlgType::str_member_first_not, wchar_t>/1502/23 555 ns 186 ns
bm<AlgType::str_member_first_not, wchar_t>/2203/54 897 ns 266 ns
bm<AlgType::str_member_first_not, wchar_t>/3056/7 1120 ns 333 ns
bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/2/3 15.6 ns 17.4 ns
bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/6/81 22.6 ns 29.4 ns
bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/7/4 23.1 ns 15.3 ns
bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/9/3 34.4 ns 15.6 ns
bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/22/5 51.7 ns 16.0 ns
bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/58/2 112 ns 20.6 ns
bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/75/85 177 ns 196 ns
bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/102/4 203 ns 27.7 ns
bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/200/46 480 ns 275 ns
bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/325/1 445 ns 67.9 ns
bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/400/50 963 ns 623 ns
bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/1011/11 2773 ns 443 ns
bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/1280/46 3156 ns 1671 ns
bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/1502/23 3413 ns 867 ns
bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/2203/54 5532 ns 3279 ns
bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/3056/7 8053 ns 566 ns
bm<AlgType::str_member_last_not, char>/2/3 5.12 ns 5.42 ns
bm<AlgType::str_member_last_not, char>/6/81 32.4 ns 22.7 ns
bm<AlgType::str_member_last_not, char>/7/4 10.8 ns 17.3 ns
bm<AlgType::str_member_last_not, char>/9/3 11.1 ns 14.7 ns
bm<AlgType::str_member_last_not, char>/22/5 14.9 ns 15.6 ns
bm<AlgType::str_member_last_not, char>/58/2 27.6 ns 15.8 ns
bm<AlgType::str_member_last_not, char>/75/85 53.4 ns 40.5 ns
bm<AlgType::str_member_last_not, char>/102/4 45.5 ns 17.9 ns
bm<AlgType::str_member_last_not, char>/200/46 86.0 ns 36.5 ns
bm<AlgType::str_member_last_not, char>/325/1 103 ns 37.8 ns
bm<AlgType::str_member_last_not, char>/400/50 138 ns 57.5 ns
bm<AlgType::str_member_last_not, char>/1011/11 276 ns 116 ns
bm<AlgType::str_member_last_not, char>/1280/46 363 ns 138 ns
bm<AlgType::str_member_last_not, char>/1502/23 415 ns 144 ns
bm<AlgType::str_member_last_not, char>/2203/54 601 ns 210 ns
bm<AlgType::str_member_last_not, char>/3056/7 826 ns 263 ns
bm<AlgType::str_member_last_not, wchar_t>/2/3 5.82 ns 5.77 ns
bm<AlgType::str_member_last_not, wchar_t>/6/81 37.8 ns 43.8 ns
bm<AlgType::str_member_last_not, wchar_t>/7/4 9.71 ns 14.2 ns
bm<AlgType::str_member_last_not, wchar_t>/9/3 10.4 ns 14.2 ns
bm<AlgType::str_member_last_not, wchar_t>/22/5 15.7 ns 15.1 ns
bm<AlgType::str_member_last_not, wchar_t>/58/2 36.6 ns 19.0 ns
bm<AlgType::str_member_last_not, wchar_t>/75/85 78.3 ns 52.8 ns
bm<AlgType::str_member_last_not, wchar_t>/102/4 55.8 ns 26.5 ns
bm<AlgType::str_member_last_not, wchar_t>/200/46 114 ns 46.8 ns
bm<AlgType::str_member_last_not, wchar_t>/325/1 166 ns 42.5 ns
bm<AlgType::str_member_last_not, wchar_t>/400/50 187 ns 62.5 ns
bm<AlgType::str_member_last_not, wchar_t>/1011/11 381 ns 127 ns
bm<AlgType::str_member_last_not, wchar_t>/1280/46 539 ns 150 ns
bm<AlgType::str_member_last_not, wchar_t>/1502/23 563 ns 170 ns
bm<AlgType::str_member_last_not, wchar_t>/2203/54 847 ns 265 ns
bm<AlgType::str_member_last_not, wchar_t>/3056/7 1242 ns 375 ns
bm<AlgType::str_member_last_not, wchar_t, L'\x03B1'>/2/3 13.2 ns 14.7 ns
bm<AlgType::str_member_last_not, wchar_t, L'\x03B1'>/6/81 25.4 ns 29.9 ns
bm<AlgType::str_member_last_not, wchar_t, L'\x03B1'>/7/4 21.4 ns 14.1 ns
bm<AlgType::str_member_last_not, wchar_t, L'\x03B1'>/9/3 32.0 ns 14.4 ns
bm<AlgType::str_member_last_not, wchar_t, L'\x03B1'>/22/5 49.6 ns 14.9 ns
bm<AlgType::str_member_last_not, wchar_t, L'\x03B1'>/58/2 110 ns 19.5 ns
bm<AlgType::str_member_last_not, wchar_t, L'\x03B1'>/75/85 186 ns 211 ns
bm<AlgType::str_member_last_not, wchar_t, L'\x03B1'>/102/4 203 ns 26.9 ns
bm<AlgType::str_member_last_not, wchar_t, L'\x03B1'>/200/46 489 ns 309 ns
bm<AlgType::str_member_last_not, wchar_t, L'\x03B1'>/325/1 474 ns 65.4 ns
bm<AlgType::str_member_last_not, wchar_t, L'\x03B1'>/400/50 1151 ns 707 ns
bm<AlgType::str_member_last_not, wchar_t, L'\x03B1'>/1011/11 2455 ns 620 ns
bm<AlgType::str_member_last_not, wchar_t, L'\x03B1'>/1280/46 3207 ns 1924 ns
bm<AlgType::str_member_last_not, wchar_t, L'\x03B1'>/1502/23 4029 ns 1346 ns
bm<AlgType::str_member_last_not, wchar_t, L'\x03B1'>/2203/54 5595 ns 3755 ns
bm<AlgType::str_member_last_not, wchar_t, L'\x03B1'>/3056/7 7376 ns 557 ns

@AlexGuteniev AlexGuteniev requested a review from a team as a code owner December 25, 2024 12:59
@StephanTLavavej StephanTLavavej added the performance Must go faster label Jan 4, 2025
@StephanTLavavej StephanTLavavej self-assigned this Jan 4, 2025
@StephanTLavavej StephanTLavavej removed their assignment Jan 14, 2025
@StephanTLavavej StephanTLavavej self-assigned this Feb 24, 2025
@StephanTLavavej
Member

Thanks! 💚 I pushed minor changes, the most significant being CodeQL warning suppressions to avoid a headache in a few months.

5950X speedups look great:

Benchmark Before After Speedup
first_not, char>/2/3 9.60 ns 9.68 ns 0.99
first_not, char>/6/81 27.8 ns 21.9 ns 1.27
first_not, char>/7/4 11.1 ns 18.3 ns 0.61
first_not, char>/9/3 11.7 ns 16.3 ns 0.72
first_not, char>/22/5 19.2 ns 16.7 ns 1.15
first_not, char>/58/2 33.1 ns 17.3 ns 1.91
first_not, char>/75/85 65.3 ns 45.0 ns 1.45
first_not, char>/102/4 57.7 ns 18.1 ns 3.19
first_not, char>/200/46 109 ns 43.3 ns 2.52
first_not, char>/325/1 153 ns 26.7 ns 5.73
first_not, char>/400/50 197 ns 57.1 ns 3.45
first_not, char>/1011/11 453 ns 93.1 ns 4.87
first_not, char>/1280/46 575 ns 120 ns 4.79
first_not, char>/1502/23 661 ns 129 ns 5.12
first_not, char>/2203/54 963 ns 175 ns 5.50
first_not, char>/3056/7 1351 ns 209 ns 6.46
first_not, wchar_t>/2/3 8.96 ns 8.53 ns 1.05
first_not, wchar_t>/6/81 29.5 ns 52.5 ns 0.56
first_not, wchar_t>/7/4 11.4 ns 15.4 ns 0.74
first_not, wchar_t>/9/3 12.0 ns 15.7 ns 0.76
first_not, wchar_t>/22/5 18.7 ns 15.9 ns 1.18
first_not, wchar_t>/58/2 33.3 ns 17.6 ns 1.89
first_not, wchar_t>/75/85 65.4 ns 45.8 ns 1.43
first_not, wchar_t>/102/4 58.1 ns 23.6 ns 2.46
first_not, wchar_t>/200/46 109 ns 52.0 ns 2.10
first_not, wchar_t>/325/1 153 ns 39.6 ns 3.86
first_not, wchar_t>/400/50 197 ns 66.3 ns 2.97
first_not, wchar_t>/1011/11 449 ns 118 ns 3.81
first_not, wchar_t>/1280/46 571 ns 153 ns 3.73
first_not, wchar_t>/1502/23 666 ns 164 ns 4.06
first_not, wchar_t>/2203/54 967 ns 229 ns 4.22
first_not, wchar_t>/3056/7 1325 ns 290 ns 4.57
first_not, wchar_t, L'\x03B1'>/2/3 9.64 ns 9.88 ns 0.98
first_not, wchar_t, L'\x03B1'>/6/81 37.3 ns 21.3 ns 1.75
first_not, wchar_t, L'\x03B1'>/7/4 15.3 ns 15.3 ns 1.00
first_not, wchar_t, L'\x03B1'>/9/3 16.6 ns 15.7 ns 1.06
first_not, wchar_t, L'\x03B1'>/22/5 34.8 ns 15.8 ns 2.20
first_not, wchar_t, L'\x03B1'>/58/2 55.8 ns 17.5 ns 3.19
first_not, wchar_t, L'\x03B1'>/75/85 247 ns 88.1 ns 2.80
first_not, wchar_t, L'\x03B1'>/102/4 122 ns 22.0 ns 5.55
first_not, wchar_t, L'\x03B1'>/200/46 593 ns 136 ns 4.36
first_not, wchar_t, L'\x03B1'>/325/1 223 ns 39.9 ns 5.59
first_not, wchar_t, L'\x03B1'>/400/50 1155 ns 277 ns 4.17
first_not, wchar_t, L'\x03B1'>/1011/11 2129 ns 236 ns 9.02
first_not, wchar_t, L'\x03B1'>/1280/46 3510 ns 706 ns 4.97
first_not, wchar_t, L'\x03B1'>/1502/23 3724 ns 456 ns 8.17
first_not, wchar_t, L'\x03B1'>/2203/54 6200 ns 1384 ns 4.48
first_not, wchar_t, L'\x03B1'>/3056/7 4484 ns 268 ns 16.73
last_not, char>/2/3 8.11 ns 8.30 ns 0.98
last_not, char>/6/81 26.6 ns 21.6 ns 1.23
last_not, char>/7/4 9.64 ns 16.6 ns 0.58
last_not, char>/9/3 9.88 ns 14.3 ns 0.69
last_not, char>/22/5 13.8 ns 15.0 ns 0.92
last_not, char>/58/2 21.3 ns 14.9 ns 1.43
last_not, char>/75/85 47.0 ns 40.2 ns 1.17
last_not, char>/102/4 32.0 ns 16.6 ns 1.93
last_not, char>/200/46 68.6 ns 40.5 ns 1.69
last_not, char>/325/1 96.5 ns 27.4 ns 3.52
last_not, char>/400/50 113 ns 54.7 ns 2.07
last_not, char>/1011/11 234 ns 90.8 ns 2.58
last_not, char>/1280/46 297 ns 117 ns 2.54
last_not, char>/1502/23 344 ns 124 ns 2.77
last_not, char>/2203/54 498 ns 176 ns 2.83
last_not, char>/3056/7 675 ns 197 ns 3.43
last_not, wchar_t>/2/3 8.98 ns 8.78 ns 1.02
last_not, wchar_t>/6/81 47.2 ns 33.3 ns 1.42
last_not, wchar_t>/7/4 11.6 ns 14.0 ns 0.83
last_not, wchar_t>/9/3 10.9 ns 13.7 ns 0.80
last_not, wchar_t>/22/5 15.1 ns 14.4 ns 1.05
last_not, wchar_t>/58/2 25.4 ns 17.5 ns 1.45
last_not, wchar_t>/75/85 65.2 ns 44.5 ns 1.47
last_not, wchar_t>/102/4 40.1 ns 22.5 ns 1.78
last_not, wchar_t>/200/46 98.3 ns 50.0 ns 1.97
last_not, wchar_t>/325/1 120 ns 38.5 ns 3.12
last_not, wchar_t>/400/50 163 ns 64.2 ns 2.54
last_not, wchar_t>/1011/11 360 ns 114 ns 3.16
last_not, wchar_t>/1280/46 443 ns 148 ns 2.99
last_not, wchar_t>/1502/23 512 ns 158 ns 3.24
last_not, wchar_t>/2203/54 752 ns 221 ns 3.40
last_not, wchar_t>/3056/7 999 ns 270 ns 3.70
last_not, wchar_t, L'\x03B1'>/2/3 8.98 ns 8.94 ns 1.00
last_not, wchar_t, L'\x03B1'>/6/81 45.7 ns 20.5 ns 2.23
last_not, wchar_t, L'\x03B1'>/7/4 15.3 ns 13.7 ns 1.12
last_not, wchar_t, L'\x03B1'>/9/3 17.0 ns 13.9 ns 1.22
last_not, wchar_t, L'\x03B1'>/22/5 37.2 ns 14.1 ns 2.64
last_not, wchar_t, L'\x03B1'>/58/2 63.4 ns 17.3 ns 3.66
last_not, wchar_t, L'\x03B1'>/75/85 266 ns 94.7 ns 2.81
last_not, wchar_t, L'\x03B1'>/102/4 133 ns 21.7 ns 6.13
last_not, wchar_t, L'\x03B1'>/200/46 606 ns 148 ns 4.09
last_not, wchar_t, L'\x03B1'>/325/1 292 ns 43.2 ns 6.76
last_not, wchar_t, L'\x03B1'>/400/50 1199 ns 312 ns 3.84
last_not, wchar_t, L'\x03B1'>/1011/11 2190 ns 342 ns 6.40
last_not, wchar_t, L'\x03B1'>/1280/46 3783 ns 829 ns 4.56
last_not, wchar_t, L'\x03B1'>/1502/23 3818 ns 614 ns 6.22
last_not, wchar_t, L'\x03B1'>/2203/54 6337 ns 1593 ns 3.98
last_not, wchar_t, L'\x03B1'>/3056/7 5124 ns 269 ns 19.05

@StephanTLavavej StephanTLavavej removed their assignment Mar 7, 2025
@StephanTLavavej StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews Mar 7, 2025
@StephanTLavavej StephanTLavavej moved this from Ready To Merge to Merging in STL Code Reviews Mar 21, 2025
@StephanTLavavej StephanTLavavej self-assigned this Mar 21, 2025
@StephanTLavavej
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

StephanTLavavej added a commit to StephanTLavavej/STL that referenced this pull request Mar 21, 2025
@StephanTLavavej StephanTLavavej merged commit 185398a into microsoft:main Mar 24, 2025
39 checks passed
@github-project-automation github-project-automation bot moved this from Merging to Done in STL Code Reviews Mar 24, 2025
@StephanTLavavej
Member

Thanks thanks thanks! 🐱 🐈 🐈‍⬛

@AlexGuteniev AlexGuteniev deleted the not branch March 24, 2025 22:42