Try SimdUnicode for Utf8 validation #103860

Closed
wants to merge 4 commits into from

Member

@EgorBo EgorBo commented Jun 23, 2024

Just a CI test for #103781 without changes

@dotnet dotnet deleted a comment from EgorBot Jun 23, 2024
Member Author

EgorBo commented Jun 23, 2024

@EgorBot -arm64 -amd -intel -profiler

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Text.Unicode;

BenchmarkRunner.Run<Utf8Bench>(args: args);

public class Utf8Bench
{
    public static IEnumerable<byte[]> TestData()
    {
        // 1069 bytes, ru-ru Lorem ipsum
        yield return "Лорем ипсум долор сит амет, хас тале феугаит ех, мел дицит сонет сцрипта ид? Еррорибус темпорибус адверсариум про те, видит ностер хас не, яуод феугаит цу ест. Но дицунт рецусабо диссентиас цум, оптион евертитур ан вих. Но мел антиопам молестиае, продессет абхорреант витуператорибус ат сит, дицант глориатур персецути при еу. При еяуидем пхаедрум рецусабо ех, не вим ерант вертерем Ехерци семпер те нец. Ид нолуиссе детерруиссет нам, яуо ан адхуц дицит пертинациа, мел тота цлита цомпрехенсам ид? Ид аугуе граецис еффициенди вис, ат анимал фиерент инструцтиор пер, не виде еффициенди при!"u8.ToArray();

        // 527 bytes, ru-ru Lorem ipsum
        yield return "Лорем ипсум долор сит амет, хас тале феугаит ех, мел дицит сонет сцрипта ид? Еррорибус темпорибус адверсариум про те, видит ностер хас не, яуод феугаит цу ест. Но дицунт рецусабо диссентиас цум, оптион евертитур ан вих. Но мел антиопам молестиае, продессет абхорреант витуператорибус ат сит"u8.ToArray();

        // 286 bytes, ru-ru Lorem ipsum
        yield return "Лорем ипсум долор сит амет, хас тале феугаит ех, мел дицит сонет сцрипта ид? Еррорибус темпорибус адверсариум про те, видит ностер хас не, яуод феугаит цу ест."u8.ToArray();

        // 136 bytes, ru-ru Lorem ipsum
        yield return "Лорем ипсум долор сит амет, хас тале феугаит ех, мел дицит сонет сцрипта ид?"u8.ToArray();
    }

    [Benchmark]
    [ArgumentsSource(nameof(TestData))]
    public bool Count(byte[] str) => Utf8.IsValid(str);
}

EgorBot commented Jun 23, 2024

Benchmark results on Intel
BenchmarkDotNet v0.13.12, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Intel Xeon Platinum 8370C CPU 2.80GHz, 1 CPU, 16 logical and 8 physical cores
  Job-QSZTTH : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-BBPJER : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
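| Method | Toolchain | str        | Mean      | Error    | Ratio |
|------- |---------- |----------- |----------:|---------:|------:|
| Count  | Main      | Byte[1069] | 365.73 ns | 0.175 ns | 1.00  |
| Count  | PR        | Byte[1069] | 151.30 ns | 0.042 ns | 0.41  |
| Count  | Main      | Byte[136]  |  48.55 ns | 0.018 ns | 0.13  |
| Count  | PR        | Byte[136]  |  30.77 ns | 0.017 ns | 0.08  |
| Count  | Main      | Byte[286]  |  88.41 ns | 0.041 ns | 0.24  |
| Count  | PR        | Byte[286]  |  75.95 ns | 0.312 ns | 0.21  |
| Count  | Main      | Byte[527]  | 173.61 ns | 0.026 ns | 0.47  |
| Count  | PR        | Byte[527]  |  69.24 ns | 0.137 ns | 0.19  |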

BDN_Artifacts.zip

Flame graphs: Main vs PR 🔥
Hot asm: Main vs PR
Hot functions: Main vs PR

For clean perf results, make sure you have just one [Benchmark] in your app.

EgorBot commented Jun 23, 2024

Benchmark results on Amd
BenchmarkDotNet v0.13.12, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
AMD EPYC 7763, 1 CPU, 16 logical and 8 physical cores
  Job-UXIZLI : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-CZHAJC : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX2
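| Method | Toolchain | str        | Mean      | Error    | Ratio |
|------- |---------- |----------- |----------:|---------:|------:|
| Count  | Main      | Byte[1069] | 373.54 ns | 6.417 ns | 1.00  |
| Count  | PR        | Byte[1069] | 108.23 ns | 0.009 ns | 0.29  |
| Count  | Main      | Byte[136]  |  44.05 ns | 0.287 ns | 0.12  |
| Count  | PR        | Byte[136]  |  33.95 ns | 0.032 ns | 0.09  |
| Count  | Main      | Byte[286]  |  91.98 ns | 0.292 ns | 0.25  |
| Count  | PR        | Byte[286]  |  90.22 ns | 0.064 ns | 0.24  |
| Count  | Main      | Byte[527]  | 174.20 ns | 0.197 ns | 0.47  |
| Count  | PR        | Byte[527]  |  69.96 ns | 0.031 ns | 0.19  |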

BDN_Artifacts.zip

Flame graphs: Main vs PR 🔥
Hot asm: Main vs PR
Hot functions: Main vs PR

For clean perf results, make sure you have just one [Benchmark] in your app.

EgorBot commented Jun 23, 2024

Benchmark results on Arm64
BenchmarkDotNet v0.13.12, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Unknown processor
  Job-DDPRTX : .NET 9.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
  Job-SVITPF : .NET 9.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
| Method | Toolchain | str        | Mean      | Error    | Ratio |
|------- |---------- |----------- |----------:|---------:|------:|
| Count  | Main      | Byte[1069] | 549.63 ns | 4.424 ns | 1.00  |
| Count  | PR        | Byte[1069] | 522.30 ns | 0.301 ns | 0.95  |
| Count  | Main      | Byte[136]  |  73.82 ns | 0.222 ns | 0.13  |
| Count  | PR        | Byte[136]  |  93.26 ns | 0.220 ns | 0.17  |
| Count  | Main      | Byte[286]  | 145.59 ns | 0.589 ns | 0.26  |
| Count  | PR        | Byte[286]  | 169.56 ns | 0.256 ns | 0.31  |
| Count  | Main      | Byte[527]  | 268.20 ns | 2.657 ns | 0.49  |
| Count  | PR        | Byte[527]  | 280.00 ns | 0.224 ns | 0.51  |

BDN_Artifacts.zip

Flame graphs: Main vs PR 🔥
Hot asm: Main vs PR
Hot functions: Main vs PR

For clean perf results, make sure you have just one [Benchmark] in your app.

Member Author

EgorBo commented Jun 23, 2024

So, my understanding is that GetPointerToFirstInvalidByte is already well optimized when the input is fully ASCII (thanks to the Ascii.GetIndexOfFirstNonAsciiByte workhorse), so SimdUnicode is expected to help with large inputs (>128 bytes) that are predominantly non-ASCII, e.g. Russian (ru-ru) text.

See the benchmark results above. I used 3 different CPUs:

  • AMD EPYC 7763 (AVX2): up to 3.4x improvement
  • Intel Xeon Platinum 8370C (AVX512): up to 2.5x improvement
  • Ampere Altra (arm64, NEON): no improvements, small regressions

Also, it looks like SimdUnicode is a bit slower when the input is semi-ASCII (e.g. mostly ASCII with rare non-ASCII symbols in random places).

cc @lemire @Nick-Nuon @GrabYourPitchforks

I don't think it is my call whether we take it or not, considering the amount of changes, the importance of these APIs, and the fact that it only kicks in for large inputs. I'll leave that decision to the System.Text.* code owners. My only comment is that it is nice to see optimizations in the non-ASCII space, since .NET is fairly ASCII-focused. E.g. this API in Main:

  • ASCII input: processing data with up to 64 bytes at a time (avx512)
  • non-ASCII: processing data 4 bytes at a time + overhead from "maybe it's ASCII now?" checks

Still, I am surprised we don't see improvements with this PR on arm64.
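
To make the ASCII fast path mentioned above more concrete, here is a minimal sketch of a vectorized "find the first non-ASCII byte" scan. It is not the runtime's Ascii.GetIndexOfFirstNonAsciiByte (which is internal and considerably more elaborate); the function name IndexOfFirstNonAscii and the fixed 16-byte block size are assumptions made for illustration. It only uses the public Vector128 API available since .NET 7.

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;
using System.Text;

// Hypothetical helper: returns the index of the first byte >= 0x80,
// or bytes.Length when the whole input is ASCII.
static int IndexOfFirstNonAscii(ReadOnlySpan<byte> bytes)
{
    int i = 0;

    // Vector path: test 16 bytes per iteration by collecting each byte's top bit.
    for (; bytes.Length - i >= 16; i += 16)
    {
        Vector128<byte> block = Vector128.Create(bytes.Slice(i, 16));
        uint highBits = block.ExtractMostSignificantBits();
        if (highBits != 0)
        {
            // The lowest set bit corresponds to the first non-ASCII byte in the block.
            return i + BitOperations.TrailingZeroCount(highBits);
        }
    }

    // Scalar tail for the remaining (fewer than 16) bytes.
    for (; i < bytes.Length; i++)
    {
        if (bytes[i] >= 0x80) return i;
    }

    return bytes.Length;
}

byte[] asciiOnly = Encoding.UTF8.GetBytes("Lorem ipsum dolor sit amet, consectetur");
byte[] mixedText = Encoding.UTF8.GetBytes("Lorem ipsum dolor sit амет");

Console.WriteLine(IndexOfFirstNonAscii(asciiOnly) == asciiOnly.Length);
Console.WriteLine(IndexOfFirstNonAscii(mixedText));
```

Once such a scan reports a non-ASCII byte, a validator has to switch to multi-byte sequence checks, which is the slower path the comment above refers to.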

Member Author

EgorBo commented Jun 23, 2024

Closing as the intent of this PR was to validate changes on CI + run initial benchmarks, see #103860 (comment)

@EgorBo EgorBo closed this Jun 23, 2024

lemire commented Jun 23, 2024

@EgorBo We could look at the Ampere performance.

However, we have had good luck with Graviton and Apple Silicon. We could also examine the cases where the new code is less favourable generally.

lemire commented Jun 23, 2024

In the initial release, we did not request inlining for some functions and there are small overhead issues that were not tuned. I did some initial tuning (very little) and here are the results on my Apple M2.
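
For readers unfamiliar with inlining requests in .NET, the following is a small illustrative sketch, not code from SimdUnicode. The helper names IsContinuationByte and CountContinuationBytes are hypothetical; the only real API used is the MethodImpl attribute with MethodImplOptions.AggressiveInlining, which asks the JIT to inline a small hot helper into its caller where possible.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Text;

// Hypothetical helper: UTF-8 continuation bytes have the bit pattern 10xxxxxx.
// AggressiveInlining is a hint to the JIT, not a guarantee.
[MethodImpl(MethodImplOptions.AggressiveInlining)]
static bool IsContinuationByte(byte value) => (value & 0xC0) == 0x80;

// Hypothetical caller with a tight loop, where call overhead would matter.
static int CountContinuationBytes(ReadOnlySpan<byte> utf8)
{
    int count = 0;
    foreach (byte value in utf8)
    {
        if (IsContinuationByte(value)) count++;
    }
    return count;
}

byte[] sample = Encoding.UTF8.GetBytes("Лорем");
Console.WriteLine(CountContinuationBytes(sample));
```

Each Cyrillic letter in the sample encodes as one lead byte plus one continuation byte, so the count equals the number of letters.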

We are now going to include your benchmark data. Hopefully, you've not included swear words in there. :-)

As you can see, at least on my Apple M2, there are gains across the board.

| Method | FileName | Mean | Error | StdDev | Speed (GB/s) |
|------- |--------- |-----:|------:|-------:|-------------:|
| SIMDUtf8ValidationRealData | data/Arabic-Lipsum.utf8.txt | 11,150.23 ns | 4,140.044 ns | 226.930 ns | 7.33 |
| DotnetRuntimeUtf8ValidationRealData | data/Arabic-Lipsum.utf8.txt | 23,578.99 ns | 1,198.002 ns | 65.667 ns | 3.46 |
| SIMDUtf8ValidationRealData | data/Bogatov1069.utf8.txt | 173.65 ns | 3.222 ns | 0.177 ns | 6.16 |
| DotnetRuntimeUtf8ValidationRealData | data/Bogatov1069.utf8.txt | 288.64 ns | 32.560 ns | 1.785 ns | 3.70 |
| SIMDUtf8ValidationRealData | data/Bogatov136.utf8.txt | 31.41 ns | 4.503 ns | 0.247 ns | 4.33 |
| DotnetRuntimeUtf8ValidationRealData | data/Bogatov136.utf8.txt | 34.26 ns | 43.513 ns | 2.385 ns | 3.97 |
| SIMDUtf8ValidationRealData | data/Bogatov286.utf8.txt | 61.45 ns | 0.840 ns | 0.046 ns | 4.65 |
| DotnetRuntimeUtf8ValidationRealData | data/Bogatov286.utf8.txt | 67.26 ns | 3.259 ns | 0.179 ns | 4.25 |
| SIMDUtf8ValidationRealData | data/Bogatov527.utf8.txt | 95.69 ns | 2.519 ns | 0.138 ns | 5.51 |
| DotnetRuntimeUtf8ValidationRealData | data/Bogatov527.utf8.txt | 137.41 ns | 7.460 ns | 0.409 ns | 3.84 |
| SIMDUtf8ValidationRealData | data/Chinese-Lipsum.utf8.txt | 9,555.57 ns | 4,697.847 ns | 257.505 ns | 7.31 |
| DotnetRuntimeUtf8ValidationRealData | data/Chinese-Lipsum.utf8.txt | 14,697.34 ns | 473.859 ns | 25.974 ns | 4.75 |
| SIMDUtf8ValidationRealData | data/Emoji-Lipsum.utf8.txt | 8,848.78 ns | 409.647 ns | 22.454 ns | 7.41 |
| DotnetRuntimeUtf8ValidationRealData | data/Emoji-Lipsum.utf8.txt | 26,584.62 ns | 5,986.146 ns | 328.121 ns | 2.47 |
| SIMDUtf8ValidationRealData | data/Hebrew-Lipsum.utf8.txt | 9,000.64 ns | 870.451 ns | 47.712 ns | 7.39 |
| DotnetRuntimeUtf8ValidationRealData | data/Hebrew-Lipsum.utf8.txt | 19,236.22 ns | 1,125.231 ns | 61.678 ns | 3.46 |
| SIMDUtf8ValidationRealData | data/Hindi-Lipsum.utf8.txt | 11,847.37 ns | 394.000 ns | 21.596 ns | 7.43 |
| DotnetRuntimeUtf8ValidationRealData | data/Hindi-Lipsum.utf8.txt | 30,548.37 ns | 13,419.182 ns | 735.551 ns | 2.88 |
| SIMDUtf8ValidationRealData | data/Japanese-Lipsum.utf8.txt | 9,144.03 ns | 631.614 ns | 34.621 ns | 7.42 |
| DotnetRuntimeUtf8ValidationRealData | data/Japanese-Lipsum.utf8.txt | 14,796.14 ns | 4,665.479 ns | 255.731 ns | 4.58 |
| SIMDUtf8ValidationRealData | data/Korean-Lipsum.utf8.txt | 8,988.45 ns | 408.843 ns | 22.410 ns | 7.41 |
| DotnetRuntimeUtf8ValidationRealData | data/Korean-Lipsum.utf8.txt | 36,644.97 ns | 2,960.584 ns | 162.280 ns | 1.82 |
| SIMDUtf8ValidationRealData | data/Latin-Lipsum.utf8.txt | 987.38 ns | 29.382 ns | 1.611 ns | 88.05 |
| DotnetRuntimeUtf8ValidationRealData | data/Latin-Lipsum.utf8.txt | 2,314.99 ns | 80.496 ns | 4.412 ns | 37.56 |
| SIMDUtf8ValidationRealData | data/Russian-Lipsum.utf8.txt | 14,084.62 ns | 151.114 ns | 8.283 ns | 7.44 |
| DotnetRuntimeUtf8ValidationRealData | data/Russian-Lipsum.utf8.txt | 37,369.82 ns | 11,672.600 ns | 639.815 ns | 2.80 |
| SIMDUtf8ValidationRealData | data/arabic.utf8.txt | 82,187.91 ns | 50,502.236 ns | 2,768.198 ns | 6.50 |
| DotnetRuntimeUtf8ValidationRealData | data/arabic.utf8.txt | 214,777.68 ns | 35,829.127 ns | 1,963.915 ns | 2.49 |
| SIMDUtf8ValidationRealData | data/chinese.utf8.txt | 23,066.47 ns | 14,097.739 ns | 772.745 ns | 7.86 |
| DotnetRuntimeUtf8ValidationRealData | data/chinese.utf8.txt | 47,578.13 ns | 17,421.400 ns | 954.926 ns | 3.81 |
| SIMDUtf8ValidationRealData | data/czech.utf8.txt | 25,166.32 ns | 89,823.452 ns | 4,923.526 ns | 6.07 |
| DotnetRuntimeUtf8ValidationRealData | data/czech.utf8.txt | 36,974.96 ns | 15,225.221 ns | 834.546 ns | 4.13 |
| SIMDUtf8ValidationRealData | data/english.utf8.txt | 9,915.88 ns | 6,771.447 ns | 371.166 ns | 39.37 |
| DotnetRuntimeUtf8ValidationRealData | data/english.utf8.txt | 16,871.46 ns | 1,028.389 ns | 56.369 ns | 23.14 |
| SIMDUtf8ValidationRealData | data/esperanto.utf8.txt | 5,592.24 ns | 547.801 ns | 30.027 ns | 15.55 |
| DotnetRuntimeUtf8ValidationRealData | data/esperanto.utf8.txt | 9,920.69 ns | 3,405.275 ns | 186.655 ns | 8.77 |
| SIMDUtf8ValidationRealData | data/french.utf8.txt | 80,702.61 ns | 103,849.158 ns | 5,692.323 ns | 5.54 |
| DotnetRuntimeUtf8ValidationRealData | data/french.utf8.txt | 91,890.73 ns | 41,256.677 ns | 2,261.418 ns | 4.86 |
| SIMDUtf8ValidationRealData | data/german.utf8.txt | 19,670.23 ns | 10,211.790 ns | 559.743 ns | 10.46 |
| DotnetRuntimeUtf8ValidationRealData | data/german.utf8.txt | 22,855.59 ns | 13,218.194 ns | 724.534 ns | 9.00 |
| SIMDUtf8ValidationRealData | data/greek.utf8.txt | 18,830.37 ns | 9,524.973 ns | 522.096 ns | 9.63 |
| DotnetRuntimeUtf8ValidationRealData | data/greek.utf8.txt | 59,530.19 ns | 5,484.062 ns | 300.600 ns | 3.05 |
| SIMDUtf8ValidationRealData | data/hebrew.utf8.txt | 26,444.46 ns | 7,183.847 ns | 393.771 ns | 7.19 |
| DotnetRuntimeUtf8ValidationRealData | data/hebrew.utf8.txt | 87,323.57 ns | 26,070.434 ns | 1,429.008 ns | 2.18 |
| SIMDUtf8ValidationRealData | data/hindi.utf8.txt | 43,328.59 ns | 19,721.008 ns | 1,080.975 ns | 9.15 |
| DotnetRuntimeUtf8ValidationRealData | data/hindi.utf8.txt | 174,122.43 ns | 79,698.156 ns | 4,368.525 ns | 2.28 |
| SIMDUtf8ValidationRealData | data/japanese.utf8.txt | 17,843.57 ns | 5,313.663 ns | 291.260 ns | 9.21 |
| DotnetRuntimeUtf8ValidationRealData | data/japanese.utf8.txt | 41,309.88 ns | 73,678.015 ns | 4,038.541 ns | 3.98 |
| SIMDUtf8ValidationRealData | data/korean.utf8.txt | 10,928.90 ns | 755.720 ns | 41.424 ns | 8.95 |
| DotnetRuntimeUtf8ValidationRealData | data/korean.utf8.txt | 32,599.24 ns | 16,337.030 ns | 895.488 ns | 3.00 |
| SIMDUtf8ValidationRealData | data/persan.utf8.txt | 17,509.60 ns | 15,318.330 ns | 839.649 ns | 8.92 |
| DotnetRuntimeUtf8ValidationRealData | data/persan.utf8.txt | 52,632.66 ns | 96,509.749 ns | 5,290.025 ns | 2.97 |
| SIMDUtf8ValidationRealData | data/portuguese.utf8.txt | 33,704.32 ns | 103,801.494 ns | 5,689.710 ns | 8.33 |
| DotnetRuntimeUtf8ValidationRealData | data/portuguese.utf8.txt | 45,999.20 ns | 12,840.213 ns | 703.815 ns | 6.10 |
| SIMDUtf8ValidationRealData | data/russian.utf8.txt | 53,841.14 ns | 26,186.080 ns | 1,435.347 ns | 7.56 |
| DotnetRuntimeUtf8ValidationRealData | data/russian.utf8.txt | 176,708.49 ns | 23,215.930 ns | 1,272.543 ns | 2.30 |
| SIMDUtf8ValidationRealData | data/thai.utf8.txt | 79,428.83 ns | 50,407.553 ns | 2,763.008 ns | 7.47 |
| DotnetRuntimeUtf8ValidationRealData | data/thai.utf8.txt | 141,820.79 ns | 63,957.115 ns | 3,505.705 ns | 4.19 |
| SIMDUtf8ValidationRealData | data/turkish.utf8.txt | 25,698.57 ns | 32,743.238 ns | 1,794.767 ns | 7.59 |
| DotnetRuntimeUtf8ValidationRealData | data/turkish.utf8.txt | 41,395.77 ns | 41,718.767 ns | 2,286.746 ns | 4.71 |
| SIMDUtf8ValidationRealData | data/twitter.json | 25,077.72 ns | 1,052.253 ns | 57.678 ns | 25.18 |
| DotnetRuntimeUtf8ValidationRealData | data/twitter.json | 44,063.35 ns | 7,518.587 ns | 412.119 ns | 14.33 |
| SIMDUtf8ValidationRealData | data/vietnamese.utf8.txt | 45,562.76 ns | 58,772.320 ns | 3,221.509 ns | 7.00 |
| DotnetRuntimeUtf8ValidationRealData | data/vietnamese.utf8.txt | 211,912.50 ns | 13,985.042 ns | 766.567 ns | 1.51 |
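
As a reading aid for the Speed (GB/s) column, the figure appears to be input size divided by mean time, since one byte per nanosecond is one GB/s. The sketch below checks this against the Bogatov1069 rows, assuming that file is the 1069-byte input from the earlier benchmark; the function names ThroughputGBps and Speedup are hypothetical.

```csharp
using System;

// Hypothetical helper: bytes per nanosecond is numerically equal to GB/s (10^9 bytes per second).
static double ThroughputGBps(long inputBytes, double meanNanoseconds) => inputBytes / meanNanoseconds;

// Hypothetical helper: how many times faster the candidate is than the baseline.
static double Speedup(double baselineNanoseconds, double candidateNanoseconds) =>
    baselineNanoseconds / candidateNanoseconds;

// Mean times taken from the Bogatov1069 rows in the table above.
double simdMean = 173.65;
double runtimeMean = 288.64;

Console.WriteLine($"SimdUnicode: {ThroughputGBps(1069, simdMean):F2} GB/s");
Console.WriteLine($".NET:        {ThroughputGBps(1069, runtimeMean):F2} GB/s");
Console.WriteLine($"Speedup:     {Speedup(runtimeMean, simdMean):F2}x");
```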

lemire commented Jun 24, 2024

@EgorBo

I have updated results at simdutf/SimdUnicode#46 with positive benchmarks on AWS Graviton 3 (Neoverse V1). We just needed to do some minor tuning.

lemire commented Jun 24, 2024

On a Neoverse V1 (Graviton 3), our validation function is 1.3 to over four times faster than the standard library.

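| data set        | SimdUnicode speed (GB/s) | .NET speed (GB/s) | speed up |
|---------------- |-------------------------:|------------------:|---------:|
| Twitter.json    | 12                       | 8.7               | 1.4 x    |
| Arabic-Lipsum   | 3.4                      | 2.0               | 1.7 x    |
| Chinese-Lipsum  | 3.4                      | 2.6               | 1.3 x    |
| Emoji-Lipsum    | 3.4                      | 0.8               | 4.3 x    |
| Hebrew-Lipsum   | 3.4                      | 2.0               | 1.7 x    |
| Hindi-Lipsum    | 3.4                      | 1.6               | 2.1 x    |
| Japanese-Lipsum | 3.4                      | 2.4               | 1.4 x    |
| Korean-Lipsum   | 3.4                      | 1.3               | 2.6 x    |
| Latin-Lipsum    | 42                       | 17                | 2.5 x    |
| Russian-Lipsum  | 3.3                      | 0.95              | 3.5 x    |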

Please see simdutf/SimdUnicode#46 for full details, including @EgorBo's inputs which are now part of our benchmarks.

Member Author

EgorBo commented Jun 24, 2024

@EgorBot -arm64 -amd -profiler

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Text.Unicode;

BenchmarkRunner.Run<Utf8Bench>(args: args);

public class Utf8Bench
{
    public static IEnumerable<byte[]> TestData()
    {
        // 1069 bytes, ru-ru Lorem ipsum
        yield return "Лорем ипсум долор сит амет, хас тале феугаит ех, мел дицит сонет сцрипта ид? Еррорибус темпорибус адверсариум про те, видит ностер хас не, яуод феугаит цу ест. Но дицунт рецусабо диссентиас цум, оптион евертитур ан вих. Но мел антиопам молестиае, продессет абхорреант витуператорибус ат сит, дицант глориатур персецути при еу. При еяуидем пхаедрум рецусабо ех, не вим ерант вертерем Ехерци семпер те нец. Ид нолуиссе детерруиссет нам, яуо ан адхуц дицит пертинациа, мел тота цлита цомпрехенсам ид? Ид аугуе граецис еффициенди вис, ат анимал фиерент инструцтиор пер, не виде еффициенди при!"u8.ToArray();

        // 527 bytes, ru-ru Lorem ipsum
        yield return "Лорем ипсум долор сит амет, хас тале феугаит ех, мел дицит сонет сцрипта ид? Еррорибус темпорибус адверсариум про те, видит ностер хас не, яуод феугаит цу ест. Но дицунт рецусабо диссентиас цум, оптион евертитур ан вих. Но мел антиопам молестиае, продессет абхорреант витуператорибус ат сит"u8.ToArray();

        // 286 bytes, ru-ru Lorem ipsum
        yield return "Лорем ипсум долор сит амет, хас тале феугаит ех, мел дицит сонет сцрипта ид? Еррорибус темпорибус адверсариум про те, видит ностер хас не, яуод феугаит цу ест."u8.ToArray();

        // 136 bytes, ru-ru Lorem ipsum
        yield return "Лорем ипсум долор сит амет, хас тале феугаит ех, мел дицит сонет сцрипта ид?"u8.ToArray();
    }

    [Benchmark]
    [ArgumentsSource(nameof(TestData))]
    public bool Count(byte[] str) => Utf8.IsValid(str);
}

EgorBot commented Jun 24, 2024

Benchmark results on Amd
BenchmarkDotNet v0.13.12, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
AMD EPYC 7763, 1 CPU, 16 logical and 8 physical cores
  Job-RCJAOC : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-LNZCSK : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX2
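| Method | Toolchain | str        | Mean      | Error    | Ratio |
|------- |---------- |----------- |----------:|---------:|------:|
| Count  | Main      | Byte[1069] | 349.06 ns | 2.448 ns | 1.00  |
| Count  | PR        | Byte[1069] | 106.05 ns | 0.295 ns | 0.30  |
| Count  | Main      | Byte[136]  |  43.81 ns | 0.325 ns | 0.13  |
| Count  | PR        | Byte[136]  |  28.53 ns | 0.240 ns | 0.08  |
| Count  | Main      | Byte[286]  |  87.64 ns | 0.156 ns | 0.25  |
| Count  | PR        | Byte[286]  |  71.61 ns | 0.086 ns | 0.21  |
| Count  | Main      | Byte[527]  | 173.25 ns | 1.229 ns | 0.50  |
| Count  | PR        | Byte[527]  |  70.08 ns | 0.296 ns | 0.20  |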

BDN_Artifacts.zip

Flame graphs: Main vs PR 🔥
Hot asm: Main vs PR
Hot functions: Main vs PR

For clean perf results, make sure you have just one [Benchmark] in your app.

lemire commented Jun 24, 2024

The arm results are puzzling to me

I think that they are explained by the type of ARM processor being tested. I suspect it might be a processor with weak SIMD performance like Neoverse N1. A possibility to consider is to selectively enable SimdUnicode. I expect good performance on recent ARM-based Qualcomm processors.
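
One way to picture "selectively enabling" a SIMD validator is a small gate that combines input length with hardware checks. The sketch below is purely illustrative: the function ShouldUseSimdPath, the 128-byte threshold (borrowed from the ">128 bytes" remark earlier in the thread), and the switch name Hypothetical.Utf8.EnableSimdOnArm64 are all assumptions, and no SimdUnicode API is called. Avx2.IsSupported, AdvSimd.Arm64.IsSupported, AppContext.TryGetSwitch, and Utf8.IsValid are real .NET APIs.

```csharp
using System;
using System.Runtime.Intrinsics.Arm;
using System.Runtime.Intrinsics.X86;
using System.Text;
using System.Text.Unicode;

// Hypothetical policy: take the SIMD path only for large inputs, always on AVX2-capable x64,
// and on Arm64 only when explicitly opted in (since some Arm cores showed no gain).
static bool ShouldUseSimdPath(int length, bool hasAvx2, bool armOptedIn) =>
    length >= 128 && (hasAvx2 || armOptedIn);

// Hypothetical opt-in switch; the name is made up for this sketch.
AppContext.TryGetSwitch("Hypothetical.Utf8.EnableSimdOnArm64", out bool armSwitchEnabled);
bool armEligible = AdvSimd.Arm64.IsSupported && armSwitchEnabled;

byte[] payload = Encoding.UTF8.GetBytes(new string('ж', 200));
bool useSimd = ShouldUseSimdPath(payload.Length, Avx2.IsSupported, armEligible);

// A real integration would dispatch to the SIMD validator when useSimd is true.
// This sketch always falls back to the built-in validator.
bool isValid = Utf8.IsValid(payload);
Console.WriteLine($"useSimd={useSimd}, isValid={isValid}");
```

Note that .NET does not expose the specific Arm core model through these checks, so distinguishing a Neoverse N1 from a faster core would need some other signal; the opt-in switch here stands in for that decision.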

lemire commented Jun 24, 2024

Here are the results with a Windows ARM dev kit (2023) with a Snapdragon 8cx Gen 3. They are pretty good!!!

Of course, that's not going to be an important platform but I do not yet have access to hardware with the latest Qualcomm processors.

| Method | FileName | Mean | Error | StdDev | Speed (GB/s) |
|------- |--------- |-----:|------:|-------:|-------------:|
| SIMDUtf8ValidationRealData | data/Arabic-Lipsum.utf8.txt | 20,854.86 ns | 2,899.710 ns | 158.943 ns | 3.93 |
| DotnetRuntimeUtf8ValidationRealData | data/Arabic-Lipsum.utf8.txt | 36,472.92 ns | 5,674.598 ns | 311.044 ns | 2.25 |
| SIMDUtf8ValidationRealData | data/Bogatov1069.utf8.txt | 298.74 ns | 37.990 ns | 2.082 ns | 3.58 |
| DotnetRuntimeUtf8ValidationRealData | data/Bogatov1069.utf8.txt | 463.87 ns | 172.728 ns | 9.468 ns | 2.30 |
| SIMDUtf8ValidationRealData | data/Bogatov136.utf8.txt | 56.54 ns | 5.002 ns | 0.274 ns | 2.41 |
| DotnetRuntimeUtf8ValidationRealData | data/Bogatov136.utf8.txt | 64.33 ns | 12.238 ns | 0.671 ns | 2.11 |
| SIMDUtf8ValidationRealData | data/Bogatov286.utf8.txt | 103.52 ns | 28.395 ns | 1.556 ns | 2.76 |
| DotnetRuntimeUtf8ValidationRealData | data/Bogatov286.utf8.txt | 122.99 ns | 4.852 ns | 0.266 ns | 2.33 |
| SIMDUtf8ValidationRealData | data/Bogatov527.utf8.txt | 166.78 ns | 48.576 ns | 2.663 ns | 3.16 |
| DotnetRuntimeUtf8ValidationRealData | data/Bogatov527.utf8.txt | 228.53 ns | 20.843 ns | 1.142 ns | 2.31 |
| SIMDUtf8ValidationRealData | data/Chinese-Lipsum.utf8.txt | 17,856.64 ns | 1,337.536 ns | 73.315 ns | 3.93 |
| DotnetRuntimeUtf8ValidationRealData | data/Chinese-Lipsum.utf8.txt | 24,567.18 ns | 3,896.106 ns | 213.559 ns | 2.85 |
| SIMDUtf8ValidationRealData | data/Emoji-Lipsum.utf8.txt | 16,430.91 ns | 607.015 ns | 33.273 ns | 3.99 |
| DotnetRuntimeUtf8ValidationRealData | data/Emoji-Lipsum.utf8.txt | 70,915.81 ns | 7,409.099 ns | 406.118 ns | 0.92 |
| SIMDUtf8ValidationRealData | data/Hebrew-Lipsum.utf8.txt | 16,775.94 ns | 795.629 ns | 43.611 ns | 3.98 |
| DotnetRuntimeUtf8ValidationRealData | data/Hebrew-Lipsum.utf8.txt | 29,303.90 ns | 4,875.980 ns | 267.269 ns | 2.28 |
| SIMDUtf8ValidationRealData | data/Hindi-Lipsum.utf8.txt | 22,299.15 ns | 1,833.863 ns | 100.520 ns | 3.96 |
| DotnetRuntimeUtf8ValidationRealData | data/Hindi-Lipsum.utf8.txt | 48,596.18 ns | 15,311.568 ns | 839.279 ns | 1.81 |
| SIMDUtf8ValidationRealData | data/Japanese-Lipsum.utf8.txt | 17,016.59 ns | 823.252 ns | 45.125 ns | 4.00 |
| DotnetRuntimeUtf8ValidationRealData | data/Japanese-Lipsum.utf8.txt | 25,304.66 ns | 2,220.489 ns | 121.712 ns | 2.69 |
| SIMDUtf8ValidationRealData | data/Korean-Lipsum.utf8.txt | 16,930.03 ns | 1,849.846 ns | 101.396 ns | 3.95 |
| DotnetRuntimeUtf8ValidationRealData | data/Korean-Lipsum.utf8.txt | 45,835.33 ns | 4,527.210 ns | 248.152 ns | 1.46 |
| SIMDUtf8ValidationRealData | data/Latin-Lipsum.utf8.txt | 1,781.82 ns | 142.077 ns | 7.788 ns | 49.13 |
| DotnetRuntimeUtf8ValidationRealData | data/Latin-Lipsum.utf8.txt | 4,411.55 ns | 519.027 ns | 28.450 ns | 19.84 |
| SIMDUtf8ValidationRealData | data/Russian-Lipsum.utf8.txt | 26,488.20 ns | 1,004.885 ns | 55.081 ns | 3.97 |
| DotnetRuntimeUtf8ValidationRealData | data/Russian-Lipsum.utf8.txt | 85,751.90 ns | 2,763.754 ns | 151.491 ns | 1.23 |
| SIMDUtf8ValidationRealData | data/arabic.utf8.txt | 116,702.66 ns | 10,659.092 ns | 584.261 ns | 4.62 |
| DotnetRuntimeUtf8ValidationRealData | data/arabic.utf8.txt | 304,136.14 ns | 5,860.072 ns | 321.210 ns | 1.77 |
| SIMDUtf8ValidationRealData | data/chinese.utf8.txt | 39,437.25 ns | 14,305.781 ns | 784.148 ns | 4.65 |
| DotnetRuntimeUtf8ValidationRealData | data/chinese.utf8.txt | 82,737.08 ns | 65,550.916 ns | 3,593.067 ns | 2.21 |
| SIMDUtf8ValidationRealData | data/czech.utf8.txt | 35,543.92 ns | 8,874.709 ns | 486.453 ns | 4.36 |
| DotnetRuntimeUtf8ValidationRealData | data/czech.utf8.txt | 64,462.97 ns | 25,539.564 ns | 1,399.910 ns | 2.40 |
| SIMDUtf8ValidationRealData | data/english.utf8.txt | 15,141.65 ns | 1,261.944 ns | 69.171 ns | 26.10 |
| DotnetRuntimeUtf8ValidationRealData | data/english.utf8.txt | 21,920.44 ns | 3,256.271 ns | 178.487 ns | 18.03 |
| SIMDUtf8ValidationRealData | data/esperanto.utf8.txt | 10,056.58 ns | 640.372 ns | 35.101 ns | 8.78 |
| DotnetRuntimeUtf8ValidationRealData | data/esperanto.utf8.txt | 12,170.85 ns | 1,218.702 ns | 66.801 ns | 7.25 |
| SIMDUtf8ValidationRealData | data/french.utf8.txt | 124,682.22 ns | 14,272.929 ns | 782.347 ns | 3.63 |
| DotnetRuntimeUtf8ValidationRealData | data/french.utf8.txt | 113,978.42 ns | 13,604.615 ns | 745.715 ns | 3.97 |
| SIMDUtf8ValidationRealData | data/german.utf8.txt | 28,153.13 ns | 19,448.296 ns | 1,066.027 ns | 7.42 |
| DotnetRuntimeUtf8ValidationRealData | data/german.utf8.txt | 26,574.55 ns | 991.445 ns | 54.344 ns | 7.86 |
| SIMDUtf8ValidationRealData | data/greek.utf8.txt | 34,434.19 ns | 10,687.947 ns | 585.842 ns | 5.31 |
| DotnetRuntimeUtf8ValidationRealData | data/greek.utf8.txt | 84,145.71 ns | 2,522.699 ns | 138.278 ns | 2.17 |
| SIMDUtf8ValidationRealData | data/hebrew.utf8.txt | 49,756.07 ns | 6,599.353 ns | 361.733 ns | 3.87 |
| DotnetRuntimeUtf8ValidationRealData | data/hebrew.utf8.txt | 128,841.85 ns | 11,767.071 ns | 644.993 ns | 1.49 |
| SIMDUtf8ValidationRealData | data/hindi.utf8.txt | 77,642.78 ns | 16,456.606 ns | 902.042 ns | 5.14 |
| DotnetRuntimeUtf8ValidationRealData | data/hindi.utf8.txt | 238,930.54 ns | 10,400.088 ns | 570.064 ns | 1.67 |
| SIMDUtf8ValidationRealData | data/japanese.utf8.txt | 38,426.26 ns | 5,781.659 ns | 316.912 ns | 4.32 |
| DotnetRuntimeUtf8ValidationRealData | data/japanese.utf8.txt | 61,724.55 ns | 41,734.358 ns | 2,287.601 ns | 2.69 |
| SIMDUtf8ValidationRealData | data/korean.utf8.txt | 21,931.73 ns | 2,790.311 ns | 152.946 ns | 4.51 |
| DotnetRuntimeUtf8ValidationRealData | data/korean.utf8.txt | 44,778.40 ns | 10,278.324 ns | 563.390 ns | 2.21 |
| SIMDUtf8ValidationRealData | data/persan.utf8.txt | 34,680.67 ns | 5,538.926 ns | 303.607 ns | 4.56 |
| DotnetRuntimeUtf8ValidationRealData | data/persan.utf8.txt | 87,568.89 ns | 36,970.803 ns | 2,026.494 ns | 1.80 |
| SIMDUtf8ValidationRealData | data/portuguese.utf8.txt | 61,765.02 ns | 9,398.294 ns | 515.152 ns | 4.60 |
| DotnetRuntimeUtf8ValidationRealData | data/portuguese.utf8.txt | 52,783.79 ns | 11,341.717 ns | 621.678 ns | 5.38 |
| SIMDUtf8ValidationRealData | data/russian.utf8.txt | 92,324.51 ns | 24,401.787 ns | 1,337.544 ns | 4.45 |
| DotnetRuntimeUtf8ValidationRealData | data/russian.utf8.txt | 242,225.92 ns | 13,553.657 ns | 742.922 ns | 1.70 |
| SIMDUtf8ValidationRealData | data/thai.utf8.txt | 118,329.35 ns | 61,702.085 ns | 3,382.099 ns | 5.05 |
| DotnetRuntimeUtf8ValidationRealData | data/thai.utf8.txt | 209,130.95 ns | 20,627.516 ns | 1,130.664 ns | 2.86 |
| SIMDUtf8ValidationRealData | data/turkish.utf8.txt | 41,358.75 ns | 16,789.474 ns | 920.288 ns | 4.77 |
| DotnetRuntimeUtf8ValidationRealData | data/turkish.utf8.txt | 66,221.55 ns | 21,508.969 ns | 1,178.979 ns | 2.98 |
| SIMDUtf8ValidationRealData | data/twitter.json | 45,178.43 ns | 3,359.353 ns | 184.137 ns | 14.32 |
| DotnetRuntimeUtf8ValidationRealData | data/twitter.json | 63,361.09 ns | 2,731.947 ns | 149.747 ns | 10.21 |
| SIMDUtf8ValidationRealData | data/vietnamese.utf8.txt | 78,546.02 ns | 69,696.623 ns | 3,820.307 ns | 4.10 |
| DotnetRuntimeUtf8ValidationRealData | data/vietnamese.utf8.txt | 260,017.63 ns | 20,662.423 ns | 1,132.577 ns | 1.24 |

lemire commented Jun 24, 2024

@EgorBo I have done a bit more tuning, see simdutf/SimdUnicode#48

I have not tested on a Neoverse N1, but I am not overly optimistic given that these tiny chips have poor SIMD performance.

Member Author

EgorBo commented Jun 25, 2024

Numbers on Apple M2 Max (high performance profile, on charge):

BenchmarkDotNet v0.13.12, macOS Sonoma 14.5 (23F79) [Darwin 23.5.0]
Apple M2 Max, 1 CPU, 12 logical and 12 physical cores
.NET SDK 9.0.100-preview.4.24267.66
  [Host]     : .NET 8.0.4 (8.0.424.16909), Arm64 RyuJIT AdvSIMD
  Job-BPOYMQ : .NET 9.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
  Job-GJBNYV : .NET 9.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD


| Method | Toolchain     | str        | Mean      |
|------- |-------------- |----------- |----------:|
| Count  | /Main/corerun | Byte[1069] | 282.68 ns |
| Count  | /PR/corerun   | Byte[1069] | 182.91 ns |

| Count  | /Main/corerun | Byte[136]  |  33.08 ns |
| Count  | /PR/corerun   | Byte[136]  |  33.86 ns |

| Count  | /Main/corerun | Byte[286]  |  67.12 ns |
| Count  | /PR/corerun   | Byte[286]  |  62.97 ns |

| Count  | /Main/corerun | Byte[527]  | 131.71 ns |
| Count  | /PR/corerun   | Byte[527]  |  99.18 ns |

So the numbers definitely look a lot better on Apple M-series CPUs.

EgorBot commented Jun 25, 2024

Benchmark results on Arm64
BenchmarkDotNet v0.13.12, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Unknown processor
  Job-HIFNUI : .NET 9.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
  Job-RANSKF : .NET 9.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
EnvironmentVariables=DOTNET_EnableAVX=0,DOTNET_ReadyToRun=0,DOTNET_TieredCompilation=0  StdDev=0.07 ns
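| Method | Toolchain | str        | Mean     | Error   | Ratio |
|------- |---------- |----------- |---------:|--------:|------:|
| Count  | Main      | Byte[1069] | 561.6 ns | 0.09 ns | 1.00  |
| Count  | PR        | Byte[1069] | 550.2 ns | 0.08 ns | 0.98  |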

BDN_Artifacts.zip

Flame graphs: Main vs PR 🔥
Hot asm: Main vs PR
Hot functions: Main vs PR

For clean perf results, make sure you have just one [Benchmark] in your app.

Member Author

EgorBo commented Jun 25, 2024

@EgorBot -arm64 --envvars DOTNET_EnableAVX:0 DOTNET_ReadyToRun:0 DOTNET_TieredCompilation:0

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.InteropServices;
using System.Text.Unicode;

BenchmarkRunner.Run<Utf8Bench>(args: args);

public unsafe class Utf8Bench
{
    void* _aligned64 = null;

    [GlobalSetup]
    public void GlobalSetup()
    {
        _aligned64 = NativeMemory.AlignedAlloc(2048, 64);
        ReadOnlySpan<byte> testData = "Лорем ипсум долор сит амет, хас тале феугаит ех, мел дицит сонет сцрипта ид? Еррорибус темпорибус адверсариум про те, видит ностер хас не, яуод феугаит цу ест. Но дицунт рецусабо диссентиас цум, оптион евертитур ан вих. Но мел антиопам молестиае, продессет абхорреант витуператорибус ат сит, дицант глориатур персецути при еу. При еяуидем пхаедрум рецусабо ех, не вим ерант вертерем Ехерци семпер те нец. Ид нолуиссе детерруиссет нам, яуо ан адхуц дицит пертинациа, мел тота цлита цомпрехенсам ид? Ид аугуе граецис еффициенди вис, ат анимал фиерент инструцтиор пер, не виде еффициенди при!"u8;
        testData.CopyTo(new Span<byte>(_aligned64, 2048));
    }

    [GlobalCleanup]
    public void GlobalCleanup() => NativeMemory.Free(_aligned64);

    [Benchmark]
    public bool IsValid_aligned() => Utf8.IsValid(new Span<byte>(_aligned64, 1069));
}

@dotnet dotnet deleted a comment from EgorBot Jun 25, 2024

EgorBot commented Jun 25, 2024

Benchmark results on Arm64
BenchmarkDotNet v0.13.12, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Unknown processor
  Job-OLTAUW : .NET 9.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
  Job-FLKSTZ : .NET 9.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
EnvironmentVariables=DOTNET_EnableAVX=0,DOTNET_ReadyToRun=0,DOTNET_TieredCompilation=0
| Method          | Toolchain | Mean     | Error   | Ratio |
|---------------- |---------- |---------:|--------:|------:|
| IsValid_aligned | Main      | 557.8 ns | 0.17 ns | 1.00  |
| IsValid_aligned | PR        | 541.2 ns | 1.14 ns | 0.97  |

BDN_Artifacts.zip

lemire commented Jun 25, 2024

The problem on Neoverse N1 is that you have just two SIMD execution units, and the 8-bit add-across has terrible performance. Yet we need, somehow, to count the number of 4-byte characters. That becomes a bottleneck. We could amortize the cost over wider iterations, but I'd rather avoid that if I can.
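
To illustrate what amortizing the horizontal add could look like, here is a rough sketch using the portable Vector128 API. It is not SimdUnicode's code: the function names are hypothetical, and it assumes the input is already valid UTF-8, so every byte at or above 0xF0 is the lead byte of a 4-byte character. Per-lane counts are accumulated with plain vector subtraction, and the across-lanes sum (the expensive step on some cores) runs once per batch of up to 255 blocks instead of once per block.

```csharp
using System;
using System.Linq;
using System.Runtime.Intrinsics;
using System.Text;

// Scalar reference: count lead bytes of 4-byte sequences.
static int CountFourByteLeadsScalar(ReadOnlySpan<byte> utf8)
{
    int count = 0;
    foreach (byte b in utf8)
    {
        if (b >= 0xF0) count++;
    }
    return count;
}

// Vector sketch with an amortized horizontal reduction.
static int CountFourByteLeadsVector(ReadOnlySpan<byte> utf8)
{
    int total = 0;
    int i = 0;
    Vector128<byte> threshold = Vector128.Create((byte)0xF0);

    while (utf8.Length - i >= 16)
    {
        Vector128<byte> lanes = Vector128<byte>.Zero;

        // At most 255 blocks per batch so that no 8-bit lane can overflow.
        for (int n = 0; n < 255 && utf8.Length - i >= 16; n++, i += 16)
        {
            Vector128<byte> block = Vector128.Create(utf8.Slice(i, 16));
            // Matching lanes become 0xFF, which is -1 as a byte, so subtracting adds 1.
            Vector128<byte> mask = Vector128.GreaterThanOrEqual(block, threshold);
            lanes -= mask;
        }

        // One across-lanes sum per batch, widened to 16 bits to avoid wraparound.
        total += Vector128.Sum(Vector128.WidenLower(lanes)) + Vector128.Sum(Vector128.WidenUpper(lanes));
    }

    // Scalar tail for the remaining (fewer than 16) bytes.
    for (; i < utf8.Length; i++)
    {
        if (utf8[i] >= 0xF0) total++;
    }

    return total;
}

byte[] emojiText = Encoding.UTF8.GetBytes(string.Concat(Enumerable.Repeat("x😀", 40)));
Console.WriteLine(CountFourByteLeadsScalar(emojiText));
Console.WriteLine(CountFourByteLeadsVector(emojiText));
```

The trade-off is extra loop structure and a larger minimum batch before the reduction pays off, which may be part of why the comment above prefers to avoid it.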

See simdutf/SimdUnicode#49

On a Neoverse N1 (Graviton 2), our validation function is up to three times faster than the standard library.

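| data set        | SimdUnicode speed (GB/s) | .NET speed (GB/s) | speed up |
|---------------- |-------------------------:|------------------:|---------:|
| Twitter.json    | 7.0                      | 5.7               | 1.2 x    |
| Arabic-Lipsum   | 2.2                      | 0.9               | 2.4 x    |
| Chinese-Lipsum  | 2.1                      | 1.8               | 1.1 x    |
| Emoji-Lipsum    | 1.8                      | 0.7               | 2.6 x    |
| Hebrew-Lipsum   | 2.0                      | 0.9               | 2.2 x    |
| Hindi-Lipsum    | 2.0                      | 1.0               | 2.0 x    |
| Japanese-Lipsum | 2.1                      | 1.7               | 1.2 x    |
| Korean-Lipsum   | 2.2                      | 1.0               | 2.2 x    |
| Latin-Lipsum    | 24                       | 13                | 1.8 x    |
| Russian-Lipsum  | 2.1                      | 0.7               | 3.0 x    |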

On your Russian (?) tests, the results are rather neutral:

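| data set    | SimdUnicode speed (GB/s) | .NET speed (GB/s) | speed up |
|------------ |-------------------------:|------------------:|---------:|
| Bogatov1069 | 1.85                     | 1.61              | 1.15 x   |
| Bogatov136  | 1.39                     | 1.37              | 1.0 x    |
| Bogatov286  | 1.48                     | 1.53              | 0.97 x   |
| Bogatov527  | 1.65                     | 1.56              | 1.06 x   |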

Member Author

EgorBo commented Jun 25, 2024

@EgorBot -arm64

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.InteropServices;
using System.Text.Unicode;

BenchmarkRunner.Run<Utf8Bench>(args: args);

public unsafe class Utf8Bench
{
    void* _aligned64 = null;

    [GlobalSetup]
    public void GlobalSetup()
    {
        _aligned64 = NativeMemory.AlignedAlloc(2048, 64);
        ReadOnlySpan<byte> testData = "Лорем ипсум долор сит амет, хас тале феугаит ех, мел дицит сонет сцрипта ид? Еррорибус темпорибус адверсариум про те, видит ностер хас не, яуод феугаит цу ест. Но дицунт рецусабо диссентиас цум, оптион евертитур ан вих. Но мел антиопам молестиае, продессет абхорреант витуператорибус ат сит, дицант глориатур персецути при еу. При еяуидем пхаедрум рецусабо ех, не вим ерант вертерем Ехерци семпер те нец. Ид нолуиссе детерруиссет нам, яуо ан адхуц дицит пертинациа, мел тота цлита цомпрехенсам ид? Ид аугуе граецис еффициенди вис, ат анимал фиерент инструцтиор пер, не виде еффициенди при!"u8;
        testData.CopyTo(new Span<byte>(_aligned64, 2048));
    }

    [GlobalCleanup]
    public void GlobalCleanup() => NativeMemory.Free(_aligned64);

    [Benchmark]
    public bool IsValid_aligned() => Utf8.IsValid(new Span<byte>(_aligned64, 1069));
}

Member Author

EgorBo commented Jun 25, 2024

@EgorBot -arm64 -profiler

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Text.Unicode;

BenchmarkRunner.Run<Utf8Bench>(args: args);

public class Utf8Bench
{
    public static IEnumerable<byte[]> TestData()
    {
        // 1069 bytes, ru-ru Lorem ipsum
        yield return "Лорем ипсум долор сит амет, хас тале феугаит ех, мел дицит сонет сцрипта ид? Еррорибус темпорибус адверсариум про те, видит ностер хас не, яуод феугаит цу ест. Но дицунт рецусабо диссентиас цум, оптион евертитур ан вих. Но мел антиопам молестиае, продессет абхорреант витуператорибус ат сит, дицант глориатур персецути при еу. При еяуидем пхаедрум рецусабо ех, не вим ерант вертерем Ехерци семпер те нец. Ид нолуиссе детерруиссет нам, яуо ан адхуц дицит пертинациа, мел тота цлита цомпрехенсам ид? Ид аугуе граецис еффициенди вис, ат анимал фиерент инструцтиор пер, не виде еффициенди при!"u8.ToArray();

        // 527 bytes, ru-ru Lorem ipsum
        yield return "Лорем ипсум долор сит амет, хас тале феугаит ех, мел дицит сонет сцрипта ид? Еррорибус темпорибус адверсариум про те, видит ностер хас не, яуод феугаит цу ест. Но дицунт рецусабо диссентиас цум, оптион евертитур ан вих. Но мел антиопам молестиае, продессет абхорреант витуператорибус ат сит"u8.ToArray();

        // 286 bytes, ru-ru Lorem ipsum
        yield return "Лорем ипсум долор сит амет, хас тале феугаит ех, мел дицит сонет сцрипта ид? Еррорибус темпорибус адверсариум про те, видит ностер хас не, яуод феугаит цу ест."u8.ToArray();

        // 136 bytes, ru-ru Lorem ipsum
        yield return "Лорем ипсум долор сит амет, хас тале феугаит ех, мел дицит сонет сцрипта ид?"u8.ToArray();
    }

    [Benchmark]
    [ArgumentsSource(nameof(TestData))]
    public bool Count(byte[] str) => Utf8.IsValid(str);
}

@EgorBot

EgorBot commented Jun 25, 2024

Benchmark results on Arm64
BenchmarkDotNet v0.13.12, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Unknown processor
  Job-GWQVXL : .NET 9.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
  Job-XTHJKJ : .NET 9.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
| Method | Toolchain | Mean | Error | Ratio |
|---|---|---|---|---|
| IsValid_aligned | Main | 546.1 ns | 0.29 ns | 1.00 |
| IsValid_aligned | PR | 445.7 ns | 0.06 ns | 0.82 |

BDN_Artifacts.zip

@EgorBot

EgorBot commented Jun 25, 2024

Benchmark results on Arm64
BenchmarkDotNet v0.13.12, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Unknown processor
  Job-NXHQXB : .NET 9.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
  Job-EBBMKR : .NET 9.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
| Method | Toolchain | str | Mean | Error | Ratio |
|---|---|---|---|---|---|
| Count | Main | Byte[1069] | 548.50 ns | 2.361 ns | 1.00 |
| Count | PR | Byte[1069] | 466.18 ns | 0.050 ns | 0.85 |
| Count | Main | Byte[136] | 75.74 ns | 0.462 ns | 0.14 |
| Count | PR | Byte[136] | 81.89 ns | 0.009 ns | 0.15 |
| Count | Main | Byte[286] | 148.92 ns | 0.413 ns | 0.27 |
| Count | PR | Byte[286] | 163.98 ns | 0.007 ns | 0.30 |
| Count | Main | Byte[527] | 270.42 ns | 1.866 ns | 0.49 |
| Count | PR | Byte[527] | 261.71 ns | 0.016 ns | 0.48 |

BDN_Artifacts.zip

Flame graphs: Main vs PR 🔥
Hot asm: Main vs PR
Hot functions: Main vs PR

For clean perf results, make sure you have just one [Benchmark] in your app.

@lemire

lemire commented Jun 25, 2024

@EgorBo

Thanks for running this...

You see the very clear bottleneck? That's where we compute the continuation bytes...

IMG_0410

It is not uqsub that is blocking... it is almost surely the next addv. The profiler must get it wrong.

@lemire

lemire commented Jun 26, 2024

@EgorBo

I have yet another design that should sidestep the performance issues on Neoverse N1:

simdutf/SimdUnicode#50

The trick is to do the sum across lanes only once per string (or once per block of ~4 kB for long strings). On a Graviton 2 box, I get pretty decent results.
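A minimal sketch of that deferred-reduction idea, written for illustration and not taken from SimdUnicode. It counts UTF-8 continuation bytes with per-lane byte counters and performs the horizontal sum once per block instead of once per vector. The names `ContinuationCount` and `Count` are hypothetical, and the 255-vector block limit is an assumption that follows from using byte-wide counters (255 × 16 = 4080 bytes, close to the ~4 kB figure).

```csharp
using System;
using System.Runtime.Intrinsics;

// Hypothetical helper for illustration only; not the SimdUnicode code.
static class ContinuationCount
{
    public static int Count(ReadOnlySpan<byte> data)
    {
        int total = 0;
        int i = 0;
        // Continuation bytes are 0x80..0xBF, i.e. -128..-65 when read as signed.
        Vector128<sbyte> threshold = Vector128.Create((sbyte)-64);

        while (data.Length - i >= 16)
        {
            // At most 255 vectors per block so that no byte lane can overflow.
            int blockVectors = Math.Min(255, (data.Length - i) / 16);
            Vector128<byte> acc = Vector128<byte>.Zero;

            for (int k = 0; k < blockVectors; k++, i += 16)
            {
                Vector128<sbyte> v = Vector128.Create(data.Slice(i, 16)).AsSByte();
                Vector128<sbyte> mask = Vector128.LessThan(v, threshold);
                // Matching lanes hold 0xFF; subtracting 0xFF adds 1 to that lane.
                acc -= mask.AsByte();
            }

            // The only horizontal reduction for this block. Widening first
            // keeps the 16 byte counters from overflowing when summed.
            total += Vector128.Sum(Vector128.WidenLower(acc) + Vector128.WidenUpper(acc));
        }

        // Scalar tail for the last few bytes.
        for (; i < data.Length; i++)
        {
            if ((sbyte)data[i] < -64) total++;
        }
        return total;
    }
}
```

The inner loop contains only a compare and a subtract, so the expensive across-vector add is paid once per block rather than once per 16 bytes.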

I am hoping that whatever Azure is using is about as good as a Graviton 2 (Neoverse N1).

| Method | FileName | Mean | Error | StdDev | Speed (GB/s) |
|---|---|---|---|---|---|
| SIMDUtf8ValidationRealData | data/Bogatov1069.utf8.txt | 511.27 ns | 6.848 ns | 0.375 ns | 2.09 |
| DotnetRuntimeUtf8ValidationRealData | data/Bogatov1069.utf8.txt | 665.46 ns | 3.637 ns | 0.199 ns | 1.61 |
| SIMDUtf8ValidationRealData | data/Bogatov527.utf8.txt | 281.23 ns | 16.955 ns | 0.929 ns | 1.87 |
| DotnetRuntimeUtf8ValidationRealData | data/Bogatov527.utf8.txt | 339.72 ns | 2.321 ns | 0.127 ns | 1.55 |
| SIMDUtf8ValidationRealData | data/Bogatov136.utf8.txt | 92.30 ns | 1.172 ns | 0.064 ns | 1.47 |
| DotnetRuntimeUtf8ValidationRealData | data/Bogatov136.utf8.txt | 97.24 ns | 0.702 ns | 0.038 ns | 1.40 |
| SIMDUtf8ValidationRealData | data/Bogatov286.utf8.txt | 174.95 ns | 5.197 ns | 0.285 ns | 1.63 |
| DotnetRuntimeUtf8ValidationRealData | data/Bogatov286.utf8.txt | 187.84 ns | 1.224 ns | 0.067 ns | 1.52 |

So for the 1 kB string, we have a 30% boost and about a 7% gain on the 286-byte string, with the 527-byte string in between (20% gain). These are not super impressive, but the Neoverse N1 has only two 16-byte SIMD execution units... there is only so much you can do.
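For readers who want to reproduce the percentages, here is a small sketch of the arithmetic, assuming the speed column is simply input bytes divided by mean time (bytes per nanosecond equals GB/s) and that the gain is the ratio of the two means. `Throughput`, `GigabytesPerSecond` and `SpeedupPercent` are hypothetical names.

```csharp
using System;

// Hypothetical helpers that restate the arithmetic behind the table above.
static class Throughput
{
    // One byte per nanosecond is 10^9 bytes per second, i.e. 1 GB/s.
    public static double GigabytesPerSecond(int bytes, double meanNs) => bytes / meanNs;

    // Relative gain of the candidate over the baseline, in percent.
    public static double SpeedupPercent(double baselineNs, double candidateNs) =>
        (baselineNs / candidateNs - 1.0) * 100.0;
}
```

With the 1069-byte row, `Throughput.GigabytesPerSecond(1069, 511.27)` comes out at roughly 2.09 and `Throughput.SpeedupPercent(665.46, 511.27)` at roughly 30.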

If you ingest a sizeable JSON file (a few hundred kilobytes), the results are more impressive:

| Method | FileName | Mean | Error | StdDev | Speed (GB/s) |
|---|---|---|---|---|---|
| SIMDUtf8ValidationRealData | data/twitter.json | 81.79 us | 6.258 us | 0.343 us | 7.72 |
| DotnetRuntimeUtf8ValidationRealData | data/twitter.json | 110.92 us | 2.834 us | 0.155 us | 5.69 |

So a 35% boost.

These results are not as impressive as on x64 with powerful SIMD capabilities (i.e., AVX-512) or on Apple Silicon, but they are not bad if they pan out in your tests.

It is possible that there are methodological issues and testing is definitely required, but the hotspot should be different with this new code.

Try and try again until you succeed. :-)

@lemire

lemire commented Jun 27, 2024

Graphical representation of my results on a Graviton 2 with PR simdutf/SimdUnicode#50

Bogatov1069
 SimdUnicode ▏  2.1 GB/s █████████████████████████
.NET Runtime ▏  1.6 GB/s ███████████████████▎

Bogatov527
 SimdUnicode ▏  1.9 GB/s █████████████████████████
.NET Runtime ▏  1.6 GB/s ████████████████████▋

Bogatov286
 SimdUnicode ▏  1.6 GB/s █████████████████████████
.NET Runtime ▏  1.5 GB/s ███████████████████████▎

Bogatov136
 SimdUnicode ▏  1.5 GB/s █████████████████████████
.NET Runtime ▏  1.4 GB/s ███████████████████████▊

Twitter.json
 SimdUnicode ▏  7.7 GB/s █████████████████████████
.NET Runtime ▏  5.7 GB/s ██████████████████▍

@EgorBo
Member Author

EgorBo commented Jun 29, 2024

@EgorBot -arm64

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Text.Unicode;

BenchmarkRunner.Run<Utf8Bench>(args: args);

public class Utf8Bench
{
    public static IEnumerable<byte[]> TestData()
    {
        // 1069 bytes, ru-ru Lorem ipsum
        yield return "Лорем ипсум долор сит амет, хас тале феугаит ех, мел дицит сонет сцрипта ид? Еррорибус темпорибус адверсариум про те, видит ностер хас не, яуод феугаит цу ест. Но дицунт рецусабо диссентиас цум, оптион евертитур ан вих. Но мел антиопам молестиае, продессет абхорреант витуператорибус ат сит, дицант глориатур персецути при еу. При еяуидем пхаедрум рецусабо ех, не вим ерант вертерем Ехерци семпер те нец. Ид нолуиссе детерруиссет нам, яуо ан адхуц дицит пертинациа, мел тота цлита цомпрехенсам ид? Ид аугуе граецис еффициенди вис, ат анимал фиерент инструцтиор пер, не виде еффициенди при!"u8.ToArray();

        // 527 bytes, ru-ru Lorem ipsum
        yield return "Лорем ипсум долор сит амет, хас тале феугаит ех, мел дицит сонет сцрипта ид? Еррорибус темпорибус адверсариум про те, видит ностер хас не, яуод феугаит цу ест. Но дицунт рецусабо диссентиас цум, оптион евертитур ан вих. Но мел антиопам молестиае, продессет абхорреант витуператорибус ат сит"u8.ToArray();

        // 286 bytes, ru-ru Lorem ipsum
        yield return "Лорем ипсум долор сит амет, хас тале феугаит ех, мел дицит сонет сцрипта ид? Еррорибус темпорибус адверсариум про те, видит ностер хас не, яуод феугаит цу ест."u8.ToArray();

        // 136 bytes, ru-ru Lorem ipsum
        yield return "Лорем ипсум долор сит амет, хас тале феугаит ех, мел дицит сонет сцрипта ид?"u8.ToArray();
    }

    [Benchmark]
    [ArgumentsSource(nameof(TestData))]
    public bool Count(byte[] str) => Utf8.IsValid(str);
}

@EgorBot

EgorBot commented Jun 29, 2024

Benchmark results on Arm64
BenchmarkDotNet v0.13.12, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Unknown processor
  Job-NTVQGL : .NET 9.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
  Job-TGEKUW : .NET 9.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
| Method | Toolchain | str | Mean | Error | Ratio |
|---|---|---|---|---|---|
| Count | Main | Byte[1069] | 549.93 ns | 4.071 ns | 1.00 |
| Count | PR | Byte[1069] | 428.50 ns | 0.051 ns | 0.78 |
| Count | Main | Byte[136] | 75.70 ns | 0.472 ns | 0.14 |
| Count | PR | Byte[136] | 79.91 ns | 0.030 ns | 0.15 |
| Count | Main | Byte[286] | 147.47 ns | 1.642 ns | 0.27 |
| Count | PR | Byte[286] | 147.56 ns | 0.058 ns | 0.27 |
| Count | Main | Byte[527] | 268.08 ns | 0.502 ns | 0.49 |
| Count | PR | Byte[527] | 238.32 ns | 0.016 ns | 0.43 |

BDN_Artifacts.zip

@lemire

lemire commented Jun 29, 2024

@EgorBo Your numbers broadly agree with my tests this time. ❤️

I submit to you that tuning, such as ensuring that GetPointerToFirstInvalidByteArm64 gets inlined, could further improve the results. :-)
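As a rough illustration of how inlining is usually requested in .NET, the sketch below marks a small helper with `MethodImplOptions.AggressiveInlining`. The class and method names are hypothetical and unrelated to the actual `GetPointerToFirstInvalidByteArm64` code, and the attribute is only a hint that the JIT may decline, so the generated assembly would still need checking.

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical example; it only shows where the attribute goes.
static class InliningSketch
{
    // Asks the JIT to inline this helper into its callers where it can.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static bool IsContinuationByte(byte b) => (b & 0xC0) == 0x80;

    public static int CountContinuationBytes(ReadOnlySpan<byte> data)
    {
        int count = 0;
        foreach (byte b in data)
        {
            // With inlining, this call becomes a mask and compare in the loop body.
            if (IsContinuationByte(b)) count++;
        }
        return count;
    }
}
```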

@EgorBo EgorBo closed this Jun 29, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Jul 31, 2024