Performance regression come again on dotnet 9 #111016

Open
kingsznhone opened this issue Jan 1, 2025 · 8 comments · May be fixed by #112011
Labels: area-System.Numerics, in-pr (there is an active PR which will close this issue when it is merged), tenet-performance (performance related issue)

@kingsznhone

Description

This is a follow-up to #95954.
Using the temporary workaround from that issue, I added this segment to fix the .NET 8 performance regression. It brought performance back to about 90% of .NET 7:
https://github.com/kingsznhone/VSOP2013.NET/blob/8a9e03fd734d9c29de788877724e32022b042d21/VSOP2013.NET/Calculator.cs#L151

A few days ago I ran the perf test on .NET 9, and the same situation occurred as before.

Whether I apply that fix or not, performance is still very bad.

This line causes the heavy performance drop, as before:
(su, cu) = Math.SinCos(u);

Data


BenchmarkDotNet v0.13.11, Windows 11 (10.0.22631.4602/23H2/2023Update/SunValley3)
12th Gen Intel Core i9-12950HX, 1 CPU, 16 logical and 16 physical cores
.NET SDK 9.0.101
[Host] : .NET 8.0.11 (8.0.1124.51707), X64 RyuJIT AVX2 [AttachedDebugger]
.NET 8.0 : .NET 8.0.11 (8.0.1124.51707), X64 RyuJIT AVX2
.NET 9.0 : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX2

| Method  | Job      | Runtime  | Mean        | Error     | StdDev    | Ratio | Allocated | Alloc Ratio |
|-------- |--------- |--------- |------------:|----------:|----------:|------:|----------:|------------:|
| Compute | .NET 8.0 | .NET 8.0 |    739.2 μs |   8.05 μs |   7.14 μs |  1.00 |   2.91 KB |        1.00 |
|         |          |          |             |           |           |       |           |             |
| Compute | .NET 9.0 | .NET 9.0 | 29,018.1 μs | 183.88 μs | 172.00 μs |  1.00 |   2.92 KB |        1.00 |


@kingsznhone kingsznhone added the tenet-performance Performance related issue label Jan 1, 2025
@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Jan 1, 2025
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Jan 1, 2025
@vcsjones vcsjones added area-System.Numerics and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Jan 1, 2025
Contributor

Tagging subscribers to this area: @dotnet/area-system-numerics
See info in area-owners.md if you want to be subscribed.

@tannergooding
Member

I do measure a regression, but nowhere near what you're seeing (on Intel or AMD)

This is due to an MSVC correctness bug that exists in native: https://developercommunity.visualstudio.com/t/MSVCs-sincos-implementation-is-incorrec/10582378 and thus .NET 9 was fixed to no longer use the /fp:fast implementation and invokes the /fp:precise one instead. .NET 8 retains the buggy behavior and will return an incorrect Cos result for some large inputs.

The exact performance is a bit dependent on the input, but it should be pessimizing to approximately the same performance as if you invoked Sin and Cos independently, which is what I'm getting locally.

Intel

BenchmarkDotNet v0.13.11, Windows 11 (10.0.26100.2605)
Intel Xeon Platinum 8370C CPU 2.80GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK 9.0.100
  [Host]   : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  .NET 8.0 : .NET 8.0.10 (8.0.1024.46610), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  .NET 9.0 : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
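| Method  | Job      | Runtime  | Mean     | Error     | StdDev    | Ratio | Code Size | Allocated | Alloc Ratio |
|-------- |--------- |--------- |---------:|----------:|----------:|------:|----------:|----------:|------------:|
| Compute | .NET 8.0 | .NET 8.0 | 1.397 ms | 0.0119 ms | 0.0117 ms |  1.00 |     314 B |    2.9 KB |        1.00 |
| Compute | .NET 9.0 | .NET 9.0 | 2.281 ms | 0.0068 ms | 0.0064 ms |  1.00 |     314 B |    2.9 KB |        1.00 |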

AMD

BenchmarkDotNet v0.13.11, Windows 11 (10.0.26100.2605)
AMD Ryzen 9 7950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK 9.0.200-preview.0.24575.35
  [Host]   : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  .NET 8.0 : .NET 8.0.11 (8.0.1124.51707), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  .NET 9.0 : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
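| Method  | Job      | Runtime  | Mean       | Error   | StdDev  | Ratio | Code Size | Allocated | Alloc Ratio |
|-------- |--------- |--------- |-----------:|--------:|--------:|------:|----------:|----------:|------------:|
| Compute | .NET 8.0 | .NET 8.0 |   714.2 us | 2.00 us | 1.56 us |  1.00 |     314 B |   2.92 KB |        1.00 |
| Compute | .NET 9.0 | .NET 9.0 | 1,314.8 us | 5.29 us | 4.69 us |  1.00 |     314 B |    2.9 KB |        1.00 |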
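
To make the comparison in this comment concrete, here is a minimal sketch that checks how closely `Math.SinCos` agrees with separate `Math.Sin` and `Math.Cos` calls. The class and method names are hypothetical and not taken from the issue or the runtime; only the `Math` calls are real APIs. It says nothing about timing, and it does not try to reproduce the large-input discrepancy described above.

```csharp
using System;

public static class SinCosComparison
{
    // Returns the largest absolute difference between the results of
    // Math.SinCos and those of separate Math.Sin / Math.Cos calls
    // over the given sample inputs.
    public static double MaxDifference(double[] inputs)
    {
        double max = 0.0;
        foreach (double u in inputs)
        {
            (double su, double cu) = Math.SinCos(u);
            max = Math.Max(max, Math.Abs(su - Math.Sin(u)));
            max = Math.Max(max, Math.Abs(cu - Math.Cos(u)));
        }
        return max;
    }
}
```

Calling `SinCosComparison.MaxDifference` with the inputs your own workload uses, on both runtimes, would show whether switching between the combined and the separate calls changes your results.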

@kingsznhone
Author

kingsznhone commented Jan 2, 2025


I notice that your platforms have AVX-512 support, but my laptop only has AVX2. Or it might be a bug exclusive to 12th/13th/14th gen Core CPUs, lol.

As you say, I think I should call Sin and Cos separately in the future, to avoid strange behaviour. Thanks.

@jeffhandley
Member

Assigned to @PranavSenthilnathan for triage.

@PranavSenthilnathan
Member

The 40x perf difference between .NET 9 and .NET 8 seems very high. When I run the benchmark on AMD I also see only a 2x slowdown like Tanner. Could you check that the only difference between the two benchmarks is the runtime version and nothing else?

If you've confirmed that is not the issue, can you try changing NET8_0 to NET8_0_OR_GREATER? If this works it means that the JIT is not generating the equivalent of _ = GetZero();.

Also, a small, single file console app repro of the issue (using stackalloc and SinCos) would be helpful to narrow down the cause.
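
For readers who want to try this, a repro along the lines requested might look roughly like the sketch below. It is an assumption about the shape of such a test, not the reporter's code: the class name, method names, buffer size, and input values are all hypothetical, and it may or may not trigger the slowdown on a given machine.

```csharp
using System;
using System.Diagnostics;

public static class SinCosRepro
{
    // Fills a stack-allocated buffer with sample angles, then accumulates
    // sin + cos of each one using Math.SinCos.
    public static double Accumulate(int count)
    {
        Span<double> angles = stackalloc double[count];
        for (int i = 0; i < angles.Length; i++)
        {
            angles[i] = i * 0.001;
        }

        double result = 0.0;
        for (int i = 0; i < angles.Length; i++)
        {
            (double su, double cu) = Math.SinCos(angles[i]);
            result += su + cu;
        }
        return result;
    }

    // Times `iterations` calls to Accumulate and returns elapsed milliseconds.
    public static double Measure(int iterations)
    {
        Accumulate(1024); // one untimed call so first-call JIT cost is excluded

        Stopwatch sw = Stopwatch.StartNew();
        double checksum = 0.0;
        for (int i = 0; i < iterations; i++)
        {
            checksum += Accumulate(1024);
        }
        sw.Stop();

        // Printing the checksum keeps the computed values observable.
        Console.WriteLine($"checksum {checksum}, {sw.Elapsed.TotalMilliseconds:F2} ms");
        return sw.Elapsed.TotalMilliseconds;
    }
}
```

Calling `SinCosRepro.Measure(10_000)` from a console app's entry point, once targeting `net8.0` and once targeting `net9.0`, would give a rough side-by-side number. A `Stopwatch` loop is much cruder than BenchmarkDotNet, so it is only useful for spotting a gap as large as the one reported here.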

@kingsznhone
Author


I can confirm that I was already using NET8_0_OR_GREATER in my first post 3 weeks ago.

I just tried a small change at https://github.com/kingsznhone/VSOP2013.NET/blob/c4620771f065c1c17fbf50929c71ad00087f083b/VSOP2013.NET/Calculator.cs#L20

from private static Vector128<float> GetZero() => Vector128<float>.Zero;
to private static Vector256<float> GetZero() => Vector256<float>.Zero;

With that change, everything is fine, at a 2x slowdown.


BenchmarkDotNet v0.13.11, Windows 11 (10.0.22631.4751/23H2/2023Update/SunValley3)
12th Gen Intel Core i9-12950HX, 1 CPU, 16 logical and 16 physical cores
.NET SDK 9.0.102
  [Host]   : .NET 8.0.12 (8.0.1224.60305), X64 RyuJIT AVX2 [AttachedDebugger]
  .NET 8.0 : .NET 8.0.12 (8.0.1224.60305), X64 RyuJIT AVX2
  .NET 9.0 : .NET 9.0.1 (9.0.124.61010), X64 RyuJIT AVX2


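| Method  | Job      | Runtime  | Mean       | Error    | StdDev   | Ratio | Allocated | Alloc Ratio |
|-------- |--------- |--------- |-----------:|---------:|---------:|------:|----------:|------------:|
| Compute | .NET 8.0 | .NET 8.0 |   735.5 μs | 11.35 μs | 10.06 μs |  1.00 |    2.9 KB |        1.00 |
| Compute | .NET 9.0 | .NET 9.0 | 1,447.2 μs | 22.87 μs | 17.85 μs |  1.00 |   2.77 KB |        1.00 |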

@PranavSenthilnathan
Member

So just to confirm - are these latest benchmark numbers with the following?

#if NET8_0_OR_GREATER 
_ = GetZero();
#endif

If you remove the GetZero call for .NET 9 (changing back to #if NET8_0 for example), what is the perf? I just want to double check that our change in .NET 9 fixed the initial perf regression you discovered in NET 8.

@kingsznhone
Author

With GetZero() fully removed:


BenchmarkDotNet v0.13.11, Windows 11 (10.0.22631.4751/23H2/2023Update/SunValley3)
12th Gen Intel Core i9-12950HX, 1 CPU, 16 logical and 16 physical cores
.NET SDK 9.0.102
  [Host]   : .NET 8.0.12 (8.0.1224.60305), X64 RyuJIT AVX2 [AttachedDebugger]
  .NET 8.0 : .NET 8.0.12 (8.0.1224.60305), X64 RyuJIT AVX2
  .NET 9.0 : .NET 9.0.1 (9.0.124.61010), X64 RyuJIT AVX2


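| Method  | Job      | Runtime  | Mean     | Error    | StdDev   | Ratio | Allocated | Alloc Ratio |
|-------- |--------- |--------- |---------:|---------:|---------:|------:|----------:|------------:|
| Compute | .NET 8.0 | .NET 8.0 | 11.09 ms | 0.146 ms | 0.137 ms |  1.00 |   2.91 KB |        1.00 |
| Compute | .NET 9.0 | .NET 9.0 | 28.12 ms | 0.474 ms | 0.443 ms |  1.00 |   2.91 KB |        1.00 |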

Right after adding the .NET 9 target:

[MethodImpl(MethodImplOptions.NoInlining)]
private static Vector128<float> GetZero() => Vector128<float>.Zero;

......
#if NET8_0
_ = GetZero();
#endif

BenchmarkDotNet v0.13.11, Windows 11 (10.0.22631.4751/23H2/2023Update/SunValley3)
12th Gen Intel Core i9-12950HX, 1 CPU, 16 logical and 16 physical cores
.NET SDK 9.0.102
  [Host]   : .NET 8.0.12 (8.0.1224.60305), X64 RyuJIT AVX2 [AttachedDebugger]
  .NET 8.0 : .NET 8.0.12 (8.0.1224.60305), X64 RyuJIT AVX2
  .NET 9.0 : .NET 9.0.1 (9.0.124.61010), X64 RyuJIT AVX2


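| Method  | Job      | Runtime  | Mean        | Error     | StdDev    | Ratio | Allocated | Alloc Ratio |
|-------- |--------- |--------- |------------:|----------:|----------:|------:|----------:|------------:|
| Compute | .NET 8.0 | .NET 8.0 |    701.1 μs |   3.12 μs |   2.77 μs |  1.00 |   2.89 KB |        1.00 |
| Compute | .NET 9.0 | .NET 9.0 | 28,392.2 μs | 416.80 μs | 389.87 μs |  1.00 |   2.91 KB |        1.00 |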

Same code as at the top of the issue:

[MethodImpl(MethodImplOptions.NoInlining)]
private static Vector128<float> GetZero() => Vector128<float>.Zero;

......
#if NET8_0_OR_GREATER
_ = GetZero();
#endif

BenchmarkDotNet v0.13.11, Windows 11 (10.0.22631.4751/23H2/2023Update/SunValley3)
12th Gen Intel Core i9-12950HX, 1 CPU, 16 logical and 16 physical cores
.NET SDK 9.0.102
  [Host]   : .NET 8.0.12 (8.0.1224.60305), X64 RyuJIT AVX2 [AttachedDebugger]
  .NET 8.0 : .NET 8.0.12 (8.0.1224.60305), X64 RyuJIT AVX2
  .NET 9.0 : .NET 9.0.1 (9.0.124.61010), X64 RyuJIT AVX2


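| Method  | Job      | Runtime  | Mean        | Error     | StdDev    | Ratio | Allocated | Alloc Ratio |
|-------- |--------- |--------- |------------:|----------:|----------:|------:|----------:|------------:|
| Compute | .NET 8.0 | .NET 8.0 |    696.7 μs |   3.89 μs |   3.63 μs |  1.00 |    2.9 KB |        1.00 |
| Compute | .NET 9.0 | .NET 9.0 | 28,114.4 μs | 170.22 μs | 159.23 μs |  1.00 |   2.91 KB |        1.00 |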

Latest modification:

[MethodImpl(MethodImplOptions.NoInlining)]
private static Vector256<float> GetZero() => Vector256<float>.Zero;

......
#if NET8_0_OR_GREATER
_ = GetZero();
#endif

BenchmarkDotNet v0.13.11, Windows 11 (10.0.22631.4751/23H2/2023Update/SunValley3)
12th Gen Intel Core i9-12950HX, 1 CPU, 16 logical and 16 physical cores
.NET SDK 9.0.102
  [Host]   : .NET 8.0.12 (8.0.1224.60305), X64 RyuJIT AVX2 [AttachedDebugger]
  .NET 8.0 : .NET 8.0.12 (8.0.1224.60305), X64 RyuJIT AVX2
  .NET 9.0 : .NET 9.0.1 (9.0.124.61010), X64 RyuJIT AVX2


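| Method  | Job      | Runtime  | Mean       | Error    | StdDev   | Ratio | Allocated | Alloc Ratio |
|-------- |--------- |--------- |-----------:|---------:|---------:|------:|----------:|------------:|
| Compute | .NET 8.0 | .NET 8.0 |   700.5 μs |  3.33 μs |  3.11 μs |  1.00 |   2.88 KB |        1.00 |
| Compute | .NET 9.0 | .NET 9.0 | 1,435.8 μs | 24.11 μs | 22.55 μs |  1.00 |    2.9 KB |        1.00 |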
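
Put together in one place, the `Vector256<float>` variant of the workaround measured above might look something like the following sketch. `GetZero` and the `#if` guard come from the snippets in this thread; the surrounding class, the `SumTerms` method, and its simplified loop body are hypothetical stand-ins for the real calculator code. The thread does not explain why the call has this effect, so treat it as an observed workaround rather than a documented fix.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

public static class SinCosWorkaround
{
    // NoInlining keeps this as a real call, as in the snippets above.
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static Vector256<float> GetZero() => Vector256<float>.Zero;

    // Hypothetical stand-in for the hot loop: sums sin(u) + cos(u).
    public static double SumTerms(ReadOnlySpan<double> inputs)
    {
#if NET8_0_OR_GREATER
        // Workaround call from the thread; the result is deliberately discarded.
        _ = GetZero();
#endif
        double result = 0.0;
        foreach (double u in inputs)
        {
            (double su, double cu) = Math.SinCos(u);
            result += su + cu;
        }
        return result;
    }
}
```

Whether the call helps appears to depend on the CPU and runtime version, going by the numbers in this thread, so it is worth benchmarking with and without it on the target hardware.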

Using Sin and Cos separately, without GetZero():

for (int n = 0; n < terms.Length; n++)
{
    double u = terms[n].aa + terms[n].bb * tj;
    double su = Math.Sin(u);
    double cu = Math.Cos(u);
    result += t[it] * (terms[n].ss * su + terms[n].cc * cu);
}

BenchmarkDotNet v0.13.11, Windows 11 (10.0.22631.4751/23H2/2023Update/SunValley3)
12th Gen Intel Core i9-12950HX, 1 CPU, 16 logical and 16 physical cores
.NET SDK 9.0.102
  [Host]   : .NET 8.0.12 (8.0.1224.60305), X64 RyuJIT AVX2 [AttachedDebugger]
  .NET 8.0 : .NET 8.0.12 (8.0.1224.60305), X64 RyuJIT AVX2
  .NET 9.0 : .NET 9.0.1 (9.0.124.61010), X64 RyuJIT AVX2


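| Method  | Job      | Runtime  | Mean     | Error     | StdDev    | Ratio | Allocated | Alloc Ratio |
|-------- |--------- |--------- |---------:|----------:|----------:|------:|----------:|------------:|
| Compute | .NET 8.0 | .NET 8.0 | 1.442 ms | 0.0121 ms | 0.0107 ms |  1.00 |    2.9 KB |        1.00 |
| Compute | .NET 9.0 | .NET 9.0 | 1.422 ms | 0.0265 ms | 0.0248 ms |  1.00 |    2.9 KB |        1.00 |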

@dotnet-policy-service dotnet-policy-service bot added the in-pr There is an active PR which will close this issue when it is merged label Jan 30, 2025
@PranavSenthilnathan PranavSenthilnathan removed the untriaged New issue has not been triaged by the area owner label Jan 30, 2025
@PranavSenthilnathan PranavSenthilnathan added this to the 10.0.0 milestone Jan 30, 2025