Speed up extrema 3-50x #58280


Merged: 3 commits from mb/extremaly-pessimized merged into master on May 2, 2025

Conversation

@mbauman (Member) commented Apr 29, 2025

This method is no longer an optimization over what Julia can do with the naive definition on most (if not all) architectures.

As with #58267, I asked for a smattering of crowdsourced multi-architecture benchmarking of this simple example:

```julia
using BenchmarkTools
A = rand(10000);
b1 = @benchmark extrema($A)
b2 = @benchmark mapreduce(x->(x,x),((min1, max1), (min2, max2))->(min(min1, min2), max(max1, max2)), $A)
println("$(Sys.CPU_NAME): $(round(median(b1).time/median(b2).time, digits=1))x faster")
```

With results:

```txt
cortex-a72: 13.2x faster
cortex-a76: 15.8x faster
neoverse-n1: 16.4x faster
neoverse-v2: 23.4x faster
a64fx: 46.5x faster

apple-m1: 54.9x faster
apple-m4*: 43.7x faster

znver2: 8.6x faster
znver4: 12.8x faster
znver5: 16.7x faster

haswell (32-bit): 3.5x faster
skylake-avx512: 7.4x faster
rocketlake: 7.8x faster
alderlake: 5.2x faster
cascadelake: 8.8x faster
cascadelake: 7.1x faster
```
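The "naive definition" being benchmarked above can be wrapped as a small helper to make the structure clearer. This is an illustrative sketch (`naive_extrema` is a hypothetical name, not the actual Base fallback, which may be structured differently):

```julia
# Hypothetical helper wrapping the naive pairwise reduction benchmarked
# above; `naive_extrema` is an illustrative name, not a Base function.
naive_extrema(A) = mapreduce(
    x -> (x, x),  # lift each element to a (min, max) pair
    ((lo1, hi1), (lo2, hi2)) -> (min(lo1, lo2), max(hi1, hi2)),  # merge pairs
    A)

naive_extrema([3, 1, 4, 1, 5])  # (1, 5), matching extrema([3, 1, 4, 1, 5])
```

Because the reduction body is just `min`/`max` over a tuple of accumulators, LLVM can vectorize it directly, which is where the speedup over the old hand-written method comes from.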

The results are even more dramatic for Float32s, here on my M1:

```julia
julia> A = rand(Float32, 10000);

julia> @benchmark extrema($A)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  49.083 μs … 151.750 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     49.375 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   49.731 μs ±   2.350 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅██▅▁       ▁▂▂       ▁▂▁                                    ▂
  ██████▇▇▇▇█▇████▇▆▆▆▇▇███▇▇▆▆▆▅▆▅▃▄▃▄▅▄▄▆▆▅▃▁▄▃▅▄▅▄▄▁▄▄▅▃▄▁▄ █
  49.1 μs       Histogram: log(frequency) by time      56.8 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark mapreduce(x->(x,x),((min1, max1), (min2, max2))->(min(min1, min2), max(max1, max2)), $A)
BenchmarkTools.Trial: 10000 samples with 191 evaluations per sample.
 Range (min … max):  524.435 ns …  1.104 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     525.089 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   529.323 ns ± 20.876 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▃      ▁ ▃▃                                                 ▁
  █████▇███▇███▇▇▇▇▇▇▇▇▅▆▆▆▆▆▅▅▄▆▃▄▄▃▅▅▄▃▅▄▄▄▅▅▅▃▅▄▄▁▄▄▅▆▄▄▅▄▅ █
  524 ns        Histogram: log(frequency) by time       609 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
```

Closes #34790, closes #31442, closes #44606.

@mbauman added the labels "performance (Must go faster)" and "fold (sum, maximum, reduce, foldl, etc.)" on Apr 29, 2025
@giordano (Contributor)

Perhaps this and #58267 would make good benchmarks in https://github.com/JuliaCI/BaseBenchmarks.jl, to better track the impact of LLVM upgrades?

@giordano (Contributor)

Test failures look relevant.

@mbauman (Member, Author) commented Apr 30, 2025

Yeah, interesting. Looks like some platforms don't maintain a consistent argument ordering of NaNs. I'm not sure if that's

  • something that we actually need extrema to satisfy; maybe we can relax this test?
  • something that we actually guarantee that min and max must satisfy; maybe this is a separate bug?

Here's the MWE on a skylake-avx512 machine:

```julia
julia> f((min1, max1), (min2, max2)) = (min(min1, min2), max(max1, max2));

julia> x = -NaN, -NaN;

julia> y = NaN, NaN;

julia> signbit.(f(x, y))
(false, true)

julia> signbit(min(-NaN, NaN))
false

julia> signbit(max(-NaN, NaN))
true
```

The test asserts that the signs returned by f here are the same — that it either always returns the first arguments to min/max or the second arguments. This doesn't happen on v1.11, but does on beta2 and master.

@oscardssmith (Member)
In general, LLVM (and therefore we) make no guarantees about which NaN you will get.

@mbauman (Member, Author) commented Apr 30, 2025

OK, great, then we can just relax that test. I suspect it was written trying to allow for any ordering, but it missed the heterogeneous case.
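A relaxed version of that test could assert only that the results are NaNs and leave the sign bits unconstrained. A sketch (not necessarily the exact test change made in this PR):

```julia
# Sketch of a relaxed assertion (hypothetical; not the exact test change
# in this PR): require the reduction over mixed-sign NaNs to yield NaNs,
# without pinning down which argument's sign bit survives.
f((min1, max1), (min2, max2)) = (min(min1, min2), max(max1, max2))

lo, hi = f((-NaN, -NaN), (NaN, NaN))
@assert isnan(lo) && isnan(hi)  # sign bits deliberately unchecked
```

This accepts both the homogeneous orderings the original test anticipated and the heterogeneous one that appears on some platforms.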

@giordano (Contributor) left a review comment

I'm not familiar with the reduction business, but for what it's worth, deleting code and getting better performance looks like a clear win to me.

@oscardssmith merged commit bb7b6e7 into master on May 2, 2025
7 checks passed
@oscardssmith deleted the mb/extremaly-pessimized branch on May 2, 2025 at 12:20
@ViralBShah (Member) commented May 2, 2025

Should this be backported?

@oscardssmith (Member)

No. It's a performance improvement, not a bugfix.

@mbauman (Member, Author) commented May 2, 2025

This also isn't uniformly a win on Julia v1.11 and prior — probably because it's standing atop the same compiler change(s) that made #58267 fast (as Oscar notes, that's likely largely #56371).

charleskawczynski pushed a commit to charleskawczynski/julia that referenced this pull request May 12, 2025
Co-authored-by: Mosè Giordano <[email protected]>