Speed up extrema 3-50x #58280


Merged: 3 commits from mb/extremaly-pessimized merged into master on May 2, 2025

Conversation

@mbauman (Member) commented Apr 29, 2025

This method is no longer an optimization over what Julia can do with the naive definition on most (if not all) architectures.

As with #58267, I asked for a smattering of crowdsourced multi-architecture benchmarking of this simple example:

```julia
using BenchmarkTools
A = rand(10000);
b1 = @benchmark extrema($A)
b2 = @benchmark mapreduce(x->(x,x),((min1, max1), (min2, max2))->(min(min1, min2), max(max1, max2)), $A)
println("$(Sys.CPU_NAME): $(round(median(b1).time/median(b2).time, digits=1))x faster")
```

With results:

```txt
cortex-a72: 13.2x faster
cortex-a76: 15.8x faster
neoverse-n1: 16.4x faster
neoverse-v2: 23.4x faster
a64fx: 46.5x faster

apple-m1: 54.9x faster
apple-m4*: 43.7x faster

znver2: 8.6x faster
znver4: 12.8x faster
znver5: 16.7x faster

haswell (32-bit): 3.5x faster
skylake-avx512: 7.4x faster
rocketlake: 7.8x faster
alderlake: 5.2x faster
cascadelake: 8.8x faster
cascadelake: 7.1x faster
```
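The "naive definition" being benchmarked above can be wrapped as a small helper to make the structure clearer. This is an illustrative sketch (`naive_extrema` is a hypothetical name, not the actual Base fallback, which may be structured differently):

```julia
# Hypothetical helper wrapping the naive pairwise reduction benchmarked
# above; `naive_extrema` is an illustrative name, not a Base function.
naive_extrema(A) = mapreduce(
    x -> (x, x),  # lift each element to a (min, max) pair
    ((lo1, hi1), (lo2, hi2)) -> (min(lo1, lo2), max(hi1, hi2)),  # merge pairs
    A)

naive_extrema([3, 1, 4, 1, 5])  # (1, 5), matching extrema([3, 1, 4, 1, 5])
```

Because the reduction body is just `min`/`max` over a tuple of accumulators, LLVM can vectorize it directly, which is where the speedup over the old hand-written method comes from.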

The results are even more dramatic for Float32s, here on my M1:

```julia
julia> A = rand(Float32, 10000);

julia> @benchmark extrema($A)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  49.083 μs … 151.750 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     49.375 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   49.731 μs ±   2.350 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅██▅▁       ▁▂▂       ▁▂▁                                    ▂
  ██████▇▇▇▇█▇████▇▆▆▆▇▇███▇▇▆▆▆▅▆▅▃▄▃▄▅▄▄▆▆▅▃▁▄▃▅▄▅▄▄▁▄▄▅▃▄▁▄ █
  49.1 μs       Histogram: log(frequency) by time      56.8 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark mapreduce(x->(x,x),((min1, max1), (min2, max2))->(min(min1, min2), max(max1, max2)), $A)
BenchmarkTools.Trial: 10000 samples with 191 evaluations per sample.
 Range (min … max):  524.435 ns …  1.104 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     525.089 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   529.323 ns ± 20.876 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▃      ▁ ▃▃                                                 ▁
  █████▇███▇███▇▇▇▇▇▇▇▇▅▆▆▆▆▆▅▅▄▆▃▄▄▃▅▅▄▃▅▄▄▄▅▅▅▃▅▄▄▁▄▄▅▆▄▄▅▄▅ █
  524 ns        Histogram: log(frequency) by time       609 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
```

Closes #34790, closes #31442, closes #44606.

@mbauman added the labels "performance (Must go faster)" and "fold (sum, maximum, reduce, foldl, etc.)" on Apr 29, 2025
@giordano (Contributor)

Perhaps this and #58267 would make good benchmarks in https://github.com/JuliaCI/BaseBenchmarks.jl, to better track the impact of LLVM upgrades?

@giordano (Contributor)

Test failures look relevant.

@mbauman (Member, Author) commented Apr 30, 2025

Yeah, interesting. Looks like some platforms don't maintain a consistent argument ordering of NaNs. I'm not sure if that's

  • something that we actually need extrema to satisfy; maybe we can relax this test?
  • something that we actually guarantee that min and max must satisfy; maybe this is a separate bug?

Here's the MWE on a skylake-avx512 machine:

```julia
julia> f((min1, max1), (min2, max2)) = (min(min1, min2), max(max1, max2));

julia> x = -NaN, -NaN;

julia> y = NaN, NaN;

julia> signbit.(f(x, y))
(false, true)

julia> signbit(min(-NaN, NaN))
false

julia> signbit(max(-NaN, NaN))
true
```

The test asserts that the signs returned by f here are the same — that it either always returns the first arguments to min/max or the second arguments. This doesn't happen on v1.11, but does on beta2 and master.

@oscardssmith (Member)
In general, LLVM (and therefore we) make no guarantees about which NaN you will get.

@mbauman (Member, Author) commented Apr 30, 2025

OK, great, then we can just relax that test. I suspect it was written trying to allow for any ordering, but it missed the heterogeneous case.
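A relaxed version of that test could assert only that the results are NaNs and leave the sign bits unconstrained. A sketch (not necessarily the exact test change made in this PR):

```julia
# Sketch of a relaxed assertion (hypothetical; not the exact test change
# in this PR): require the reduction over mixed-sign NaNs to yield NaNs,
# without pinning down which argument's sign bit survives.
f((min1, max1), (min2, max2)) = (min(min1, min2), max(max1, max2))

lo, hi = f((-NaN, -NaN), (NaN, NaN))
@assert isnan(lo) && isnan(hi)  # sign bits deliberately unchecked
```

This accepts both the homogeneous orderings the original test anticipated and the heterogeneous one that appears on some platforms.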

@giordano (Contributor) left a review comment

I'm not familiar with the reduction business, but for what it's worth, deleting code and getting better performance looks like a clear win to me.

@oscardssmith merged commit bb7b6e7 into master on May 2, 2025
7 checks passed
@oscardssmith deleted the mb/extremaly-pessimized branch on May 2, 2025 at 12:20
@ViralBShah (Member) commented May 2, 2025

Should this be backported?

@oscardssmith (Member)

No. It's a performance improvement, not a bugfix.

@mbauman (Member, Author) commented May 2, 2025

This also isn't uniformly a win on Julia v1.11 and prior — probably because it's standing atop the same compiler change(s) that made #58267 fast (as Oscar notes, that's likely largely #56371).

charleskawczynski pushed a commit to charleskawczynski/julia that referenced this pull request May 12, 2025
Co-authored-by: Mosè Giordano <[email protected]>