extrema could be faster #44606

Closed
LilithHafner opened this issue Mar 13, 2022 · 3 comments · Fixed by #58280
Labels
performance Must go faster

Comments

@LilithHafner (Member)

This implementation seems to be slightly faster than extrema:

```julia
function _extrema(v)
    mn = mx = first(v)
    @inbounds for i in firstindex(v)+1:lastindex(v)
        vi = v[i]
        vi < mn && (mn = vi)
        mx < vi && (mx = vi)
    end
    mn, mx
end
```

```julia
using BenchmarkTools  # provides @belapsed

@belapsed extrema(x) setup=(x=rand(Int, 10_000))
# 2.475222222222222e-6
@belapsed _extrema(x) setup=(x=rand(Int, 10_000))
# 2.0984e-6

@belapsed extrema(x) setup=(x=rand(Int, 10))
# 1.018937875751503e-8
@belapsed _extrema(x) setup=(x=rand(Int, 10))
# 7.78878878878879e-9
```

Originally posted by @LilithHafner in #44230 (comment)

@oscardssmith added the performance (Must go faster) label Mar 13, 2022
@Moelf (Contributor) commented Mar 14, 2022

@N5N3 (Member) commented Mar 14, 2022:

#31442 was mainly for the float case, though (LLVM already vectorizes the integer extrema well).
Some of the performance difference comes from the default `pairwise_blocksize(f, op) = 1024`, which could be `typemax(Int)` for minimum/maximum/extrema, since the result would be the same.
The remaining difference is puzzling. If I follow the `mapreduce_impl` style:

```julia
function __extrema(v)
    mn = mx = first(v)
    @inbounds if length(v) > 1
        mn = min(mn, v[2])
        mx = max(mx, v[2])
        for i in firstindex(v)+2:lastindex(v)
            vi = v[i]
            vi < mn && (mn = vi)
            mx < vi && (mx = vi)
        end
    end
    mn, mx
end
```

Then we have

```julia
julia> x = rand(Int, 4096);

julia> @btime _extrema($x);
  607.910 ns (0 allocations: 0 bytes)

julia> @btime __extrema($x);
  682.237 ns (0 allocations: 0 bytes)

julia> @btime extrema($x);
  724.627 ns (0 allocations: 0 bytes)

julia> Base.pairwise_blocksize(::Base.ExtremaMap, ::typeof(Base._extrema_rf)) = typemax(Int)

julia> @btime extrema($x);
  688.000 ns (0 allocations: 0 bytes) # quite close to `__extrema`!
```
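Since `min` and `max` are associative and commutative, the association order of the reduction cannot change the answer, which is why raising the block size is safe for these reductions. A minimal sketch of that equivalence (not Base's actual pairwise implementation):

```julia
# min/max reductions give the same result for any blocking of the
# input, so pairwise_blocksize only affects performance, not answers.
v = randn(10_000)
blocks = Iterators.partition(v, 1024)     # reduce each block...
mn = minimum(map(minimum, blocks))        # ...then combine the
mx = maximum(map(maximum, blocks))        # per-block results
@assert (mn, mx) == extrema(v) == (minimum(v), maximum(v))
```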

@ViralBShah (Member)

Related: #34790

@LilithHafner mentioned this issue Mar 16, 2022
charleskawczynski pushed a commit to charleskawczynski/julia that referenced this issue May 12, 2025
This method is no longer an optimization over what Julia can do with the
naive definition on most (if not all) architectures.

Like JuliaLang#58267, I asked for a smattering of crowdsourced multi-architecture
benchmarking of this simple example:

```julia
using BenchmarkTools
A = rand(10000);
b1 = @benchmark extrema($A)
b2 = @benchmark mapreduce(x->(x,x),((min1, max1), (min2, max2))->(min(min1, min2), max(max1, max2)), $A)
println("$(Sys.CPU_NAME): $(round(median(b1).time/median(b2).time, digits=1))x faster")
```

With results:

```txt
cortex-a72: 13.2x faster
cortex-a76: 15.8x faster
neoverse-n1: 16.4x faster
neoverse-v2: 23.4x faster
a64fx: 46.5x faster

apple-m1: 54.9x faster
apple-m4*: 43.7x faster

znver2: 8.6x faster
znver4: 12.8x faster
znver5: 16.7x faster

haswell (32-bit): 3.5x faster
skylake-avx512: 7.4x faster
rocketlake: 7.8x faster
alderlake: 5.2x faster
cascadelake: 8.8x faster
cascadelake: 7.1x faster
```

The results are even more dramatic for Float32s, here on my M1:

```julia
julia> A = rand(Float32, 10000);

julia> @benchmark extrema($A)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  49.083 μs … 151.750 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     49.375 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   49.731 μs ±   2.350 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅██▅▁       ▁▂▂       ▁▂▁                                    ▂
  ██████▇▇▇▇█▇████▇▆▆▆▇▇███▇▇▆▆▆▅▆▅▃▄▃▄▅▄▄▆▆▅▃▁▄▃▅▄▅▄▄▁▄▄▅▃▄▁▄ █
  49.1 μs       Histogram: log(frequency) by time      56.8 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark mapreduce(x->(x,x),((min1, max1), (min2, max2))->(min(min1, min2), max(max1, max2)), $A)
BenchmarkTools.Trial: 10000 samples with 191 evaluations per sample.
 Range (min … max):  524.435 ns …  1.104 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     525.089 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   529.323 ns ± 20.876 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▃      ▁ ▃▃                                                 ▁
  █████▇███▇███▇▇▇▇▇▇▇▇▅▆▆▆▆▆▅▅▄▆▃▄▄▃▅▅▄▃▅▄▄▄▅▅▅▃▅▄▄▁▄▄▅▆▄▄▅▄▅ █
  524 ns        Histogram: log(frequency) by time       609 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
```

Closes JuliaLang#34790, closes JuliaLang#31442, closes JuliaLang#44606.

---------

Co-authored-by: Mosè Giordano <[email protected]>