Faster extrema on abstract arrays when there is no functions or selected dims #34790


Closed
longemen3000 opened this issue Feb 17, 2020 · 5 comments · Fixed by #58280
Labels: arrays, performance (Must go faster)

Comments

@longemen3000
Contributor

longemen3000 commented Feb 17, 2020

For reference, see this Discourse thread:
https://discourse.julialang.org/t/perf-improvement-suggestion-for-simple-extrema/34776/2
The key is just writing a simple loop in the `extrema(x::AbstractArray)` case. The post uses standard indexing, but I think this can be changed?

@longemen3000 longemen3000 changed the title Faster extrema Faster extrema on abstract arrays when there is no functions or selected dims Feb 17, 2020
@longemen3000
Copy link
Contributor Author

longemen3000 commented Feb 17, 2020

A prototype:

function extrema(x::AbstractArray)
  a = b = first(x)
  @inbounds @simd for i in eachindex(x)
    if x[i] > b 
      b = x[i]
    elseif x[i] < a 
      a = x[i]
    end
  end
  a, b
end
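For anyone trying this locally, a minimal sanity check of the prototype against `Base.extrema` (renamed here to the hypothetical `extrema_loop_proto` so it doesn't shadow the Base method):

```julia
# Hypothetical rename of the prototype above, to avoid shadowing Base.extrema.
function extrema_loop_proto(x::AbstractArray)
    a = b = first(x)
    @inbounds @simd for i in eachindex(x)
        if x[i] > b
            b = x[i]
        elseif x[i] < a
            a = x[i]
        end
    end
    return a, b
end

x = [3, 1, 4, 1, 5, 9, 2, 6]
@assert extrema_loop_proto(x) == extrema(x) == (1, 9)
```

The `elseif` is safe here because `a <= b` holds throughout, so an element can never need to update both bounds.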

@tkf
Member

tkf commented Feb 18, 2020

Isn't it more or less a duplicate of #31442?

@longemen3000
Contributor Author

not technically, but it seems so

@brenhinkeller added the performance and arrays labels on Nov 20, 2022
@johnnychen94
Member

I've checked against the loop version on both an Intel consumer CPU and a server CPU, and the first direct conclusion is:
introducing extrema_loop as a specialization for extrema(A::AbstractArray) is a good change overall.

I'd be curious why extrema_loop isn't faster than the minmax version (extrema_mm) on the consumer CPU, but it doesn't make things worse 🤷‍♂

using BenchmarkTools

function extrema_mm(x)
    mn, mx = minimum(x), maximum(x)
    return (mn, mx)
end

function extrema_loop(x)
    mn, mx = x[1], x[1]
    @inbounds @simd for i in eachindex(x)
        v = x[i]
        mn = ifelse(v < mn, v, mn)
        mx = ifelse(v > mx, v, mx)
    end
    return (mn, mx)
end

A = rand(1_000_000)

@btime extrema(A)
@btime extrema_mm(A)
@btime extrema_loop(A)

CPU: Intel(R) Core(TM) i7-12700H

supported instruction set: Intel® SSE4.1, Intel® SSE4.2, Intel® AVX2

Julia Version 1.9.3
Commit bed2cd540a1 (2023-08-24 14:43 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 20 × 12th Gen Intel(R) Core(TM) i7-12700H
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, alderlake)
  Threads: 1 on 20 virtual cores
| version | function | time (ms) |
| --- | --- | --- |
| 1.9.3 | extrema | 0.971 |
| 1.9.3 | extrema_mm | 1.185 |
| 1.9.3 | extrema_loop | 0.848 |
| 1.10.9 | extrema | 2.364 |
| 1.10.9 | extrema_mm | 1.161 |
| 1.10.9 | extrema_loop | 2.067 |
| 1.11.4 | extrema | 2.421 |
| 1.11.4 | extrema_mm | 1.391 |
| 1.11.4 | extrema_loop | 2.082 |
| 1.12.0-beta1 | extrema | 2.401 |
| 1.12.0-beta1 | extrema_mm | 0.981 |
| 1.12.0-beta1 | extrema_loop | 0.862 |

CPU: Intel(R) Xeon(R) Gold 5318Y

supported instruction set: Intel® SSE4.2, Intel® AVX, Intel® AVX2, Intel® AVX-512

Julia Version 1.9.3
Commit bed2cd540a1 (2023-08-24 14:43 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 96 × Intel(R) Xeon(R) Gold 5318Y CPU @ 2.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, icelake-server)
  Threads: 1 on 96 virtual cores
| version | function | time (ms) |
| --- | --- | --- |
| 1.9.3 | extrema | 4.894 |
| 1.9.3 | extrema_mm | 1.562 |
| 1.9.3 | extrema_loop | 1.344 |
| 1.10.9 | extrema | 3.612 |
| 1.10.9 | extrema_mm | 1.861 |
| 1.10.9 | extrema_loop | 1.343 |
| 1.11.5 | extrema | 3.091 |
| 1.11.5 | extrema_mm | 1.477 |
| 1.11.5 | extrema_loop | 1.183 |
| 1.12.0-beta1 | extrema | 5.109 |
| 1.12.0-beta1 | extrema_mm | 1.660 |
| 1.12.0-beta1 | extrema_loop | 1.339 |

@barucden
Contributor

Note that the current extrema_loop is not equivalent to extrema for floats:

julia> function extrema_loop(x)
           mn, mx = x[1], x[1]
           @inbounds @simd for i in eachindex(x)
               v = x[i]
               mn = ifelse(v < mn, v, mn)
               mx = ifelse(v > mx, v, mx)
           end
           return (mn, mx)
       end
extrema_loop (generic function with 1 method)

julia> extrema([1., NaN])
(NaN, NaN)

julia> extrema_loop([1., NaN])
(1.0, 1.0)
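One way to keep a simple loop while matching `extrema`'s NaN behavior for floats is to use `min`/`max`, which propagate NaN in Julia, instead of raw `<`/`>` comparisons (likely at some cost to vectorization). A sketch, with `extrema_loop_nan` as a hypothetical name:

```julia
# NaN-propagating variant: Julia's min/max return NaN when either argument
# is NaN, matching Base.extrema's semantics for floating-point arrays.
function extrema_loop_nan(x::AbstractArray)
    mn = mx = first(x)
    @inbounds for i in eachindex(x)
        v = x[i]
        mn = min(mn, v)  # NaN in v poisons mn
        mx = max(mx, v)  # NaN in v poisons mx
    end
    return (mn, mx)
end

extrema_loop_nan([1.0, NaN])        # (NaN, NaN), like Base.extrema
extrema_loop_nan([3.0, -2.0, 7.0])  # (-2.0, 7.0)
```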

charleskawczynski pushed a commit to charleskawczynski/julia that referenced this issue May 12, 2025
This method is no longer an optimization over what Julia can do with the
naive definition on most (if not all) architectures.

Like JuliaLang#58267, I asked for a smattering of crowdsourced multi-architecture
benchmarking of this simple example:

```julia
using BenchmarkTools
A = rand(10000);
b1 = @benchmark extrema($A)
b2 = @benchmark mapreduce(x->(x,x),((min1, max1), (min2, max2))->(min(min1, min2), max(max1, max2)), $A)
println("$(Sys.CPU_NAME): $(round(median(b1).time/median(b2).time, digits=1))x faster")
```

With results:

```txt
cortex-a72: 13.2x faster
cortex-a76: 15.8x faster
neoverse-n1: 16.4x faster
neoverse-v2: 23.4x faster
a64fx: 46.5x faster

apple-m1: 54.9x faster
apple-m4*: 43.7x faster

znver2: 8.6x faster
znver4: 12.8x faster
znver5: 16.7x faster

haswell (32-bit): 3.5x faster
skylake-avx512: 7.4x faster
rocketlake: 7.8x faster
alderlake: 5.2x faster
cascadelake: 8.8x faster
cascadelake: 7.1x faster
```

The results are even more dramatic for Float32s, here on my M1:

```julia
julia> A = rand(Float32, 10000);

julia> @benchmark extrema($A)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  49.083 μs … 151.750 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     49.375 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   49.731 μs ±   2.350 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅██▅▁       ▁▂▂       ▁▂▁                                    ▂
  ██████▇▇▇▇█▇████▇▆▆▆▇▇███▇▇▆▆▆▅▆▅▃▄▃▄▅▄▄▆▆▅▃▁▄▃▅▄▅▄▄▁▄▄▅▃▄▁▄ █
  49.1 μs       Histogram: log(frequency) by time      56.8 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark mapreduce(x->(x,x),((min1, max1), (min2, max2))->(min(min1, min2), max(max1, max2)), $A)
BenchmarkTools.Trial: 10000 samples with 191 evaluations per sample.
 Range (min … max):  524.435 ns …  1.104 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     525.089 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   529.323 ns ± 20.876 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▃      ▁ ▃▃                                                 ▁
  █████▇███▇███▇▇▇▇▇▇▇▇▅▆▆▆▆▆▅▅▄▆▃▄▄▃▅▅▄▃▅▄▄▄▅▅▅▃▅▄▄▁▄▄▅▆▄▄▅▄▅ █
  524 ns        Histogram: log(frequency) by time       609 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
```

Closes JuliaLang#34790, closes JuliaLang#31442, closes JuliaLang#44606.

---------

Co-authored-by: Mosè Giordano <[email protected]>