Faster extrema on abstract arrays when there is no functions or selected dims #34790
A prototype:
function extrema(x::AbstractArray)
    a = b = first(x)
    @inbounds @simd for i in eachindex(x)
        if x[i] > b
            b = x[i]
        elseif x[i] < a
            a = x[i]
        end
    end
    a, b
end
Isn't it more or less a dup of #31442?
Not technically, but it seems so.
I've checked against the loop version on both Intel's consumer CPU and server CPU, and the first direct conclusion is: I'd be curious on why
using BenchmarkTools
function extrema_mm(x)
    mn, mx = minimum(x), maximum(x)
    return (mn, mx)
end
function extrema_loop(x)
    mn, mx = x[1], x[1]
    @inbounds @simd for i in eachindex(x)
        v = x[i]
        mn = ifelse(v < mn, v, mn)
        mx = ifelse(v > mx, v, mx)
    end
    return (mn, mx)
end
A = rand(1_000_000)
@btime extrema(A)
@btime extrema_mm(A)
@btime extrema_loop(A)
CPU: Intel(R) Core(TM) i7-12700H
Supported instruction set: Intel® SSE4.1, Intel® SSE4.2, Intel® AVX2
Julia Version 1.9.3
Commit bed2cd540a1 (2023-08-24 14:43 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 20 × 12th Gen Intel(R) Core(TM) i7-12700H
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, alderlake)
Threads: 1 on 20 virtual cores
CPU: Intel(R) Xeon(R) Gold 5318Y
Supported instruction set: Intel® SSE4.2, Intel® AVX, Intel® AVX2, Intel® AVX-512
Julia Version 1.9.3
Commit bed2cd540a1 (2023-08-24 14:43 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 96 × Intel(R) Xeon(R) Gold 5318Y CPU @ 2.10GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, icelake-server)
Threads: 1 on 96 virtual cores
Note that the current `extrema` handles `NaN` differently from `extrema_loop`:
julia> function extrema_loop(x)
           mn, mx = x[1], x[1]
           @inbounds @simd for i in eachindex(x)
               v = x[i]
               mn = ifelse(v < mn, v, mn)
               mx = ifelse(v > mx, v, mx)
           end
           return (mn, mx)
       end
extrema_loop (generic function with 1 method)
julia> extrema([1., NaN])
(NaN, NaN)
julia> extrema_loop([1., NaN])
(1.0, 1.0)
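The difference comes from the comparisons: `v < mn` and `v > mx` are both false when `v` is `NaN`, so the `ifelse` loop silently skips it. A minimal sketch of a loop that keeps the `NaN`-propagating behaviour, assuming it is acceptable to use `min` and `max` instead of raw comparisons (the name `extrema_nanaware` is hypothetical and not part of Base or of this thread):

```julia
# Hypothetical helper, for illustration only.
# Base's `min` and `max` return NaN if either floating-point argument is NaN,
# so the result matches what `extrema` returns for inputs containing NaN.
function extrema_nanaware(x::AbstractArray)
    isempty(x) && throw(ArgumentError("collection must be non-empty"))
    mn = mx = first(x)
    for v in x
        mn = min(mn, v)
        mx = max(mx, v)
    end
    return (mn, mx)
end
```

Whether this vectorizes as well as the `ifelse` version depends on the element type and the target CPU, so it would need to be benchmarked before drawing any performance conclusion.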
This method is no longer an optimization over what Julia can do with the naive definition on most (if not all) architectures.

Like JuliaLang#58267, I asked for a smattering of crowdsourced multi-architecture benchmarking of this simple example:

```julia
using BenchmarkTools
A = rand(10000);
b1 = @benchmark extrema($A)
b2 = @benchmark mapreduce(x->(x,x),((min1, max1), (min2, max2))->(min(min1, min2), max(max1, max2)), $A)
println("$(Sys.CPU_NAME): $(round(median(b1).time/median(b2).time, digits=1))x faster")
```

With results:

```txt
cortex-a72: 13.2x faster
cortex-a76: 15.8x faster
neoverse-n1: 16.4x faster
neoverse-v2: 23.4x faster
a64fx: 46.5x faster
apple-m1: 54.9x faster
apple-m4*: 43.7x faster
znver2: 8.6x faster
znver4: 12.8x faster
znver5: 16.7x faster
haswell (32-bit): 3.5x faster
skylake-avx512: 7.4x faster
rocketlake: 7.8x faster
alderlake: 5.2x faster
cascadelake: 8.8x faster
cascadelake: 7.1x faster
```

The results are even more dramatic for Float32s, here on my M1:

```julia
julia> A = rand(Float32, 10000);

julia> @benchmark extrema($A)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max): 49.083 μs … 151.750 μs ┊ GC (min … max): 0.00% … 0.00%
 Time (median): 49.375 μs ┊ GC (median): 0.00%
 Time (mean ± σ): 49.731 μs ± 2.350 μs ┊ GC (mean ± σ): 0.00% ± 0.00%

 ▅██▅▁ ▁▂▂ ▁▂▁ ▂
 ██████▇▇▇▇█▇████▇▆▆▆▇▇███▇▇▆▆▆▅▆▅▃▄▃▄▅▄▄▆▆▅▃▁▄▃▅▄▅▄▄▁▄▄▅▃▄▁▄ █
 49.1 μs Histogram: log(frequency) by time 56.8 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark mapreduce(x->(x,x),((min1, max1), (min2, max2))->(min(min1, min2), max(max1, max2)), $A)
BenchmarkTools.Trial: 10000 samples with 191 evaluations per sample.
 Range (min … max): 524.435 ns … 1.104 μs ┊ GC (min … max): 0.00% … 0.00%
 Time (median): 525.089 ns ┊ GC (median): 0.00%
 Time (mean ± σ): 529.323 ns ± 20.876 ns ┊ GC (mean ± σ): 0.00% ± 0.00%

 █▃ ▁ ▃▃ ▁
 █████▇███▇███▇▇▇▇▇▇▇▇▅▆▆▆▆▆▅▅▄▆▃▄▄▃▅▅▄▃▅▄▄▄▅▅▅▃▅▄▄▁▄▄▅▆▄▄▅▄▅ █
 524 ns Histogram: log(frequency) by time 609 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
```

Closes JuliaLang#34790, closes JuliaLang#31442, closes JuliaLang#44606.

---------

Co-authored-by: Mosè Giordano <[email protected]>
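For readers who want to try the "naive definition" from the commit message outside a benchmark, here is a small sketch that wraps the same `mapreduce` call in a function and checks it against `extrema`. The wrapper name `extrema_naive` is an assumption made for illustration; it is not the name used in Base or in the commit.

```julia
# Hypothetical wrapper around the mapreduce expression quoted in the commit message.
# Each element x is mapped to the pair (x, x); pairs are then combined by taking
# the min of the first components and the max of the second components.
extrema_naive(A) = mapreduce(
    x -> (x, x),
    ((min1, max1), (min2, max2)) -> (min(min1, min2), max(max1, max2)),
    A,
)

A = rand(10000)
println(extrema_naive(A) == extrema(A))
```

Because the reduction is built on `min` and `max`, it also returns `(NaN, NaN)` for `[1.0, NaN]`, in line with the `extrema` result shown earlier in the thread.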
For reference, see this discourse thread:
https://discourse.julialang.org/t/perf-improvement-suggestion-for-simple-extrema/34776/2
The key is just writing a simple loop in the case of `extrema(x::AbstractArray)`. The post uses standard indexing, but I think this can be changed?