-
-
Notifications
You must be signed in to change notification settings - Fork 5.6k
extrema
could be faster
#44606
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Labels
performance
Must go faster
Comments
#31442 was mainly for float case though. (LLVM vectorlize Integer's extrema well). function __extrema(v)
mn = mx = first(v)
@inbounds if length(v) > 1
mn = min(mn, v[2])
mx = max(mx, v[2])
for i in firstindex(v)+2:lastindex(v)
vi = v[i]
vi < mn && (mn = vi)
mx < vi && (mx = vi)
end
end
mn, mx
end Then we have julia> x = rand(Int, 4096);
julia> @btime _extrema($x);
607.910 ns (0 allocations: 0 bytes)
julia> @btime __extrema($x);
682.237 ns (0 allocations: 0 bytes)
julia> @btime extrema($x);
724.627 ns (0 allocations: 0 bytes)
julia> Base.pairwise_blocksize(::Base.ExtremaMap, ::typeof(Base._extrema_rf)) = typemax(Int)
julia> @btime extrema($x);
688.000 ns (0 allocations: 0 bytes) # quite close to `__extrema` !!! |
Related: #34790 |
charleskawczynski
pushed a commit
to charleskawczynski/julia
that referenced
this issue
May 12, 2025
This method is no longer an optimization over what Julia can do with the naive definition on most (if not all) architectures. Like JuliaLang#58267, I asked for a smattering of crowdsourced multi-architecture benchmarking of this simple example: ```julia using BenchmarkTools A = rand(10000); b1 = @benchmark extrema($A) b2 = @benchmark mapreduce(x->(x,x),((min1, max1), (min2, max2))->(min(min1, min2), max(max1, max2)), $A) println("$(Sys.CPU_NAME): $(round(median(b1).time/median(b2).time, digits=1))x faster") ``` With results: ```txt cortex-a72: 13.2x faster cortex-a76: 15.8x faster neoverse-n1: 16.4x faster neoverse-v2: 23.4x faster a64fx: 46.5x faster apple-m1: 54.9x faster apple-m4*: 43.7x faster znver2: 8.6x faster znver4: 12.8x faster znver5: 16.7x faster haswell (32-bit): 3.5x faster skylake-avx512: 7.4x faster rocketlake: 7.8x faster alderlake: 5.2x faster cascadelake: 8.8x faster cascadelake: 7.1x faster ``` The results are even more dramatic for Float32s, here on my M1: ```julia julia> A = rand(Float32, 10000); julia> @benchmark extrema($A) BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample. Range (min … max): 49.083 μs … 151.750 μs ┊ GC (min … max): 0.00% … 0.00% Time (median): 49.375 μs ┊ GC (median): 0.00% Time (mean ± σ): 49.731 μs ± 2.350 μs ┊ GC (mean ± σ): 0.00% ± 0.00% ▅██▅▁ ▁▂▂ ▁▂▁ ▂ ██████▇▇▇▇█▇████▇▆▆▆▇▇███▇▇▆▆▆▅▆▅▃▄▃▄▅▄▄▆▆▅▃▁▄▃▅▄▅▄▄▁▄▄▅▃▄▁▄ █ 49.1 μs Histogram: log(frequency) by time 56.8 μs < Memory estimate: 0 bytes, allocs estimate: 0. julia> @benchmark mapreduce(x->(x,x),((min1, max1), (min2, max2))->(min(min1, min2), max(max1, max2)), $A) BenchmarkTools.Trial: 10000 samples with 191 evaluations per sample. Range (min … max): 524.435 ns … 1.104 μs ┊ GC (min … max): 0.00% … 0.00% Time (median): 525.089 ns ┊ GC (median): 0.00% Time (mean ± σ): 529.323 ns ± 20.876 ns ┊ GC (mean ± σ): 0.00% ± 0.00% █▃ ▁ ▃▃ ▁ █████▇███▇███▇▇▇▇▇▇▇▇▅▆▆▆▆▆▅▅▄▆▃▄▄▃▅▅▄▃▅▄▄▄▅▅▅▃▅▄▄▁▄▄▅▆▄▄▅▄▅ █ 524 ns Histogram: log(frequency) by time 609 ns < Memory estimate: 0 bytes, allocs estimate: 0. ``` Closes JuliaLang#34790, closes JuliaLang#31442, closes JuliaLang#44606. --------- Co-authored-by: Mosè Giordano <[email protected]>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
This implementation seems to be slightly faster than
extrema
:Originally posted by @LilithHafner in #44230 (comment)
The text was updated successfully, but these errors were encountered: