Help LLVM better vectorize reduction over skipmissing (#43859)
base: master
Conversation
This feels like something we should fix at the LLVM level.
Thanks. I can't comment on whether the compiler can be improved to handle this or not. However, I'm surprised by the dramatic improvement this makes to the benchmarks. Indeed, I had used this example in my blog post about …
Agree with this sentiment. @N5N3, what do the vectorizer remarks say? Why couldn't it vectorize the original version?
```julia
@simd for i = i:ilast
    @inbounds ai = A[i]
    if ai !== missing
        v = op(v, f(ai))
    end
end
```
On master, the vectorizer reports the following for `sum(skipmissing(Vector{Union{Missing,Float64}}(randn(4096))))`:
```
LV: Checking a loop in "julia_mapreduce_impl_2411" from simdloop.jl:75 @[ missing.jl:348 ]
LV: Loop hints: force=? width=0 interleave=0
LV: Found a loop: L137
LV: Found an induction variable.
LV: Not vectorizing: Found an unidentified PHI %value_phi30161 = phi double [ %117, %L137.lr.ph ], [ %value_phi33, %L137 ]
LV: Interleaving disabled by the pass manager
LV: Can't vectorize the instructions or CFG
LV: Not vectorizing: Cannot prove legality.
```
And the following for `sum(skipmissing(Vector{Union{Missing,Int}}(rand(-1:1,4096))))`:
```
LV: Checking a loop in "julia_mapreduce_impl_2505" from simdloop.jl:75 @[ missing.jl:348 ]
LV: Loop hints: force=? width=0 interleave=0
LV: Found a loop: L137
LV: Found an induction variable.
LV: We can vectorize this loop!
```
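A minimal C analogue (my own illustration, not the PR's code) of why the `Int` case fares better than the `Float64` case: integer addition is associative and safe to speculate, so LLVM can if-convert and vectorize a guarded integer accumulation, whereas reordering the additions of the equivalent `double` reduction is illegal without fast-math-style permissions.

```c
#include <stddef.h>

/* Guarded integer reduction: LLVM can if-convert this (blend with 0)
   because integer '+' is associative and neither traps nor rounds,
   matching the "We can vectorize this loop!" remark above. */
long sum_pos_int(const int *a, size_t n) {
    long r = 0;
    for (size_t i = 0; i < n; i++)
        if (a[i] > 0)
            r += a[i];
    return r;
}
```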
Weird that the `LV: Not vectorizing: Found an unidentified PHI %value_phi30161 = phi double [ %117, %L137.lr.ph ], [ %value_phi33, %L137 ]` remark is incomplete; when I run this there is only one `Found an unidentified PHI` message, and it is longer.
I tried the following:

```julia
function f(a)
    r = zero(eltype(a))
    @inbounds @simd for i in eachindex(a)
        if a[i] > 0
            r += a[i]
        end
    end
    r
end

function g(a)
    r = zero(eltype(a))
    @inbounds @simd for i in eachindex(a)
        r += a[i] > 0 ? a[i] : 0
    end
    r
end
```

`g(randn(4096))` is vectorized by LLVM while `f(randn(4096))` is not. I tried rewriting `f(a)` in C; the output of Clang shows some SIMD IR (although the SIMD width is only 2). Maybe related to #31862?
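For reference, a C rendering of `f`/`g` above (a sketch under the assumption of plain `double` arrays; the function names are mine). Whether each form vectorizes depends on what reassociation the compiler is allowed: Julia's `@simd` grants it, and in C the analogous license comes from flags like `-ffast-math`; Clang's `-Rpass-analysis=loop-vectorize` prints remarks like the ones quoted above.

```c
#include <stddef.h>

/* C mirror of the Julia f above: accumulate only the positive entries
   behind a branch. Vectorizing this FP reduction requires the compiler
   both to if-convert the guard and to reassociate the additions. */
double sum_pos_branch(const double *a, size_t n) {
    double r = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (a[i] > 0.0)
            r += a[i];
    }
    return r;
}

/* C mirror of g: the select makes every iteration execute the add, so
   only the reassociation question stands between this loop and SIMD. */
double sum_pos_select(const double *a, size_t n) {
    double r = 0.0;
    for (size_t i = 0; i < n; i++)
        r += a[i] > 0.0 ? a[i] : 0.0;
    return r;
}
```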
```julia
noop = _fast_noop(op, _return_type(f, Tuple{nonmissingtype(eltype(A))}), v)
if isnothing(noop)
```
Do both branches provide bit-identical results? Otherwise, relying on the inference result like this is not an optimization and it introduces unpredictable behavior.
Regarding bit-identity, I tested it locally and there seems to be no problem. (Edit: I mean a single operation, not the entire reduction; the result of `sum` may well differ after we vectorize this loop.) `_return_type(f, Tuple{nonmissingtype(eltype(A))})` was used to make sure `fskip(x) = ismissing(x) ? noop : f(x)` is type-stable. (In this sense, I agree that we'd better fix this at the LLVM level.)
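The parenthetical above is the usual floating-point caveat: a vectorized reduction reorders the additions into per-lane partial sums, and FP addition is not associative, so the reduced value can change in the last bits. A tiny self-contained illustration:

```c
/* Floating-point '+' is not associative: summing left-to-right and
   summing in a different grouping (as a vectorized reduction does)
   can round differently. */
int fp_sum_order_matters(void) {
    double seq = (0.1 + 0.2) + 0.3; /* 0.6000000000000001 */
    double re  = 0.1 + (0.2 + 0.3); /* 0.6 */
    return seq != re;               /* 1: the two orders disagree */
}
```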
The following also seems to work for the original purpose:

```julia
noop = _fast_noop(op, typeof(v), v)
@simd for i = i:ilast
    @inbounds ai = A[i]
    v = op(v, ismissing(ai) ? oftype(v, noop) : f(ai))
end
```
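A C sketch of the `noop` idea (my own modeling, not Julia's actual `Union` array layout: here `Union{Missing,Float64}` is represented as a value buffer plus a validity mask). Instead of branching to skip missing entries, the reduction blends in the operation's identity element, so the loop body is straight-line and reduction-friendly:

```c
#include <stddef.h>

/* Hypothetical model of a Union{Missing,Float64} vector: vals[i] is
   meaningful only where valid[i] != 0. The "noop" for + is 0.0, so a
   missing entry contributes the identity instead of being skipped
   with a branch -- the same trick as `oftype(v, noop)` above. */
double sum_skipmissing(const double *vals,
                       const unsigned char *valid, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += valid[i] ? vals[i] : 0.0;
    return s;
}
```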
But for inputs like `Union{Int,Int32,Missing}` there is about a 5%-10% speed regression; I'm not sure whether that's OK. It looks like LLVM can't handle a "conditional" reduction well in the floating-point case. A local benchmark shows: