
Help LLVM better vectorize reduction over skipmissing #43859


Open · N5N3 wants to merge 1 commit into master

Conversation

@N5N3 N5N3 (Member) commented Jan 19, 2022

Looks like LLVM can't handle "conditional" reductions well in the float cases.
A local benchmark shows:

          ("skipmissing", "sum", "Union{Missing, Float64}", 1) => TrialJudgement(-79.78% => improvement)
          ("skipmissing", "sum", "Union{Missing, Float32}", 1) => TrialJudgement(-84.44% => improvement)
          ("skipmissing", "sum", "Union{Missing, Int64}", 1) => TrialJudgement(-5.56% => improvement)
          ("skipmissing", "sum", "Union{Missing, ComplexF64}", 1) => TrialJudgement(-58.43% => improvement)
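
For context, here is a minimal way to reproduce this kind of measurement locally (a sketch, assuming BenchmarkTools.jl is installed; the vector length is illustrative and this is not necessarily the setup used for the numbers above):

using BenchmarkTools

x = Vector{Union{Missing, Float64}}(randn(4096))   # Union-typed storage, no actual missings
@btime sum(skipmissing($x))                        # the reduction this PR targets
@btime sum($x)                                     # plain sum over the same data, for reference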

@N5N3 N5N3 added the performance (Must go faster) and missing data (Base.missing and related functionality) labels on Jan 19, 2022
@N5N3 N5N3 requested a review from nalimilan January 19, 2022 08:44
@vtjnash (Member) commented Jan 19, 2022

This feels like something we should fix at the llvm level

@nalimilan (Member)

Thanks. I can't comment on whether the compiler can be improved to handle this or not.

However, I'm surprised by the dramatic improvement this makes to the benchmarks. Indeed, I had used this example in my blog post about missing to show how fast it was. Granted, I illustrated it using Int32, but @code_llvm shows that SIMD instructions are also used for Float64. Does the improvement in this PR mean that even more efficient SIMD code can be generated? That's definitely interesting.

@vchuravy (Member)

This feels like something we should fix at the llvm level

Agree with this sentiment. @N5N3, what do the vectorizer remarks say about why it couldn't vectorize the original version?

Comment on lines -348 to -351
@simd for i = i:ilast
    @inbounds ai = A[i]
    if ai !== missing
        v = op(v, f(ai))
@N5N3 (Member, Author)

On master, the loop vectorizer reports the following for sum(skipmissing(Vector{Union{Missing,Float64}}(randn(4096)))):

LV: Checking a loop in "julia_mapreduce_impl_2411" from simdloop.jl:75 @[ missing.jl:348 ]
LV: Loop hints: force=? width=0 interleave=0
LV: Found a loop: L137
LV: Found an induction variable.
LV: Not vectorizing: Found an unidentified PHI   %value_phi30161 = phi double [ %117, %L137.lr.ph ], [ %value_phi33, %L137 ]
LV: Interleaving disabled by the pass manager
LV: Can't vectorize the instructions or CFG
LV: Not vectorizing: Cannot prove legality.

And the following for sum(skipmissing(Vector{Union{Missing,Int}}(rand(-1:1,4096)))):

LV: Checking a loop in "julia_mapreduce_impl_2505" from simdloop.jl:75 @[ missing.jl:348 ]
LV: Loop hints: force=? width=0 interleave=0
LV: Found a loop: L137
LV: Found an induction variable.
LV: We can vectorize this loop!
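
As a side note, and an assumption about the setup rather than something stated in this thread: the LV: lines above are the loop vectorizer's debug stream, which requires an assertions-enabled LLVM build and can be surfaced from Julia by passing the flag through the JULIA_LLVM_ARGS environment variable, e.g.

JULIA_LLVM_ARGS="-debug-only=loop-vectorize" julia reduce_skipmissing.jl   # script name is illustrative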

Member

Weird that "LV: Not vectorizing: Found an unidentified PHI %value_phi30161 = phi double [ %117, %L137.lr.ph ], [ %value_phi33, %L137 ]" is incomplete: there is only one "Found an unidentified PHI" message in LLVM, and that one is longer.

@N5N3 N5N3 (Member, Author) commented Jan 22, 2022

I tried the following:

function f(a)
    r = zero(eltype(a))
    @inbounds @simd for i in eachindex(a)
        if a[i] > 0            # accumulation guarded by a branch
            r += a[i]
        end
    end
    r
end

function g(a)
    r = zero(eltype(a))
    @inbounds @simd for i in eachindex(a)
        r += a[i] > 0 ? a[i] : 0   # same reduction written as a select
    end
    r
end

g(randn(4096)) is vectorized by LLVM while f(randn(4096)) is not. I tried rewriting f(a) in C; the output of Clang shows some SIMD IR (although the SIMD width is only 2). Maybe related to #31862?
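
One way to check this locally is to inspect the emitted IR (a quick sketch, assuming the f and g definitions above; the reported vector width is machine-dependent):

using InteractiveUtils   # provides code_llvm

# Vectorized code shows up as vector types such as <4 x double> in the IR;
# on current master they appear for g but not for f.
code_llvm(stdout, g, Tuple{Vector{Float64}})
code_llvm(stdout, f, Tuple{Vector{Float64}})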

Comment on lines +358 to +359
noop = _fast_noop(op, _return_type(f, Tuple{nonmissingtype(eltype(A))}), v)
if isnothing(noop)
Member

Do both branches provide bit-identical results? Otherwise, relying on the inference result like this is not an optimization and it introduces unpredictable behavior.
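
For context on the reply below about the full sum changing: floating-point addition is not associative, so reordering the accumulation, which vectorization does, can change the low-order bits. A tiny illustration:

a, b, c = 0.1, 0.2, 0.3
(a + b) + c == a + (b + c)   # false: reassociating the sum changes the last bits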

@N5N3 N5N3 (Member, Author) commented Jan 21, 2022

For bit-identity, I tested it locally and there seems to be no problem. (Edit: I mean a single operation, not the entire reduction; the result of sum should differ once this loop is vectorized.)
_return_type(f, Tuple{nonmissingtype(eltype(A))}) was used to make sure fskip(x) = ismissing(x) ? noop : f(x) is type-stable. (In this sense, I agree that we'd better fix this at the LLVM level.)
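
For illustration, the idea behind the no-op value (a hedged sketch, not the PR's actual _fast_noop) is to pick a value that is neutral for the reduction operator, so that op(v, noop) never changes v and the branch can become a select:

# Hypothetical helper, illustrative only: a neutral value for `op`,
# or `nothing` when none is known and the branching loop must be kept.
_noop_sketch(::typeof(Base.add_sum), ::Type{T}) where {T<:Number} = zero(T)
_noop_sketch(::typeof(Base.mul_prod), ::Type{T}) where {T<:Number} = one(T)
_noop_sketch(op, ::Type) = nothing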

@N5N3 (Member, Author)

The following also seems to work for the original purpose:

noop = _fast_noop(op, typeof(v), v)
@simd for i = i:ilast
    @inbounds ai = A[i]
    v = op(v, ismissing(ai) ? oftype(v, noop) : f(ai))
end

But for inputs like Union{Int,Int32,Missing} there's about a 5%-10% speed regression. Not sure if that's OK.
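
For reference, a rough way to build that kind of heterogeneous input for timing (sizes and value ranges are illustrative, not the exact test mentioned above):

using BenchmarkTools

A = Vector{Union{Int,Int32,Missing}}(undef, 4096)
for i in eachindex(A)
    k = rand(1:3)
    A[i] = k == 1 ? missing : k == 2 ? Int32(rand(-1:1)) : rand(-1:1)
end
@btime sum(skipmissing($A))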

Labels: missing data (Base.missing and related functionality), performance (Must go faster)
5 participants