
Help LLVM better vectorize reduction over skipmissing #43859


Open · N5N3 wants to merge 1 commit into master

Conversation

@N5N3 N5N3 (Member) commented Jan 19, 2022

Looks like LLVM can't handle "conditional" reductions well in the float cases.
A local benchmark shows:

          ("skipmissing", "sum", "Union{Missing, Float64}", 1) => TrialJudgement(-79.78% => improvement)
          ("skipmissing", "sum", "Union{Missing, Float32}", 1) => TrialJudgement(-84.44% => improvement)
          ("skipmissing", "sum", "Union{Missing, Int64}", 1) => TrialJudgement(-5.56% => improvement)
          ("skipmissing", "sum", "Union{Missing, ComplexF64}", 1) => TrialJudgement(-58.43% => improvement)
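
For context, here is a minimal way to reproduce this kind of measurement locally (a sketch, assuming BenchmarkTools.jl is installed; the vector length is illustrative and this is not necessarily the setup used for the numbers above):

using BenchmarkTools

x = Vector{Union{Missing, Float64}}(randn(4096))   # Union-typed storage, no actual missings
@btime sum(skipmissing($x))                        # the reduction this PR targets
@btime sum($x)                                     # plain sum over the same data, for reference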

@N5N3 N5N3 added the performance (Must go faster) and missing data (Base.missing and related functionality) labels on Jan 19, 2022
@N5N3 N5N3 requested a review from nalimilan January 19, 2022 08:44
@vtjnash (Member) commented Jan 19, 2022

This feels like something we should fix at the llvm level

@nalimilan (Member)

Thanks. I can't comment on whether the compiler can be improved to handle this or not.

However, I'm surprised by the dramatic improvement this makes to the benchmarks. Indeed, I had used this example in my blog post about missing to show how fast it was. Granted, I illustrated it using Int32, but @code_llvm shows that SIMD instructions are also used for Float64. Does the improvement in this PR mean that even more efficient SIMD code can be generated? That's definitely interesting.

@vchuravy (Member)

This feels like something we should fix at the llvm level

Agree with this sentiment. @N5N3, what do the vectorizer remarks say about why it couldn't vectorize the original version?

Comment on lines -348 to -351
@simd for i = i:ilast
    @inbounds ai = A[i]
    if ai !== missing
        v = op(v, f(ai))
@N5N3 (Member, Author)

On master, the loop vectorizer reports the following for sum(skipmissing(Vector{Union{Missing,Float64}}(randn(4096)))):

LV: Checking a loop in "julia_mapreduce_impl_2411" from simdloop.jl:75 @[ missing.jl:348 ]
LV: Loop hints: force=? width=0 interleave=0
LV: Found a loop: L137
LV: Found an induction variable.
LV: Not vectorizing: Found an unidentified PHI   %value_phi30161 = phi double [ %117, %L137.lr.ph ], [ %value_phi33, %L137 ]
LV: Interleaving disabled by the pass manager
LV: Can't vectorize the instructions or CFG
LV: Not vectorizing: Cannot prove legality.

And the following for sum(skipmissing(Vector{Union{Missing,Int}}(rand(-1:1,4096)))):

LV: Checking a loop in "julia_mapreduce_impl_2505" from simdloop.jl:75 @[ missing.jl:348 ]
LV: Loop hints: force=? width=0 interleave=0
LV: Found a loop: L137
LV: Found an induction variable.
LV: We can vectorize this loop!
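
As a side note, and an assumption about the setup rather than something stated in this thread: the LV: lines above are the loop vectorizer's debug stream, which requires an assertions-enabled LLVM build and can be surfaced from Julia by passing the flag through the JULIA_LLVM_ARGS environment variable, e.g.

JULIA_LLVM_ARGS="-debug-only=loop-vectorize" julia reduce_skipmissing.jl   # script name is illustrative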

Member

Weird that "LV: Not vectorizing: Found an unidentified PHI %value_phi30161 = phi double [ %117, %L137.lr.ph ], [ %value_phi33, %L137 ]" is incomplete: there is only one "Found an unidentified PHI" message in LLVM, and that one is longer.

@N5N3 N5N3 (Member, Author) commented Jan 22, 2022

I tried the following:

function f(a)
    r = zero(eltype(a))
    @inbounds @simd for i in eachindex(a)
        if a[i] > 0            # accumulation guarded by a branch
            r += a[i]
        end
    end
    r
end

function g(a)
    r = zero(eltype(a))
    @inbounds @simd for i in eachindex(a)
        r += a[i] > 0 ? a[i] : 0   # same reduction written as a select
    end
    r
end

g(randn(4096)) is vectorized by LLVM while f(randn(4096)) is not. I tried rewriting f(a) in C; the output of Clang shows some SIMD IR (although the SIMD width is only 2). Maybe related to #31862?
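
One way to check this locally is to inspect the emitted IR (a quick sketch, assuming the f and g definitions above; the reported vector width is machine-dependent):

using InteractiveUtils   # provides code_llvm

# Vectorized code shows up as vector types such as <4 x double> in the IR;
# on current master they appear for g but not for f.
code_llvm(stdout, g, Tuple{Vector{Float64}})
code_llvm(stdout, f, Tuple{Vector{Float64}})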

Comment on lines +358 to +359
noop = _fast_noop(op, _return_type(f, Tuple{nonmissingtype(eltype(A))}), v)
if isnothing(noop)
Member

Do both branches provide bit-identical results? Otherwise, relying on the inference result like this is not an optimization and it introduces unpredictable behavior.
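
For context on the reply below about the full sum changing: floating-point addition is not associative, so reordering the accumulation, which vectorization does, can change the low-order bits. A tiny illustration:

a, b, c = 0.1, 0.2, 0.3
(a + b) + c == a + (b + c)   # false: reassociating the sum changes the last bits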

@N5N3 N5N3 (Member, Author) commented Jan 21, 2022

For bit-identity, I tested it locally and there seems to be no problem. (Edit: I mean a single operation, not the entire reduction; the result of sum should differ once this loop is vectorized.)
_return_type(f, Tuple{nonmissingtype(eltype(A))}) was used to make sure fskip(x) = ismissing(x) ? noop : f(x) is type-stable. (In this sense, I agree that we'd better fix this at the LLVM level.)
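
For illustration, the idea behind the no-op value (a hedged sketch, not the PR's actual _fast_noop) is to pick a value that is neutral for the reduction operator, so that op(v, noop) never changes v and the branch can become a select:

# Hypothetical helper, illustrative only: a neutral value for `op`,
# or `nothing` when none is known and the branching loop must be kept.
_noop_sketch(::typeof(Base.add_sum), ::Type{T}) where {T<:Number} = zero(T)
_noop_sketch(::typeof(Base.mul_prod), ::Type{T}) where {T<:Number} = one(T)
_noop_sketch(op, ::Type) = nothing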

@N5N3 (Member, Author)

The following also seems to work for the original purpose:

noop = _fast_noop(op, typeof(v), v)
@simd for i = i:ilast
    @inbounds ai = A[i]
    v = op(v, ismissing(ai) ? oftype(v, noop) : f(ai))
end

But for inputs like Union{Int,Int32,Missing} there's about a 5%-10% speed regression. Not sure if that's OK.
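
For reference, a rough way to build that kind of heterogeneous input for timing (sizes and value ranges are illustrative, not the exact test mentioned above):

using BenchmarkTools

A = Vector{Union{Int,Int32,Missing}}(undef, 4096)
for i in eachindex(A)
    k = rand(1:3)
    A[i] = k == 1 ? missing : k == 2 ? Int32(rand(-1:1)) : rand(-1:1)
end
@btime sum(skipmissing($A))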

Labels: missing data (Base.missing and related functionality), performance (Must go faster)
5 participants