sum(a) is now 30% slower than NumPy #30290
Comments
I (and the audience) noticed this last week when I used the notebook in a presentation. It looks like Intel has been working on this; I suspect the relevant change could be numpy/numpy#10251. |
This PR numpy/numpy#11113 removes all the intrinsics and says the compiler can do just as well on its own. So I'm not sure why we get less out of LLVM, then. |
Note that numpy/numpy#11113 does not seem to specifically benchmark `sum`. |
I've been seeing the same thing on a 2018 2.7 GHz Intel Core i7, so this seems to affect a pretty wide range of Intel CPUs. Makes the "variations on sum" / "Julia is fast" summation notebook a bit of a sad trumpet demo, since NumPy wins. |
Are the summed values equal? |
Yes. They agree to within floating-point roundoff, but they are not exactly equal as floats. |
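That's expected: the difference comes from summation order. A vectorized or multi-accumulator reduction associates the additions differently from a strict left-to-right loop, and floating-point addition is not associative. A tiny sketch (mine, not from the thread) showing that merely reordering a sum changes the last bits:

```julia
a = rand(10^6)

s1 = foldl(+, a)            # strict left-to-right accumulation
s2 = sum(a)                 # Julia's built-in sum uses pairwise summation
s3 = foldl(+, reverse(a))   # same numbers, opposite order

# Typically not bitwise equal, but all agree to within roundoff:
println((s1 == s2, s1 == s3))   # usually (false, false)
println((s1 ≈ s2, s1 ≈ s3))     # (true, true)
```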
Just an update. I don't think this is caused by the PR I linked to. I tried out older versions of Numpy and they are all (back to and including 1.11) as fast. |
FYI, if you are using |
I built the Numpy version from source with the default |
Can you see the code generated by Numpy? |
If you look in the notebook you can see |
I meant "do we know the instructions that Numpy generates to perform the summation and how those instructions (emitted assembler that is the summation loop) differ from what Julia via LLVM executes?" |
Would be interesting to see whether updating Julia's LLVM to the latest version would solve the issue. |
|
Sorry, ignore that — I misread the assembly. |
On my notebook (i7-6700HQ, not AVX-512), using conda-distributed NumPy `sum` does not give any performance benefit. I believe this is an AVX-512 optimization issue; perhaps conda NumPy distributions are compiled with icc. Looking at the output from gcc 8.2 vs icc 19.0 at the link below, icc uses 512-bit registers while gcc and clang do not. https://godbolt.org/z/we75gC |
|
If you use `-mprefer-vector-width=512` and `-ffast-math`, then clang 7.0 and gcc 8.2 will use zmm registers: https://godbolt.org/z/r8lQTH

gcc's code looks awkward, while clang does a rather extreme amount of unrolling (by a factor of 128 doubles!), followed by another body that unrolls by a factor of 32.

`ssh` is failing me right now, but I could benchmark on a skylake-x architecture in a couple of days when I get home.

Julia's `@simd` code tends to match Clang's. I would guess that this wins for long loops by a small margin (with icc close behind), while icc is much faster for small loops, and that gcc tends to be slower than both.
|
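For anyone who wants to compare the generated code on their own machine, here is a small sketch (my own, with a hypothetical helper function) of how to look at what Julia/LLVM actually emits for a reduction loop and check whether it uses xmm, ymm, or zmm registers:

```julia
# Hypothetical helper just for inspecting the emitted assembly; note that the
# real benchmark uses Base.sum, which is pairwise rather than a plain @simd loop.
function simdsum(a::Vector{Float64})
    s = 0.0
    @simd for i in eachindex(a)   # allows LLVM to vectorize and re-associate
        @inbounds s += a[i]
    end
    return s
end

a = rand(10^7)
@code_native simdsum(a)   # look for vaddpd on ymm (AVX2) vs zmm (AVX-512) registers
```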
Ah, I see. |
One idea might be to check whether this happens with numpy from Conda or numpy from pip. The one from Conda has Intel-specific code and uses VML. |
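If it helps, a quick sketch (not from the thread) of how to see which NumPy build PyCall is actually picking up from Julia; `show_config` reports the linked BLAS/LAPACK (MKL for the Conda build, usually OpenBLAS for pip wheels), which is only a rough proxy since `sum` itself doesn't go through BLAS:

```julia
using PyCall

np = pyimport("numpy")
println(np.__version__)   # which NumPy installation PyCall found
np.show_config()          # prints the build's BLAS/LAPACK configuration (MKL vs OpenBLAS)
```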
I just noticed this as well on Julia 1.5.3 (~25% faster NumPy). On julia#master (i.e. 1.7-dev) I find that NumPy is "only" ~12% faster. Still unfortunate that we are slower here (in particular for demonstration/selling purposes). |
I'm not sure there's much value in keeping this open: this is generally a symptom of LLVM not generating as well-optimized SIMD code for newer hardware as NumPy's hand-coded kernels. We generally catch up as soon as LLVM learns to do as well as the hand-written code, but there's always newer hardware. Not sure what is actionable. |
I love closing issues |
Hand-coded kernels of course add a lot of maintenance burden, and LLVM is normally way (months or years) ahead of OpenBLAS in supporting new architectures. Just stating the obvious reasons why I don't think hand-coded kernels are the best idea. |
I was updating my performance-optimization lecture notes from last year to Julia 1.0, which start with a comparison of C, Python, and Julia `sum` functions, and I noticed something odd: both the Julia `sum(::Vector{Float64})` function and the NumPy `sum` function are faster than last year (yay for compiler improvements?). Last year, Julia and NumPy `sum` had almost identical speed, but now the NumPy `sum` function is about 30% faster than Julia's.

I'm running a 2016 Intel Core i7, the same as last year. So apparently the NumPy `sum` function has gotten some new optimization that we don't have? (I did switch from Python 2 to Python 3; I'm using the Conda Python.) Some kind of missing SIMD optimization?

I'm not so concerned about `sum` per se, but this is a pretty basic function — if we are leaving 30% on the table here, then we might be missing performance opportunities in many other places too.
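A minimal sketch of how one might reproduce the comparison directly from Julia (assuming the BenchmarkTools and PyCall packages are installed and PyCall is pointed at the same Conda Python/NumPy); note that the PyCall round trip adds a little call overhead of its own:

```julia
using BenchmarkTools, PyCall

np  = pyimport("numpy")
a   = rand(10^7)
apy = PyObject(a)          # convert once so conversion cost isn't part of the timing

@btime sum($a)             # Julia's built-in (pairwise) sum
@btime $(np.sum)($apy)     # NumPy's sum on the same data
```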