-
Notifications
You must be signed in to change notification settings - Fork 667
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Small performance improvement for avx-512 on Skylake-SP #191
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the scale should be 8 when using Scatter/Gather to access 2 32-bit elements as one 64-bit.
|
||
return _mm512_i32gather_ps(index, x, 4); | ||
return (V)_mm512_i32gather_pd(index, x, 4); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return (V)_mm512_i32gather_pd(index, x, 8);
Right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wrote that code a while ago, but I think 4 is correct; the indices are still referring to the original datatype - single precision value of 4 bytes. The_pd
variant is used only to access 64 bits at a time explicitly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right.
/* pretend pair of single are a double */ | ||
const __m256i index = _mm256_set_epi32(7 * ovs, 6 * ovs, 5 * ovs, 4 * ovs, 3 * ovs, 2 * ovs, 1 * ovs, 0 * ovs); | ||
|
||
_mm512_i32scatter_pd(x, index, (__m512d)v, 4); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_mm512_i32scatter_pd(x, index, (__m512d)v, 8);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here - ovs
is a stride in 4-bytes elements, so the index
vector is also in 4-bytes element.
I've updated one of the commit messages after (finally) testing the code on KNL. |
…assemble/disassemble the vector in 128 bits chunks. This is faster on Skylake, but will not work on Knights Landing (as KNL lacks AVX512DQ), so I've added an --enable-avx512-scattergather option to retain the old behavior and enable compiling/using AVX512 on KNL. This should help with FFTW#143.
This should improves slightly the performance by reducing the number of uops needed to do the gather/scatter.
@stevengj @matteo-frigo Can I merge this (old) one? I don't think KNL not liking the new code will be much of a problem by now, as I think most of the KNL-based systems have been retired (and there's a configure option to produce a KNL-friendly version anyway, as I still do own a KNL myself :-) ) |
This improves performance a bit on the Skylake-SP cores (Xeon Scalable), by replacing gather/scatter by slightly more efficient code: breaking down the instruction in 128 bits chunk for DP, and going for 64-bits scatter/gather (instead of 32) in SP. The original gather/scatter code is still available for DP, as it's probably faster on Knights Landing (KNL, Xeon Phi 72xx). The SP code should be a win on KNL as well.
Tested with make check/bigcheck, and for performance on synthetic code, could probably use some real-life testing for performance.