Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Small performance improvement for avx-512 on Skylake-SP #191

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

rdolbeau
Copy link
Contributor

This improves performance a bit on the Skylake-SP cores (Xeon Scalable), by replacing gather/scatter by slightly more efficient code: breaking down the instruction in 128 bits chunk for DP, and going for 64-bits scatter/gather (instead of 32) in SP. The original gather/scatter code is still available for DP, as it's probably faster on Knights Landing (KNL, Xeon Phi 72xx). The SP code should be a win on KNL as well.

Tested with make check/bigcheck, and for performance on synthetic code, could probably use some real-life testing for performance.

Copy link

@xiegengxin xiegengxin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the scale should be 8 when using Scatter/Gather to access 2 32-bit elements as one 64-bit.


return _mm512_i32gather_ps(index, x, 4);
return (V)_mm512_i32gather_pd(index, x, 4);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return (V)_mm512_i32gather_pd(index, x, 8);
Right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrote that code a while ago, but I think 4 is correct; the indices are still referring to the original datatype - single precision value of 4 bytes. The_pd variant is used only to access 64 bits at a time explicitly.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right.

/* pretend pair of single are a double */
const __m256i index = _mm256_set_epi32(7 * ovs, 6 * ovs, 5 * ovs, 4 * ovs, 3 * ovs, 2 * ovs, 1 * ovs, 0 * ovs);

_mm512_i32scatter_pd(x, index, (__m512d)v, 4);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_mm512_i32scatter_pd(x, index, (__m512d)v, 8);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here - ovs is a stride in 4-bytes elements, so the index vector is also in 4-bytes element.

@rdolbeau
Copy link
Contributor Author

rdolbeau commented Jun 21, 2022

I've updated one of the commit messages after (finally) testing the code on KNL.
The Skylake code won't run on KNL, but the KNL version (using scatter/gather) will run on Skylake with a small performance penalty. As KNL aren't exactly common, and the Skylake code is likely better for all later architectures as well, that's probably enough.

rdolbeau added 2 commits July 14, 2024 16:14
…assemble/disassemble the vector in 128 bits chunks.

This is faster on Skylake, but will not work on Knights Landing (as KNL lacks AVX512DQ), so I've added an --enable-avx512-scattergather option to retain the old behavior and enable compiling/using AVX512 on KNL.
This should help with FFTW#143.
This should improves slightly the performance by reducing the number of uops needed to do the gather/scatter.
@rdolbeau
Copy link
Contributor Author

@stevengj @matteo-frigo Can I merge this (old) one? I don't think KNL not liking the new code will be much of a problem by now, as I think most of the KNL-based systems have been retired (and there's a configure option to produce a KNL-friendly version anyway, as I still do own a KNL myself :-) )

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants