Small performance improvement for avx-512 on Skylake-SP #191

rdolbeau · 2020-03-11T12:47:03Z

This improves performance a bit on Skylake-SP cores (Xeon Scalable) by replacing gather/scatter with slightly more efficient code: the DP path breaks the access down into 128-bit chunks, and the SP path uses 64-bit scatter/gather (instead of 32-bit). The original gather/scatter code is still available for DP, as it's probably faster on Knights Landing (KNL, Xeon Phi 72xx). The SP code should be a win on KNL as well.

Tested with make check/bigcheck, and benchmarked on synthetic code; the performance side could probably use some real-life testing.
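For readers following along, here is a minimal sketch of what "128-bit chunks for DP" could look like. It is not the code from this PR: the helper name `load_strided_pairs_dp` and its signature are made up for illustration, and it assumes the stride `ivs` is counted in doubles. The intrinsic path needs AVX512F and AVX512DQ at compile time; otherwise a portable fallback with the same result is used.

```c
#include <stddef.h>
#include <string.h>
#if defined(__AVX512F__) && defined(__AVX512DQ__)
#include <immintrin.h>
#endif

/* Hypothetical helper (not the PR's code): load four pairs of doubles
 * located at x, x + ivs, x + 2*ivs and x + 3*ivs (ivs counted in doubles)
 * into out[0..7], one pair per 128-bit lane. */
void load_strided_pairs_dp(const double *x, ptrdiff_t ivs, double out[8])
{
#if defined(__AVX512F__) && defined(__AVX512DQ__)
    /* Assemble the 512-bit vector from four unaligned 128-bit loads.
     * _mm512_insertf64x2 is an AVX512DQ instruction. */
    __m512d v = _mm512_castpd128_pd512(_mm_loadu_pd(x));
    v = _mm512_insertf64x2(v, _mm_loadu_pd(x + ivs), 1);
    v = _mm512_insertf64x2(v, _mm_loadu_pd(x + 2 * ivs), 2);
    v = _mm512_insertf64x2(v, _mm_loadu_pd(x + 3 * ivs), 3);
    _mm512_storeu_pd(out, v);
#else
    /* Portable fallback: copy each 16-byte pair individually. */
    for (int k = 0; k < 4; ++k)
        memcpy(out + 2 * k, x + k * ivs, 2 * sizeof(double));
#endif
}
```

With `ivs = 6` and `x[i] = i`, the lanes would hold the pairs (0, 1), (6, 7), (12, 13) and (18, 19).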

xiegengxin

I think the scale should be 8 when using scatter/gather to access two 32-bit elements as one 64-bit element.

xiegengxin · 2021-07-07T05:54:18Z

simd-support/simd-avx512.h


-  return _mm512_i32gather_ps(index, x, 4);
+  return (V)_mm512_i32gather_pd(index, x, 4);


return (V)_mm512_i32gather_pd(index, x, 8);
Right?

I wrote that code a while ago, but I think 4 is correct: the indices still refer to the original datatype, a single-precision value of 4 bytes. The _pd variant is used only to access 64 bits at a time explicitly.
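To make the addressing argument concrete, here is a scalar model of how a 64-bit gather computes its addresses: lane k reads 8 bytes at `base + index[k] * scale`. The function `model_gather64` is a hypothetical name used only for illustration, and the sketch assumes `sizeof(float) == 4`.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of the addressing of a 64-bit gather such as
 * _mm512_i32gather_pd: lane k reads 8 bytes at base + index[k] * scale.
 * Each 8-byte lane is returned as two consecutive floats
 * (assumes sizeof(float) == 4). Hypothetical helper, not FFTW code. */
void model_gather64(const void *base, const int32_t index[8], int scale,
                    float out[16])
{
    const unsigned char *p = (const unsigned char *)base;
    for (int k = 0; k < 8; ++k)
        memcpy(out + 2 * k, p + (ptrdiff_t)index[k] * scale, 8);
}
```

With indices `k * ivs` expressed in floats, scale 4 makes lane k read the floats at offsets `k * ivs` and `k * ivs + 1`, which is the intended pair. Scale 8 would make lane k start at float offset `2 * k * ivs`, i.e. it would double the stride.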

You're right.

xiegengxin · 2021-07-07T05:54:40Z

simd-support/simd-avx512.h

+  /* pretend pair of single are a double */
+  const __m256i index = _mm256_set_epi32(7 * ovs, 6 * ovs, 5 * ovs, 4 * ovs, 3 * ovs, 2 * ovs, 1 * ovs, 0 * ovs);
+
+  _mm512_i32scatter_pd(x, index, (__m512d)v, 4);


_mm512_i32scatter_pd(x, index, (__m512d)v, 8);

Same here: ovs is a stride in 4-byte elements, so the index vector is also in 4-byte elements.
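The store side can be modelled the same way. The sketch below is illustrative only: `make_index` and `model_scatter64` are hypothetical names, and it assumes `sizeof(float) == 4`. Note that `_mm256_set_epi32` takes its arguments from the highest lane down to the lowest, so in the diff above lane k ends up holding `k * ovs`.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Lane k of the index vector holds k * ovs, with ovs counted in floats.
 * Hypothetical helper mirroring the _mm256_set_epi32 call in the diff. */
void make_index(int32_t ovs, int32_t index[8])
{
    for (int k = 0; k < 8; ++k)
        index[k] = k * ovs;
}

/* Scalar model of the addressing of a 64-bit scatter such as
 * _mm512_i32scatter_pd: lane k writes 8 bytes (two floats, assuming
 * sizeof(float) == 4) at base + index[k] * scale. */
void model_scatter64(void *base, const int32_t index[8], int scale,
                     const float v[16])
{
    unsigned char *p = (unsigned char *)base;
    for (int k = 0; k < 8; ++k)
        memcpy(p + (ptrdiff_t)index[k] * scale, v + 2 * k, 8);
}
```

With scale 4, pair k lands at float offsets `k * ovs` and `k * ovs + 1`, and the elements in between are left untouched.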

rdolbeau · 2022-06-21T14:09:55Z

I've updated one of the commit messages after (finally) testing the code on KNL.
The Skylake code won't run on KNL, but the KNL version (using scatter/gather) will run on Skylake with a small performance penalty. As KNL systems aren't exactly common, and the Skylake code is likely better for all later architectures as well, that's probably enough.

…assemble/disassemble the vector in 128-bit chunks. This is faster on Skylake, but will not work on Knights Landing (as KNL lacks AVX512DQ), so I've added an --enable-avx512-scattergather option to retain the old behavior and enable compiling/using AVX512 on KNL. This should help with FFTW#143.

This should slightly improve performance by reducing the number of uops needed to do the gather/scatter.
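If it is unclear whether a given machine can run the 128-bit-chunk path, a small check along these lines may help when deciding whether to configure with --enable-avx512-scattergather. This is only a sketch: the PR selects the path at configure time, not at run time, and `cpu_has_avx512dq` is a hypothetical helper that relies on the GCC/Clang `__builtin_cpu_supports` builtin on x86.

```c
#include <stdbool.h>

/* Hypothetical helper: report whether the running CPU advertises AVX512DQ.
 * Uses the GCC/Clang x86 builtins; on other compilers or targets it
 * conservatively reports false. */
bool cpu_has_avx512dq(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx512dq") != 0;
#else
    return false; /* unknown compiler or non-x86 target */
#endif
}
```

A result of false on an AVX-512 machine would be expected on KNL, in which case the scatter/gather variant is the one to build.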

rdolbeau · 2024-07-14T14:23:08Z

@stevengj @matteo-frigo Can I merge this (old) one? I don't think KNL not liking the new code will be much of a problem by now, as I think most of the KNL-based systems have been retired (and there's a configure option to produce a KNL-friendly version anyway, as I still do own a KNL myself :-) )


xiegengxin suggested changes Jul 7, 2021


rdolbeau force-pushed the avx512skx branch from 0cd86d0 to bf68241 Compare June 21, 2022 14:08

rdolbeau added 2 commits July 14, 2024 16:14

In AVX-512 LDu/STu, handle pair of single as a double.

c6cd1cd

This should slightly improve performance by reducing the number of uops needed to do the gather/scatter.

rdolbeau force-pushed the avx512skx branch from bf68241 to c6cd1cd Compare July 14, 2024 14:14

new AVX512 code requires AVX512DQ, add the compiler flags in configur…

fd08d11

…e.ac

rdolbeau added the enhancement label Jul 14, 2024
