-
Notifications
You must be signed in to change notification settings - Fork 3.4k
Add support for vectorized null suppression for block serde #26919
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Reviewer's GuideIntroduce dynamic SIMD feature detection and vectorized null suppression in block serialization using Java Vector API, refactor block encodings to leverage the new API, adjust benchmarks/tests for SIMD support, and verify correctness with new unit tests. Class diagram for new and updated SIMD support classesclassDiagram
class SimdSupport {
<<interface>>
+boolean supportByteGeneric()
+boolean supportShortGeneric()
+boolean supportIntegerGeneric()
+boolean supportLongGeneric()
+boolean supportByteCompress()
+boolean supportShortCompress()
+boolean supportIntegerCompress()
+boolean supportLongCompress()
+static SimdSupport NONE
}
class IntelSimdSupport {
+IntelSimdSupport(OSType)
+boolean supportByteGeneric()
+boolean supportShortGeneric()
+boolean supportIntegerGeneric()
+boolean supportLongGeneric()
+boolean supportByteCompress()
+boolean supportShortCompress()
+boolean supportIntegerCompress()
+boolean supportLongCompress()
}
class AmdSimdSupport {
+AmdSimdSupport(OSType)
+boolean supportByteGeneric()
+boolean supportShortGeneric()
+boolean supportIntegerGeneric()
+boolean supportLongGeneric()
+boolean supportByteCompress()
+boolean supportShortCompress()
+boolean supportIntegerCompress()
+boolean supportLongCompress()
}
class GravitonSimdSupport {
+GravitonSimdSupport(OSType)
}
SimdSupport <|.. IntelSimdSupport
SimdSupport <|.. AmdSimdSupport
SimdSupport <|.. GravitonSimdSupport
class SimdSupportManager {
+static void initialize()
+static SimdSupport get()
+static boolean isInitialized()
}
class SimdUtils {
+static boolean isLinuxGraviton()
+static Optional<String> linuxCpuVendorId()
+static Set<String> readCpuFlags(OSType)
+static String normalizeFlag(String)
}
class SimdInitializer {
+SimdInitializer()
+SimdSupport simdSupport()
}
SimdSupportManager --> SimdSupport
SimdInitializer --> SimdSupportManager
IntelSimdSupport --> OSType
AmdSimdSupport --> OSType
GravitonSimdSupport --> OSType
SimdSupportManager --> TargetArch
SimdSupportManager --> OSType
SimdUtils --> OSType
Class diagram for updated EncoderUtil and block encoding classesclassDiagram
class EncoderUtil {
+static void setSimdSupport(SimdSupport)
+static void compressBytesWithNulls(SliceOutput, byte[], boolean[], int, int)
+static void compressShortsWithNulls(SliceOutput, short[], boolean[], int, int)
+static void compressIntsWithNulls(SliceOutput, int[], boolean[], int, int)
+static void compressLongsWithNulls(SliceOutput, long[], boolean[], int, int)
-static void compressBytesWithNullsVectorized(...)
-static void compressBytesWithNullsScalar(...)
-static void compressShortsWithNullsVectorized(...)
-static void compressShortsWithNullsScalar(...)
-static void compressIntsWithNullsVectorized(...)
-static void compressIntsWithNullsScalar(...)
-static void compressLongsWithNullsVectorized(...)
-static void compressLongsWithNullsScalar(...)
+static SimdSupport simd
}
class ByteArrayBlockEncoding {
+void writeBlock(...)
}
class ShortArrayBlockEncoding {
+void writeBlock(...)
}
class IntArrayBlockEncoding {
+void writeBlock(...)
}
class LongArrayBlockEncoding {
+void writeBlock(...)
}
EncoderUtil <.. ByteArrayBlockEncoding : uses
EncoderUtil <.. ShortArrayBlockEncoding : uses
EncoderUtil <.. IntArrayBlockEncoding : uses
EncoderUtil <.. LongArrayBlockEncoding : uses
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
3cf2de1 to
5da70b5
Compare
| * treat as Graviton (covers Graviton2/3 where the model may be Neoverse N1/V1/V2). | ||
| */ | ||
| public static boolean isLinuxGraviton() | ||
| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of a custom detection logic we should use first what https://github.com/oshi/oshi supports and only then fallback to custom parsing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want to add SIMD detection logic in SPI, I am less sure if we want to use some third party library since SPI should have very mimimal dependency, and the detection logic is rather simple that we are able to maintain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First of all, I don't think it should live in the SPI in the first place. These kind of optimizations in 99% use cases will live in the trino-main module which depends already on oshi. Having your own implementation, rather than relying on the external one - has its cost that we don't want to have.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only reason why this is needed in the SPI is the fact that BlockEncoding implementation live there but tbh concrete implementations shouldn't be a part of the trino-spi. BlockEncoding's can be a part of the plugin but built-in ones should be just moved to the trino-main.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree we indeed need to think about the place where such logic should be placed, we need to guarantee it is accessible from lib(the case for vectorizedDecoding for parquet reader), SPI(tehe code for BlockEncoding, though it is detabale since we may move that to trino-main), trino-main, plugin etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey there - I've reviewed your changes and they look great!
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location> `core/trino-spi/src/main/java/io/trino/spi/block/EncoderUtil.java:45` </location>
<code_context>
+
private EncoderUtil() {}
+ public static void setSimdSupport(SimdSupport simdSupport)
+ {
+ simd = requireNonNull(simdSupport, "simdSupport is null");
</code_context>
<issue_to_address>
**issue (bug_risk):** setSimdSupport is public and mutable, which may allow unexpected reconfiguration at runtime.
Consider restricting the visibility of setSimdSupport or ensuring it cannot be called multiple times to prevent inconsistent state.
</issue_to_address>
### Comment 2
<location> `core/trino-spi/src/test/java/io/trino/spi/block/TestEncoderUtil.java:100-107` </location>
<code_context>
+ }
+ }
+
+ public static boolean[][] getIsNullArray(int length)
+ {
+ return new boolean[][] {
</code_context>
<issue_to_address>
**suggestion (testing):** Suggestion: Add more edge cases for isNull patterns.
Include cases with a single null at the start or end, and consecutive nulls in the middle, to improve test coverage.
```suggestion
public static boolean[][] getIsNullArray(int length)
{
return new boolean[][] {
all(false, length),
all(true, length),
alternating(length),
randomBools(length),
singleNullAtStart(length),
singleNullAtEnd(length),
consecutiveNullsInMiddle(length)};
}
private static boolean[] singleNullAtStart(int length)
{
boolean[] arr = new boolean[length];
if (length > 0) {
arr[0] = true;
}
return arr;
}
private static boolean[] singleNullAtEnd(int length)
{
boolean[] arr = new boolean[length];
if (length > 0) {
arr[length - 1] = true;
}
return arr;
}
private static boolean[] consecutiveNullsInMiddle(int length)
{
boolean[] arr = new boolean[length];
if (length >= 4) {
arr[length / 2 - 1] = true;
arr[length / 2] = true;
}
return arr;
}
```
</issue_to_address>
### Comment 3
<location> `core/trino-spi/src/test/java/io/trino/spi/block/TestEncoderUtil.java:49-50` </location>
<code_context>
+ }
+ }
+
+ @AfterAll
+ public static void resetSimd()
+ {
+ EncoderUtil.setSimdSupport(SimdSupport.NONE);
</code_context>
<issue_to_address>
**nitpick (testing):** Nitpick: Consider resetting SimdSupport before each test for isolation.
Resetting SimdSupport before each test, such as with @BeforeEach, will prevent state leakage between tests and improve reliability.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
core/trino-spi/src/main/java/io/trino/spi/block/EncoderUtil.java
Outdated
Show resolved
Hide resolved
core/trino-spi/src/test/java/io/trino/spi/block/TestEncoderUtil.java
Outdated
Show resolved
Hide resolved
|
Discussed this offline, from here we want to:
|
e657e5a to
c5eaac9
Compare
|
Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: EC2 Default User.
|
c5eaac9 to
5719c20
Compare
core/trino-spi/src/main/java/io/trino/spi/block/EncoderUtil.java
Outdated
Show resolved
Hide resolved
core/trino-spi/src/main/java/io/trino/spi/block/EncoderUtil.java
Outdated
Show resolved
Hide resolved
core/trino-spi/src/main/java/io/trino/spi/block/EncoderUtil.java
Outdated
Show resolved
Hide resolved
cc9407a to
5422d53
Compare
core/trino-spi/src/test/java/io/trino/spi/block/TestEncoderUtil.java
Outdated
Show resolved
Hide resolved
core/trino-spi/src/test/java/io/trino/spi/block/TestEncoderUtil.java
Outdated
Show resolved
Hide resolved
| addBlockEncoding(new ShortArrayBlockEncoding()); | ||
| addBlockEncoding(new IntArrayBlockEncoding()); | ||
| addBlockEncoding(new LongArrayBlockEncoding()); | ||
| addBlockEncoding(new ByteArrayBlockEncoding(false)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's prefer the vectorized implementation by default, even though the testing hardware may not support the specific instruction sets we're interested in. We'll rely on the JVM fallback implementations under the assumption that those should match the semantics of vectorization (even though they'll perform worse) and give us better coverage of the relevant codepaths. (also add an inline comment in the code to note this assumption).
core/trino-main/src/main/java/io/trino/simd/SimdSupportManager.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/simd/SimdSupportManager.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/simd/SimdSupportManager.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/simd/SimdSupportManager.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/simd/SimdSupportManager.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/simd/SimdSupportManager.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/simd/SimdSupportManager.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/simd/SimdSupportManager.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/simd/SimdSupportManager.java
Outdated
Show resolved
Hide resolved
3c4ad5f to
62fb687
Compare
| ProcessorIdentifier id = MachineInfo.getProcessorInfo(); | ||
|
|
||
| String vendor = id.getVendor().toLowerCase(ENGLISH); | ||
|
|
||
| if (vendor.contains("intel") || vendor.contains("amd")) { | ||
| return detectX86SimdSupport(); | ||
| } | ||
|
|
||
| return SimdSupport.NONE; | ||
| } | ||
|
|
||
| private static SimdSupport detectX86SimdSupport() | ||
| { | ||
| enum X86Isa { | ||
| avx512f, | ||
| avx512vbmi2 | ||
| } | ||
|
|
||
| Set<String> flags = readCpuFlags(); | ||
| EnumSet<X86Isa> x86Flags = EnumSet.noneOf(X86Isa.class); | ||
|
|
||
| if (!flags.isEmpty()) { | ||
| for (X86Isa isa : X86Isa.values()) { | ||
| if (flags.contains(isa.name())) { | ||
| x86Flags.add(isa); | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd just use the same technique we used in io.trino.parquet.reader.ColumnReaderFactory#isVectorizedDecodingSupported. My memory is that from testeing they found that ARM Neon was slower, so basincally we just disable vector instructions unless PREFERRED_BIT_WIDTH >= 256.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed offline, for this purpose we need specifically to detect AVX512F (for VPCOMPRESSD / VPCOMPRESSQ instruction support for int and long types) and AVX512VBMI2 (for VPCOMPRESSB / VPCOMPRESSW instruction support over byte and short types).
- AVX512F is fairly widely supported starting in Intel Xeon Skylake CPUs and Zen 4 CPUs
- AVX512 VBMI2 support starts in Intel Icelake and AMD Zen 4 CPUs
The real thing we would like to check here is whether Vector<T>#compress(VectorMask<T>) is supported natively in hardware or emulated by the JVM- because the emulated support is so much slower than the simple scalar code that exists, but since we don't have the ability to detect that directly from the JDK vector API we have to assume that native support exists whenever the CPU advertises it
| <old>method long[] io.trino.spi.PageSorter::sort(java.util.List<io.trino.spi.type.Type>, java.util.List<io.trino.spi.Page>, java.util.List<java.lang.Integer>, java.util.List<io.trino.spi.connector.SortOrder>, int)</old> | ||
| <new>method java.util.Iterator<io.trino.spi.Page> io.trino.spi.PageSorter::sort(java.util.List<io.trino.spi.type.Type>, java.util.List<io.trino.spi.Page>, java.util.List<java.lang.Integer>, java.util.List<io.trino.spi.connector.SortOrder>, int)</new> | ||
| </item> | ||
| <item> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally, I'd like to do this change without messing with the SPI. My experience witht he vector stuff is that it either works or doesn't, so maybe we can just have a global kill switch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's more or less what you're looking at here. Detection for what the hardware supports has to happen from within trino-main and then get passed as a boolean parameter to the block encoding constructors. Hardware detection is combined with a kill switch to disable the feature if needed- but the block encoder constructors need to change either way to accept the result of config + hardware detection.
core/trino-main/src/main/java/io/trino/simd/BlockEncodingSimdSupport.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/simd/BlockEncodingSimdSupport.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/simd/BlockEncodingSimdSupport.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/test/java/io/trino/server/remotetask/TestHttpRemoteTask.java
Outdated
Show resolved
Hide resolved
62fb687 to
8fa4cfe
Compare
8fa4cfe to
d6474d8
Compare
|
Changes and approach overall look good to me at this point, with the following items gating final approval / merge:
|
Description
This PR add support for vectorized null suppression for block serde using Java SIMD API.
Functionalities added
detect SIMD support from the CPU.
This functionality is essential to prevent regression from happening, even though Java vector API is platform-agonitic, it only provides guarantess for correcntess, but no guarantee for performance improbement. If the JVM is running on a older CPU without decent SIMD support, the Java vector API may fall back to emulated execution instead of real SIMD execution. So we want a extra layer gating make sure that such fall back would not happen.
Currently, we add support for Intel and AMD CPUs, we may extend to support Graviton later if experiments can show speed up on Graviton machines.
add vectorized path for null suppression in block serde
Add vectorized path for null suppression for byte/short/integer/long.
Microbenchmark results are given below.
Microbenchmark on Intel CPU with avx512F support.
Microbenchmark on AMD zen4 CPU with avx512F support.
The reason that the speed up is not the potential maximum speed up(16x for Int, 8x for Long for AVX512) is
Change row length to 8192 in BenchmarkBlockSerde to match real workload case
Since

PAGE_SPLIT_THRESHOLD_IN_BYTESis 2 * 1024 * 1024 in PageSplitterUtil currently, the row length used in BenchmarkBlockSerde 10_000_000 is too long and doesn't match the real workload case. Profiling shows that under row length 10_000_000, the majority of time on BenchmarkBlockSerde is spent on this array creationAfter the change

Tests:
Next steps
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:
Summary by Sourcery
Add vectorized SIMD-based null suppression for block serialization using the Java Vector API with dynamic CPU feature detection
New Features:
Enhancements:
Build:
Tests:
Summary by Sourcery
Add vectorized null suppression for block serialization using the Java Vector API with dynamic CPU feature detection, refactor block encodings to leverage the new SIMD path, and update benchmarks and tests to validate correctness.
New Features:
Enhancements:
Tests: