@SongChujun
Member

@SongChujun SongChujun commented Oct 10, 2025

Description

This PR adds support for vectorized null suppression in block serde using the Java Vector API.

Functionalities added

Detect SIMD support from the CPU.

This check is essential to prevent regressions. Although the Java Vector API is platform-agnostic, it only guarantees correctness, not a performance improvement. If the JVM runs on an older CPU without adequate SIMD support, the Vector API may fall back to emulated execution instead of real SIMD instructions, so we add an extra gating layer to make sure that this fallback is never taken.

Currently, we support Intel and AMD CPUs. We may extend support to Graviton later if experiments show a speedup on Graviton machines.

Add vectorized path for null suppression in block serde

Add vectorized path for null suppression for byte/short/integer/long.
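
For readers new to the term: null suppression here means writing only the non-null values of a block contiguously. The following is a minimal scalar sketch with hypothetical names (it is not the PR's `EncoderUtil` code, which writes to a `SliceOutput`); the vectorized path performs the same compaction several lanes at a time using a mask-driven compress operation.

```java
import java.util.Arrays;

// Illustrative only: class and method names are hypothetical, not the PR's API.
final class NullSuppressionSketch
{
    private NullSuppressionSketch() {}

    // Copies values[offset .. offset + length) into a new array,
    // skipping every position whose isNull flag is true.
    static int[] suppressNulls(int[] values, boolean[] isNull, int offset, int length)
    {
        int[] compacted = new int[length];
        int count = 0;
        for (int i = offset; i < offset + length; i++) {
            if (!isNull[i]) {
                compacted[count++] = values[i];
            }
        }
        // Trim to the number of non-null values actually written.
        return Arrays.copyOf(compacted, count);
    }

    public static void main(String[] args)
    {
        int[] values = {10, 0, 30, 0, 50};
        boolean[] isNull = {false, true, false, true, false};
        // prints [10, 30, 50]
        System.out.println(Arrays.toString(suppressNulls(values, isNull, 0, values.length)));
    }
}
```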

Microbenchmark results are given below.

Microbenchmark on Intel CPU with AVX512F support.

Microbenchmark on AMD Zen 4 CPU with AVX512F support.

The speedup falls short of the theoretical maximum (16x for int and 8x for long with AVX-512) for two reasons:

  1. We only optimize the null suppression part, which accounts for only part of the latency of the page serialization that the benchmark measures (~55% for compressInts).
image
  2. The vectorized null suppression has a much higher L1d cache miss rate: 54.14% for vectorized versus 6.56% for scalar. Such cache misses can easily wipe out SIMD's performance gains.
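
The first reason can be made concrete with Amdahl's law. The sketch below is a back-of-the-envelope estimate only, with hypothetical names; it takes the ~55% share quoted above for compressInts and assumes the ideal 16x lane-count speedup for that share, which gives an overall bound of roughly 2.06x before any cache effects.

```java
// Back-of-the-envelope estimate; names are hypothetical and the inputs are
// the figures quoted in the description (55% share, 512-bit registers).
final class SpeedupBoundSketch
{
    private SpeedupBoundSketch() {}

    // Number of elements of elementBits width that fit in one vector register.
    static int lanes(int registerBits, int elementBits)
    {
        return registerBits / elementBits;
    }

    // Amdahl's law: overall speedup when only a fraction of the runtime
    // is accelerated by the given factor.
    static double overallSpeedup(double acceleratedFraction, double factor)
    {
        return 1.0 / ((1.0 - acceleratedFraction) + acceleratedFraction / factor);
    }

    public static void main(String[] args)
    {
        int intLanes = lanes(512, Integer.SIZE);   // 16 lanes for int
        int longLanes = lanes(512, Long.SIZE);     // 8 lanes for long
        System.out.println("int lanes = " + intLanes + ", long lanes = " + longLanes);
        System.out.println("upper bound for compressInts = " + overallSpeedup(0.55, intLanes));
    }
}
```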

Change row length to 8192 in BenchmarkBlockSerde to match real workload case

Since PAGE_SPLIT_THRESHOLD_IN_BYTES is currently 2 * 1024 * 1024 in PageSplitterUtil, the row length of 10_000_000 used in BenchmarkBlockSerde is too long and doesn't match real workloads. Profiling shows that with a row length of 10_000_000, the majority of the time in BenchmarkBlockSerde is spent on this array creation
image

After the change
image
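
To put the two row lengths in perspective, here is a rough size calculation. It is a sketch with hypothetical names that counts only the raw value bytes of a single long column (null flags and other per-page overhead are ignored), using the threshold value quoted above.

```java
// Rough arithmetic only; ignores null flags and any per-page overhead.
final class PageSizeSketch
{
    // Value quoted above from PageSplitterUtil.
    static final long PAGE_SPLIT_THRESHOLD_IN_BYTES = 2 * 1024 * 1024;

    private PageSizeSketch() {}

    // Raw bytes needed to hold the given number of fixed-width values.
    static long rawBytes(long rows, int bytesPerValue)
    {
        return rows * bytesPerValue;
    }

    public static void main(String[] args)
    {
        long oldSize = rawBytes(10_000_000, Long.BYTES);  // 80,000,000 bytes
        long newSize = rawBytes(8192, Long.BYTES);        // 65,536 bytes
        System.out.println("old benchmark column: " + oldSize + " bytes, "
                + (oldSize / PAGE_SPLIT_THRESHOLD_IN_BYTES) + "x the split threshold");
        System.out.println("new benchmark column: " + newSize + " bytes, below threshold: "
                + (newSize < PAGE_SPLIT_THRESHOLD_IN_BYTES));
    }
}
```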

Tests:

  • Add TestEncoderUtil to verify scalar and vectorized compression outputs match

Next steps

  1. Currently, the vectorized code path is enabled only on machines with AVX-512 support. We may want to extend this to AVX2, which could open up opportunities for AMD Zen 3 CPUs, which only support AVX2.
  2. Extend SIMD support to AWS Graviton machines
  3. Add SIMD support for other operations.

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

## General
* Improve performance of data exchanges by using SIMD instructions on x86 CPUs that support the required extensions. This can be disabled by setting `experimental.blockserde-vectorized-null-suppression-strategy=NONE`  ({issue}`26919`)

Summary by Sourcery

Add vectorized SIMD-based null suppression for block serialization using the Java Vector API with dynamic CPU feature detection

New Features:

  • Add compressBytesWithNulls, compressShortsWithNulls, compressIntsWithNulls, compressLongsWithNulls methods in EncoderUtil with vectorized and scalar paths
  • Introduce SimdSupport SPI interface and platform-specific implementations (Intel, AMD, Graviton) with SimdSupportManager for runtime detection
  • Refactor block encodings to delegate null filtering to new EncoderUtil methods
  • Add SimdInitializer for eager SIMD detection at server startup

Enhancements:

  • Expose EncoderUtil publicly and centralize null suppression logic
  • Reduce benchmark row count and initialize SIMD support in benchmarks and tests
  • Update module-info to require jdk.incubator.vector

Build:

  • Add SimdInitializer binding in Guice and include incubator vector module

Tests:

  • Add TestEncoderUtil to verify scalar and vectorized compression outputs match

Summary by Sourcery

Add vectorized null suppression for block serialization using the Java Vector API with dynamic CPU feature detection, refactor block encodings to leverage the new SIMD path, and update benchmarks and tests to validate correctness.

New Features:

  • Introduce SimdSupport SPI and runtime detection (SimdSupportManager) with Intel, AMD, and Graviton implementations
  • Add vectorized null suppression methods in EncoderUtil for byte, short, int, and long arrays
  • Add SimdInitializer to eagerly detect and install SIMD support at server startup

Enhancements:

  • Refactor existing block encodings to delegate null filtering to the new EncoderUtil.compress*WithNulls methods
  • Reduce row count in BenchmarkBlockSerde to match real workloads and ensure SIMD initialization

Tests:

  • Add TestEncoderUtil to verify scalar and vectorized compression outputs match
  • Initialize SIMD support in existing SPI block encoding tests

@cla-bot cla-bot bot added the cla-signed label Oct 10, 2025
@sourcery-ai

sourcery-ai bot commented Oct 10, 2025

Reviewer's Guide

Introduce dynamic SIMD feature detection and vectorized null suppression in block serialization using Java Vector API, refactor block encodings to leverage the new API, adjust benchmarks/tests for SIMD support, and verify correctness with new unit tests.

Class diagram for new and updated SIMD support classes

classDiagram
    class SimdSupport {
        <<interface>>
        +boolean supportByteGeneric()
        +boolean supportShortGeneric()
        +boolean supportIntegerGeneric()
        +boolean supportLongGeneric()
        +boolean supportByteCompress()
        +boolean supportShortCompress()
        +boolean supportIntegerCompress()
        +boolean supportLongCompress()
        +static SimdSupport NONE
    }
    class IntelSimdSupport {
        +IntelSimdSupport(OSType)
        +boolean supportByteGeneric()
        +boolean supportShortGeneric()
        +boolean supportIntegerGeneric()
        +boolean supportLongGeneric()
        +boolean supportByteCompress()
        +boolean supportShortCompress()
        +boolean supportIntegerCompress()
        +boolean supportLongCompress()
    }
    class AmdSimdSupport {
        +AmdSimdSupport(OSType)
        +boolean supportByteGeneric()
        +boolean supportShortGeneric()
        +boolean supportIntegerGeneric()
        +boolean supportLongGeneric()
        +boolean supportByteCompress()
        +boolean supportShortCompress()
        +boolean supportIntegerCompress()
        +boolean supportLongCompress()
    }
    class GravitonSimdSupport {
        +GravitonSimdSupport(OSType)
    }
    SimdSupport <|.. IntelSimdSupport
    SimdSupport <|.. AmdSimdSupport
    SimdSupport <|.. GravitonSimdSupport
    class SimdSupportManager {
        +static void initialize()
        +static SimdSupport get()
        +static boolean isInitialized()
    }
    class SimdUtils {
        +static boolean isLinuxGraviton()
        +static Optional<String> linuxCpuVendorId()
        +static Set<String> readCpuFlags(OSType)
        +static String normalizeFlag(String)
    }
    class SimdInitializer {
        +SimdInitializer()
        +SimdSupport simdSupport()
    }
    SimdSupportManager --> SimdSupport
    SimdInitializer --> SimdSupportManager
    IntelSimdSupport --> OSType
    AmdSimdSupport --> OSType
    GravitonSimdSupport --> OSType
    SimdSupportManager --> TargetArch
    SimdSupportManager --> OSType
    SimdUtils --> OSType

Class diagram for updated EncoderUtil and block encoding classes

classDiagram
    class EncoderUtil {
        +static void setSimdSupport(SimdSupport)
        +static void compressBytesWithNulls(SliceOutput, byte[], boolean[], int, int)
        +static void compressShortsWithNulls(SliceOutput, short[], boolean[], int, int)
        +static void compressIntsWithNulls(SliceOutput, int[], boolean[], int, int)
        +static void compressLongsWithNulls(SliceOutput, long[], boolean[], int, int)
        -static void compressBytesWithNullsVectorized(...)
        -static void compressBytesWithNullsScalar(...)
        -static void compressShortsWithNullsVectorized(...)
        -static void compressShortsWithNullsScalar(...)
        -static void compressIntsWithNullsVectorized(...)
        -static void compressIntsWithNullsScalar(...)
        -static void compressLongsWithNullsVectorized(...)
        -static void compressLongsWithNullsScalar(...)
        +static SimdSupport simd
    }
    class ByteArrayBlockEncoding {
        +void writeBlock(...)
    }
    class ShortArrayBlockEncoding {
        +void writeBlock(...)
    }
    class IntArrayBlockEncoding {
        +void writeBlock(...)
    }
    class LongArrayBlockEncoding {
        +void writeBlock(...)
    }
    EncoderUtil <.. ByteArrayBlockEncoding : uses
    EncoderUtil <.. ShortArrayBlockEncoding : uses
    EncoderUtil <.. IntArrayBlockEncoding : uses
    EncoderUtil <.. LongArrayBlockEncoding : uses

File-Level Changes

Change Details Files
Runtime CPU SIMD detection framework
  • Define SimdSupport SPI interface
  • Add SimdUtils for CPU flag probing on Linux
  • Implement SimdSupportManager for target architecture detection
  • Provide platform-specific SimdSupport implementations (Intel, AMD, Graviton)
  • Introduce SimdInitializer and bind it eagerly in ServerMainModule
  • Require jdk.incubator.vector in module-info
  • Update Guice module for eager initialization
core/trino-spi/src/main/java/io/trino/spi/SimdSupport.java
core/trino-spi/src/main/java/io/trino/spi/simd/SimdUtils.java
core/trino-spi/src/main/java/io/trino/spi/simd/SimdSupportManager.java
core/trino-spi/src/main/java/io/trino/spi/simd/AmdSimdSupport.java
core/trino-spi/src/main/java/io/trino/spi/simd/IntelSimdSupport.java
core/trino-spi/src/main/java/io/trino/spi/simd/GravitonSimdSupport.java
core/trino-spi/src/main/java/io/trino/spi/simd/OSType.java
core/trino-spi/src/main/java/io/trino/spi/simd/TargetArch.java
core/trino-main/src/main/java/io/trino/simd/SimdInitializer.java
core/trino-main/src/main/java/io/trino/server/ServerMainModule.java
core/trino-spi/src/main/java/module-info.java
Vectorized null suppression API in EncoderUtil
  • Expose EncoderUtil publicly and add setSimdSupport hook
  • Introduce optimization threshold and path selection using SimdSupport
  • Implement compressBytes/Shorts/Ints/LongsWithNulls methods with scalar and vectorized logic
  • Use jdk.incubator.vector for mask loading and compress operations
core/trino-spi/src/main/java/io/trino/spi/block/EncoderUtil.java
Refactor block encodings to use the new Compressor API
  • Replace manual null filtering loops with EncoderUtil.compress*WithNulls calls in ByteArrayBlockEncoding
  • Apply same refactoring in IntArrayBlockEncoding, ShortArrayBlockEncoding, and LongArrayBlockEncoding
core/trino-spi/src/main/java/io/trino/spi/block/ByteArrayBlockEncoding.java
core/trino-spi/src/main/java/io/trino/spi/block/IntArrayBlockEncoding.java
core/trino-spi/src/main/java/io/trino/spi/block/ShortArrayBlockEncoding.java
core/trino-spi/src/main/java/io/trino/spi/block/LongArrayBlockEncoding.java
Update benchmarks and tests for SIMD support
  • Reduce row count to realistic 8192 in BenchmarkBlockSerde
  • Invoke SimdSupportManager.initialize() in BenchmarkBlockSerde, BaseBlockEncodingTest, and TestingBlockEncodingSerde static blocks
core/trino-main/src/test/java/io/trino/execution/buffer/BenchmarkBlockSerde.java
core/trino-spi/src/test/java/io/trino/spi/block/BaseBlockEncodingTest.java
core/trino-spi/src/test/java/io/trino/spi/block/TestingBlockEncodingSerde.java
Add TestEncoderUtil for scalar vs. vector correctness
  • Create TestEncoderUtil with random data generators
  • Verify that scalar and vectorized compress* implementations produce identical output for bytes, shorts, ints, and longs
core/trino-spi/src/test/java/io/trino/spi/block/TestEncoderUtil.java

@SongChujun SongChujun force-pushed the vectorized-null-suppression branch from 3cf2de1 to 5da70b5 Compare October 13, 2025 15:20
* treat as Graviton (covers Graviton2/3 where the model may be Neoverse N1/V1/V2).
*/
public static boolean isLinuxGraviton()
{
Contributor

Instead of custom detection logic, we should first use what https://github.com/oshi/oshi supports and only then fall back to custom parsing.

Member Author

If we want to add SIMD detection logic in the SPI, I am not sure we want to use a third-party library, since the SPI should have very minimal dependencies, and the detection logic is simple enough that we can maintain it ourselves.

Contributor

@wendigo wendigo Oct 13, 2025

First of all, I don't think it should live in the SPI in the first place. These kinds of optimizations will, in 99% of use cases, live in the trino-main module, which already depends on oshi. Having your own implementation rather than relying on an external one has a cost that we don't want to bear.

Contributor

The only reason this is needed in the SPI is that the BlockEncoding implementations live there, but tbh concrete implementations shouldn't be part of trino-spi. BlockEncodings can be part of a plugin, but the built-in ones should just be moved to trino-main.

Member Author

I agree that we need to think about where such logic should be placed. We need to guarantee it is accessible from lib (the case for vectorized decoding in the parquet reader), the SPI (the code for BlockEncoding, though that is debatable since we may move it to trino-main), trino-main, plugins, etc.

@sourcery-ai sourcery-ai bot left a comment

Hey there - I've reviewed your changes and they look great!

Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location> `core/trino-spi/src/main/java/io/trino/spi/block/EncoderUtil.java:45` </location>
<code_context>
+
     private EncoderUtil() {}

+    public static void setSimdSupport(SimdSupport simdSupport)
+    {
+        simd = requireNonNull(simdSupport, "simdSupport is null");
</code_context>

<issue_to_address>
**issue (bug_risk):** setSimdSupport is public and mutable, which may allow unexpected reconfiguration at runtime.

Consider restricting the visibility of setSimdSupport or ensuring it cannot be called multiple times to prevent inconsistent state.
</issue_to_address>
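
One way to address the "cannot be called multiple times" option is a set-once holder. The sketch below is generic and uses hypothetical names; it is not the PR's `EncoderUtil`, and whether a second call should throw or be ignored is an assumption made here for illustration.

```java
import java.util.concurrent.atomic.AtomicReference;

import static java.util.Objects.requireNonNull;

// Hypothetical helper: holds a value that may be installed at most once.
final class SetOnceSketch<T>
{
    private final AtomicReference<T> value = new AtomicReference<>();

    // Installs the value; a second call fails instead of silently reconfiguring.
    void set(T newValue)
    {
        requireNonNull(newValue, "newValue is null");
        if (!value.compareAndSet(null, newValue)) {
            throw new IllegalStateException("value is already set");
        }
    }

    // Returns the installed value, or the fallback if nothing was installed yet.
    T getOrDefault(T fallback)
    {
        T current = value.get();
        return current != null ? current : fallback;
    }
}
```

With such a holder, tests that need to switch implementations would have to construct their own instance rather than mutate shared static state.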

### Comment 2
<location> `core/trino-spi/src/test/java/io/trino/spi/block/TestEncoderUtil.java:100-107` </location>
<code_context>
+        }
+    }
+
+    public static boolean[][] getIsNullArray(int length)
+    {
+        return new boolean[][] {
</code_context>

<issue_to_address>
**suggestion (testing):** Suggestion: Add more edge cases for isNull patterns.

Include cases with a single null at the start or end, and consecutive nulls in the middle, to improve test coverage.

```suggestion
    public static boolean[][] getIsNullArray(int length)
    {
        return new boolean[][] {
                all(false, length),
                all(true, length),
                alternating(length),
                randomBools(length),
                singleNullAtStart(length),
                singleNullAtEnd(length),
                consecutiveNullsInMiddle(length)};
    }

    private static boolean[] singleNullAtStart(int length)
    {
        boolean[] arr = new boolean[length];
        if (length > 0) {
            arr[0] = true;
        }
        return arr;
    }

    private static boolean[] singleNullAtEnd(int length)
    {
        boolean[] arr = new boolean[length];
        if (length > 0) {
            arr[length - 1] = true;
        }
        return arr;
    }

    private static boolean[] consecutiveNullsInMiddle(int length)
    {
        boolean[] arr = new boolean[length];
        if (length >= 4) {
            arr[length / 2 - 1] = true;
            arr[length / 2] = true;
        }
        return arr;
    }
```
</issue_to_address>

### Comment 3
<location> `core/trino-spi/src/test/java/io/trino/spi/block/TestEncoderUtil.java:49-50` </location>
<code_context>
+        }
+    }
+
+    @AfterAll
+    public static void resetSimd()
+    {
+        EncoderUtil.setSimdSupport(SimdSupport.NONE);
</code_context>

<issue_to_address>
**nitpick (testing):** Nitpick: Consider resetting SimdSupport before each test for isolation.

Resetting SimdSupport before each test, such as with @BeforeEach, will prevent state leakage between tests and improve reliability.
</issue_to_address>


@pettyjamesm
Member

Discussed this offline, from here we want to:

  1. Add a configuration property to explicitly disable vectorized block encoding logic (in case a performance regression occurs on some specific hardware in the wild)
  2. Put the operations that are either vectorized or scalar behind an interface, defined in trino-spi, e.g.: interface BlockEncodingSupport. The SPI defined implementation will be purely scalar, but the implementation bound from trino-main will be either vectorized or scalar depending on the configuration setting and the hardware support detection (which can be based on oshi at that point). @wendigo, does that strategy work for you?
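
The two points above could be sketched roughly as follows. Only the name `BlockEncodingSupport` comes from the comment; the method signature, the selector class, and the boolean inputs are hypothetical placeholders, and the vectorized implementation is passed in rather than shown.

```java
// Hypothetical sketch of the proposed split; signatures are illustrative only.
interface BlockEncodingSupport
{
    // Writes the non-null values of values[offset .. offset + length) to output
    // and returns how many values were written.
    int compressInts(int[] values, boolean[] isNull, int offset, int length, int[] output);
}

// Purely scalar implementation, as the SPI-defined default would be.
final class ScalarBlockEncodingSupport
        implements BlockEncodingSupport
{
    @Override
    public int compressInts(int[] values, boolean[] isNull, int offset, int length, int[] output)
    {
        int count = 0;
        for (int i = offset; i < offset + length; i++) {
            if (!isNull[i]) {
                output[count++] = values[i];
            }
        }
        return count;
    }
}

// Stand-in for the binding logic that would live in the engine module.
final class BlockEncodingSupportSelector
{
    private BlockEncodingSupportSelector() {}

    // The vectorized implementation is used only if the configuration allows it
    // AND the hardware detection reported native support.
    static BlockEncodingSupport select(boolean enabledByConfig, boolean hardwareSupported, BlockEncodingSupport vectorized)
    {
        return (enabledByConfig && hardwareSupported) ? vectorized : new ScalarBlockEncodingSupport();
    }
}
```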

@SongChujun SongChujun force-pushed the vectorized-null-suppression branch 2 times, most recently from e657e5a to c5eaac9 Compare October 15, 2025 20:10
@cla-bot

cla-bot bot commented Oct 15, 2025

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: EC2 Default User.
This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email [email protected]
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

@cla-bot cla-bot bot removed the cla-signed label Oct 15, 2025
@SongChujun SongChujun force-pushed the vectorized-null-suppression branch from c5eaac9 to 5719c20 Compare October 15, 2025 21:01
@cla-bot cla-bot bot added the cla-signed label Oct 15, 2025
@SongChujun SongChujun force-pushed the vectorized-null-suppression branch 4 times, most recently from cc9407a to 5422d53 Compare October 17, 2025 15:03
addBlockEncoding(new ShortArrayBlockEncoding());
addBlockEncoding(new IntArrayBlockEncoding());
addBlockEncoding(new LongArrayBlockEncoding());
addBlockEncoding(new ByteArrayBlockEncoding(false));
Member

Let's prefer the vectorized implementation by default, even though the testing hardware may not support the specific instruction sets we're interested in. We'll rely on the JVM fallback implementations under the assumption that those should match the semantics of vectorization (even though they'll perform worse) and give us better coverage of the relevant codepaths. (also add an inline comment in the code to note this assumption).

@SongChujun SongChujun force-pushed the vectorized-null-suppression branch 2 times, most recently from 3c4ad5f to 62fb687 Compare October 17, 2025 20:42
Comment on lines 56 to 99
        ProcessorIdentifier id = MachineInfo.getProcessorInfo();

        String vendor = id.getVendor().toLowerCase(ENGLISH);

        if (vendor.contains("intel") || vendor.contains("amd")) {
            return detectX86SimdSupport();
        }

        return SimdSupport.NONE;
    }

    private static SimdSupport detectX86SimdSupport()
    {
        enum X86Isa {
            avx512f,
            avx512vbmi2
        }

        Set<String> flags = readCpuFlags();
        EnumSet<X86Isa> x86Flags = EnumSet.noneOf(X86Isa.class);

        if (!flags.isEmpty()) {
            for (X86Isa isa : X86Isa.values()) {
                if (flags.contains(isa.name())) {
                    x86Flags.add(isa);
                }
            }
        }
Member

I'd just use the same technique we used in io.trino.parquet.reader.ColumnReaderFactory#isVectorizedDecodingSupported. My memory is that testing found ARM Neon to be slower, so basically we just disable vector instructions unless PREFERRED_BIT_WIDTH >= 256.

Member

Discussed offline, for this purpose we need specifically to detect AVX512F (for VPCOMPRESSD / VPCOMPRESSQ instruction support for int and long types) and AVX512VBMI2 (for VPCOMPRESSB / VPCOMPRESSW instruction support over byte and short types).

  1. AVX512F is fairly widely supported, starting with Intel Xeon Skylake CPUs and AMD Zen 4 CPUs
  2. AVX512VBMI2 support starts with Intel Icelake and AMD Zen 4 CPUs

The real thing we would like to check here is whether Vector<T>#compress(VectorMask<T>) is supported natively in hardware or emulated by the JVM, because the emulated support is so much slower than the existing simple scalar code. Since we can't detect that directly from the JDK Vector API, we have to assume that native support exists whenever the CPU advertises it.
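
As a rough illustration of the flag check described above, the sketch below parses Linux `/proc/cpuinfo`-style text and maps the two flags to the element types they cover. All names are hypothetical, and it assumes that Linux may spell the second flag with an underscore (`avx512_vbmi2`), so underscores are stripped before comparison; this is not the PR's detection code.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch; operates on text so it can be exercised without reading /proc/cpuinfo.
final class X86CompressSupportSketch
{
    private X86CompressSupportSketch() {}

    // Extracts the normalized flag set from the first "flags" line, or an empty set if none exists.
    static Set<String> parseFlags(String cpuInfo)
    {
        for (String line : cpuInfo.split("\n")) {
            int colon = line.indexOf(':');
            if (colon > 0 && line.substring(0, colon).trim().equals("flags")) {
                Set<String> flags = new HashSet<>();
                for (String flag : line.substring(colon + 1).trim().split("\\s+")) {
                    if (!flag.isEmpty()) {
                        // Assumption: normalize e.g. "avx512_vbmi2" to "avx512vbmi2".
                        flags.add(flag.toLowerCase(Locale.ENGLISH).replace("_", ""));
                    }
                }
                return flags;
            }
        }
        return Set.of();
    }

    // AVX512F provides VPCOMPRESSD / VPCOMPRESSQ (int and long).
    static boolean supportsIntAndLongCompress(Set<String> flags)
    {
        return flags.contains("avx512f");
    }

    // AVX512VBMI2 provides VPCOMPRESSB / VPCOMPRESSW (byte and short).
    static boolean supportsByteAndShortCompress(Set<String> flags)
    {
        return flags.contains("avx512vbmi2");
    }
}
```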

<old>method long[] io.trino.spi.PageSorter::sort(java.util.List&lt;io.trino.spi.type.Type&gt;, java.util.List&lt;io.trino.spi.Page&gt;, java.util.List&lt;java.lang.Integer&gt;, java.util.List&lt;io.trino.spi.connector.SortOrder&gt;, int)</old>
<new>method java.util.Iterator&lt;io.trino.spi.Page&gt; io.trino.spi.PageSorter::sort(java.util.List&lt;io.trino.spi.type.Type&gt;, java.util.List&lt;io.trino.spi.Page&gt;, java.util.List&lt;java.lang.Integer&gt;, java.util.List&lt;io.trino.spi.connector.SortOrder&gt;, int)</new>
</item>
<item>
Member

Ideally, I'd like to do this change without messing with the SPI. My experience with the vector stuff is that it either works or doesn't, so maybe we can just have a global kill switch.

Member

That's more or less what you're looking at here. Detection of what the hardware supports has to happen within trino-main and then get passed as a boolean parameter to the block encoding constructors. Hardware detection is combined with a kill switch to disable the feature if needed, but the block encoder constructors need to change either way to accept the result of config + hardware detection.

@SongChujun SongChujun force-pushed the vectorized-null-suppression branch from 62fb687 to 8fa4cfe Compare October 23, 2025 15:04
@SongChujun SongChujun force-pushed the vectorized-null-suppression branch from 8fa4cfe to d6474d8 Compare October 23, 2025 17:04
@pettyjamesm
Member

Changes and approach overall look good to me at this point, with the following items gating final approval / merge:

  1. Clean up the commit history so that we don't introduce SimdSupport under trino-spi in the first commit only to move it back to trino-main in the second commit.
  2. Get additional maintainer review / approval since this is a significant extension to the current vector API usage in Trino. cc: @raunaqmorarka / @dain

4 participants