TAC improvements by luis4a0 · Pull Request #12220 · apache/gluten

luis4a0 · 2026-06-02T10:24:10Z

This PR extends the type-aware compression from #11894 with three new codecs:

kUInt128 (HUGEINT / LongDecimal, split-lane FFOR) for 16-byte integers.
kUInt32 (INTEGER + string-offsets, widened FFOR) for 4-byte integers.
kStringDict (adaptive dictionary vs LZ4) for variable-length string DATA buffers.

Additional improvements:

Adaptive LZ4 fallback on the integer codecs: the on-wire body is the smaller of native-FFOR vs LZ4, so the codec never inflates payloads relative to a plain LZ4 baseline.
kStringDict sliced-offsets handling: detect non-zero starting offsets and fall back to LZ4 so we don't reproduce stale prefix/suffix bytes.
kStringDict CPU regression guard for unique-string workloads: a single deterministic probe at clamp(numRows/8, 256, 4096) bails to LZ4 when no duplicate is seen, capping dict-build CPU on adversarial inputs.
Decompress output-size validation for kStringDict, kFForSplit128, and kFForWidened32 (guards against silent corruption).
Payload.cc kStringDict path uses the authoritative numRows from the payload header and validates the offsets buffer shape.

TAC is gated by spark.gluten.sql.columnar.shuffle.typeAwareCompress.enabled (default OFF), as it was in the original PR.

Extends the type-aware shuffle compression codec with three new codecs: - kUInt128 (HUGEINT / LongDecimal, split-lane FFOR) for 16-byte integers. - kUInt32 (INTEGER + string-offsets, widened FFOR) for 4-byte integers. - kStringDict (adaptive dictionary vs LZ4) for variable-length string DATA buffers. Additional improvements: - Adaptive LZ4 fallback on the integer codecs: the on-wire body is the smaller of native-FFOR vs LZ4, so the codec never inflates payloads relative to a plain LZ4 baseline. - kStringDict sliced-offsets handling: detect non-zero starting offsets and fall back to LZ4 so we don't reproduce stale prefix/suffix bytes. - kStringDict CPU regression guard for unique-string workloads: a single deterministic probe at clamp(numRows/8, 256, 4096) bails to LZ4 when no duplicate is seen, capping dict-build CPU on adversarial inputs. - Decompress output-size validation for kStringDict, kFForSplit128, and kFForWidened32 (guards against silent corruption). - Payload.cc kStringDict path uses the authoritative numRows from the payload header and validates the offsets buffer shape. Gated by spark.gluten.sql.columnar.shuffle.typeAwareCompress.enabled (default OFF); when off, this PR is a no-op identical to current main. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…leanup * core/utils/tac/ffor.hpp: 4-arg decompress64 (outputMaxValues), tail block size validation, encodeRt/decodeRt bw>64 guards, nValues==0 preload skip — defends against malformed/truncated FFOR streams. * core/utils/tac/FForCodec.cc: forward outputMaxValues to ffor::decompress64. * core/utils/tac/TypeAwareCompressCodec.cc: in decompressSplit128 and decompressWidened32, the FFOR decompress result count is a *value* count (uint64s produced), not a byte count. Compare against the expected value count rather than against the byte length. * core/tests/FForCodecTest.cc: convert TEST_F to TEST and remove the empty fixture class. gtest forbids mixing TEST and TEST_F in the same test suite, and the fixture had no state so the conversion is behaviour-preserving.

…s tests Two TypeAwareCompressCodecTest cases asserted that the kStringDict codec emits the dict strategy on (a) a constant-value column and (b) 10K rows drawn from a 3K-pool of long comment-like strings with a shared prefix. On both shapes LZ4 actually compresses better than the dict body: - Constant data is LZ4's best case (a single match reference spans the whole input, output is on the order of hundreds of bytes), whereas the dict body still pays for one dict entry + N index bytes. - Long shared-prefix strings have ~66 bytes of identical leading text that LZ77 captures very cheaply, beating the dict body's per-row 2-byte index overhead at this cardinality. The codec's actual contract is 'pick the smaller of dict body and LZ4 body', which is exactly what it does on these inputs. The tests' intent was to verify that the v3 cardinality probe (Guard 2) does NOT false-bail on duplicate-rich data; that part is still preserved implicitly (the dict body is built and compared, otherwise the codec could not have decided LZ4 was smaller). Rename both tests to *Roundtrips, document that either strategy is acceptable on these shapes, drop the wire-strategy assertion, and keep the roundtrip + compression-ratio checks (both strategies satisfy the size bounds).

… fallback for lengths shape compressStringDict reads offsets[numRows] as the data-buffer end sentinel. That is in-bounds for shape-a (Arrow standard: validity, offsets[numRows+1], data) but ONE PAST THE END for shape-b (validity, lengths[numRows], data), since the lengths buffer holds exactly numRows int32 elements. The codec's existing 'sliced' early-exit (offsets[0] != 0) masked this in the common case where shape-b rows have non-zero first-row length — but when the first row is empty/null, offsets[0] == 0 and execution falls through to offsets[numRows], producing an out-of-bounds read. Fix: detect shape-a explicitly in Payload.cc (prevSize equals (numRows+1) * sizeof(int32_t)) and only route those inputs through compressStringDict. Shape-b and any other unexpected shape fall through to the standard LZ4/ZSTD codec — which is what the codec's internal 'sliced' fallback was producing anyway in the common case, so observable compressed size on shape-b user data is unchanged. Also preserves the authoritative-numRows handling so single-row batches do not underflow to numRows == 0. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…> 2048 Guard 2 is the single deterministic probe that bails the kStringDict dict-build loop to LZ4 when no duplicate has been seen by row clamp(numRows/8, kDictBailoutMinCheckRows, kDictBailoutMaxCheckRows). The cap was originally set to 4096 to be very conservative against the v2-era birthday-paradox false-positive (a sampled mid-cardinality column that happened to yield 256 distinct draws in a row). A cap of 2048 is still safe against that false-positive: for a cardinality C, the probability of seeing zero duplicate in 2048 draws is approximately exp(-2048^2/(2*C)). At C=8K this is ~10^-114; at C=64K it is still ~10^-14; only essentially-all-unique columns ever bail. Tightening from 4096 -> 2048 halves the work done before bailout on the str_unique shape (where bailing IS the right call), without affecting any shape that legitimately picks dict (e.g. low-cardinality keys, long-shared-prefix comments). Updated test comments to reflect the new cap. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…-comments shapes The recent *Roundtrips tests (StringDictConstantColumnRoundtrips, StringDictLongCommentsRoundtrips) only validate the input/output contract: the codec succeeds, the output shrinks, and decompress is byte-exact — they intentionally accept either strategy because both dict and LZ4 satisfy that contract. This commit adds two complementary tests, StringDictConstantColumnPicksLz4 and StringDictLongCommentsPicksLz4, that PIN the deterministic strategy choice on those same shapes: * On a constant-value column LZ4 is dominated by a single 64KB-windowed match reference and produces only hundreds of bytes, beating the dict body (header + entry + 10K 1B indices ~ 10KB) by 50-100x. * On 10K rows drawn from a 3K-pool of ~80-char strings sharing a 66-char template prefix, LZ4 captures the shared prefix as a single repeated match plus per-row tails, beating the ~272KB dict body (3K entries + 10K 2B indices) by ~5x. Pinning the choice catches regressions where a future change to dict-header accounting (or to the strategy decision) silently makes the codec prefer dict on shapes where LZ4 is much smaller — which would bloat the wire form for some of the most common low-cardinality patterns. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The TAC wire format uses 4-byte length headers for the LZ4 payload (int codec + string-dict codec), the per-dict-entry byte count, and the dict total-bytes field. The encoder previously narrowed int64 / size_t sources into those int32 fields with no bounds check. A >2 GiB shuffle buffer would write a truncated length to the header while still copying the full payload — the reader would either reject the body (good case) or, if the truncated value happens to fit within the remaining body length, silently mis-decode. Add INT32_MAX guards at all four narrowing sites with clear `arrow::Status::Invalid` returns: * writeLz4Body (integer-codec LZ4 fallback): lz4Len * compressStringDict (dict body header): dictBytes total * compressStringDict (per-entry header): sv.size() for each entry * compressStringDict (string-dict LZ4 fallback): lz4Len Wire format unchanged on the < 2 GiB hot path; no behaviour change for existing workloads. The decode side already reads int32 fields with bound-checked length comparisons, so no reader-side change is needed. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

github-actions Bot added the VELOX label Jun 2, 2026

luis4a0 changed the title ~~wip~~ TAC improvements Jun 2, 2026

luis4a0 force-pushed the lpenaranda/best-tac branch 3 times, most recently from 07e4718 to a9ed670 Compare June 2, 2026 11:39

luis4a0 and others added 6 commits June 3, 2026 08:43

luis4a0 force-pushed the lpenaranda/best-tac branch from 53060ff to c2f3197 Compare June 3, 2026 08:44

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TAC improvements#12220

TAC improvements#12220
luis4a0 wants to merge 7 commits into
apache:mainfrom
luis4a0:lpenaranda/best-tac

luis4a0 commented Jun 2, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

luis4a0 commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

luis4a0 commented Jun 2, 2026 •

edited

Loading