Skip to content

TAC improvements#12220

Draft
luis4a0 wants to merge 7 commits into
apache:mainfrom
luis4a0:lpenaranda/best-tac
Draft

TAC improvements#12220
luis4a0 wants to merge 7 commits into
apache:mainfrom
luis4a0:lpenaranda/best-tac

Conversation

@luis4a0
Copy link
Copy Markdown
Contributor

@luis4a0 luis4a0 commented Jun 2, 2026

This PR extends the type-aware compression from #11894 with three new codecs:

  • kUInt128 (HUGEINT / LongDecimal, split-lane FFOR) for 16-byte integers.
  • kUInt32 (INTEGER + string-offsets, widened FFOR) for 4-byte integers.
  • kStringDict (adaptive dictionary vs LZ4) for variable-length string DATA buffers.

Additional improvements:

  • Adaptive LZ4 fallback on the integer codecs: the on-wire body is the smaller of native-FFOR vs LZ4, so the codec never inflates payloads relative to a plain LZ4 baseline.
  • kStringDict sliced-offsets handling: detect non-zero starting offsets and fall back to LZ4 so we don't reproduce stale prefix/suffix bytes.
  • kStringDict CPU regression guard for unique-string workloads: a single deterministic probe at clamp(numRows/8, 256, 4096) bails to LZ4 when no duplicate is seen, capping dict-build CPU on adversarial inputs.
  • Decompress output-size validation for kStringDict, kFForSplit128, and kFForWidened32 (guards against silent corruption).
  • Payload.cc kStringDict path uses the authoritative numRows from the payload header and validates the offsets buffer shape.

TAC is gated by spark.gluten.sql.columnar.shuffle.typeAwareCompress.enabled (default OFF), as it was in the original PR.

@github-actions github-actions Bot added the VELOX label Jun 2, 2026
@luis4a0 luis4a0 changed the title wip TAC improvements Jun 2, 2026
@luis4a0 luis4a0 force-pushed the lpenaranda/best-tac branch 3 times, most recently from 07e4718 to a9ed670 Compare June 2, 2026 11:39
luis4a0 and others added 6 commits June 3, 2026 08:43
Extends the type-aware shuffle compression codec with three new codecs:

  - kUInt128 (HUGEINT / LongDecimal, split-lane FFOR) for 16-byte integers.
  - kUInt32 (INTEGER + string-offsets, widened FFOR) for 4-byte integers.
  - kStringDict (adaptive dictionary vs LZ4) for variable-length string
    DATA buffers.

Additional improvements:

  - Adaptive LZ4 fallback on the integer codecs: the on-wire body is the
    smaller of native-FFOR vs LZ4, so the codec never inflates payloads
    relative to a plain LZ4 baseline.
  - kStringDict sliced-offsets handling: detect non-zero starting offsets
    and fall back to LZ4 so we don't reproduce stale prefix/suffix bytes.
  - kStringDict CPU regression guard for unique-string workloads: a single
    deterministic probe at clamp(numRows/8, 256, 4096) bails to LZ4 when
    no duplicate is seen, capping dict-build CPU on adversarial inputs.
  - Decompress output-size validation for kStringDict, kFForSplit128, and
    kFForWidened32 (guards against silent corruption).
  - Payload.cc kStringDict path uses the authoritative numRows from the
    payload header and validates the offsets buffer shape.

Gated by spark.gluten.sql.columnar.shuffle.typeAwareCompress.enabled
(default OFF); when off, this PR is a no-op identical to current main.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…leanup

* core/utils/tac/ffor.hpp: 4-arg decompress64 (outputMaxValues), tail
  block size validation, encodeRt/decodeRt bw>64 guards, nValues==0
  preload skip — defends against malformed/truncated FFOR streams.
* core/utils/tac/FForCodec.cc: forward outputMaxValues to ffor::decompress64.
* core/utils/tac/TypeAwareCompressCodec.cc: in decompressSplit128 and
  decompressWidened32, the FFOR decompress result count is a *value*
  count (uint64s produced), not a byte count. Compare against the
  expected value count rather than against the byte length.
* core/tests/FForCodecTest.cc: convert TEST_F to TEST and remove the
  empty fixture class. gtest forbids mixing TEST and TEST_F in the
  same test suite, and the fixture had no state so the conversion is
  behaviour-preserving.
…s tests

Two TypeAwareCompressCodecTest cases asserted that the kStringDict codec
emits the dict strategy on (a) a constant-value column and (b) 10K rows
drawn from a 3K-pool of long comment-like strings with a shared prefix.

On both shapes LZ4 actually compresses better than the dict body:
- Constant data is LZ4's best case (a single match reference spans the
  whole input, output is on the order of hundreds of bytes), whereas the
  dict body still pays for one dict entry + N index bytes.
- Long shared-prefix strings have ~66 bytes of identical leading text that
  LZ77 captures very cheaply, beating the dict body's per-row 2-byte
  index overhead at this cardinality.

The codec's actual contract is 'pick the smaller of dict body and LZ4
body', which is exactly what it does on these inputs. The tests'
intent was to verify that the v3 cardinality probe (Guard 2) does NOT
false-bail on duplicate-rich data; that part is still preserved
implicitly (the dict body is built and compared, otherwise the codec
could not have decided LZ4 was smaller).

Rename both tests to *Roundtrips, document that either strategy is
acceptable on these shapes, drop the wire-strategy assertion, and keep
the roundtrip + compression-ratio checks (both strategies satisfy the
size bounds).
… fallback for lengths shape

compressStringDict reads offsets[numRows] as the data-buffer end sentinel.
That is in-bounds for shape-a (Arrow standard: validity, offsets[numRows+1],
data) but ONE PAST THE END for shape-b (validity, lengths[numRows], data),
since the lengths buffer holds exactly numRows int32 elements.

The codec's existing 'sliced' early-exit (offsets[0] != 0) masked this in
the common case where shape-b rows have non-zero first-row length — but
when the first row is empty/null, offsets[0] == 0 and execution falls
through to offsets[numRows], producing an out-of-bounds read.

Fix: detect shape-a explicitly in Payload.cc (prevSize equals
(numRows+1) * sizeof(int32_t)) and only route those inputs through
compressStringDict. Shape-b and any other unexpected shape fall through
to the standard LZ4/ZSTD codec — which is what the codec's internal
'sliced' fallback was producing anyway in the common case, so observable
compressed size on shape-b user data is unchanged.

Also preserves the authoritative-numRows handling so single-row batches
do not underflow to numRows == 0.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…> 2048

Guard 2 is the single deterministic probe that bails the kStringDict
dict-build loop to LZ4 when no duplicate has been seen by row
clamp(numRows/8, kDictBailoutMinCheckRows, kDictBailoutMaxCheckRows).
The cap was originally set to 4096 to be very conservative against the
v2-era birthday-paradox false-positive (a sampled mid-cardinality
column that happened to yield 256 distinct draws in a row).

A cap of 2048 is still safe against that false-positive: for a
cardinality C, the probability of seeing zero duplicate in 2048 draws
is approximately exp(-2048^2/(2*C)). At C=8K this is ~10^-114; at C=64K
it is still ~10^-14; only essentially-all-unique columns ever bail.

Tightening from 4096 -> 2048 halves the work done before bailout on the
str_unique shape (where bailing IS the right call), without affecting
any shape that legitimately picks dict (e.g. low-cardinality keys,
long-shared-prefix comments). Updated test comments to reflect the new
cap.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…-comments shapes

The recent *Roundtrips tests (StringDictConstantColumnRoundtrips,
StringDictLongCommentsRoundtrips) only validate the input/output
contract: the codec succeeds, the output shrinks, and decompress is
byte-exact — they intentionally accept either strategy because both
dict and LZ4 satisfy that contract.

This commit adds two complementary tests, StringDictConstantColumnPicksLz4
and StringDictLongCommentsPicksLz4, that PIN the deterministic strategy
choice on those same shapes:

* On a constant-value column LZ4 is dominated by a single 64KB-windowed
  match reference and produces only hundreds of bytes, beating the dict
  body (header + entry + 10K 1B indices ~ 10KB) by 50-100x.
* On 10K rows drawn from a 3K-pool of ~80-char strings sharing a 66-char
  template prefix, LZ4 captures the shared prefix as a single repeated
  match plus per-row tails, beating the ~272KB dict body (3K entries +
  10K 2B indices) by ~5x.

Pinning the choice catches regressions where a future change to
dict-header accounting (or to the strategy decision) silently makes the
codec prefer dict on shapes where LZ4 is much smaller — which would
bloat the wire form for some of the most common low-cardinality patterns.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@luis4a0 luis4a0 force-pushed the lpenaranda/best-tac branch from 53060ff to c2f3197 Compare June 3, 2026 08:44
The TAC wire format uses 4-byte length headers for the LZ4 payload
(int codec + string-dict codec), the per-dict-entry byte count, and
the dict total-bytes field. The encoder previously narrowed int64 /
size_t sources into those int32 fields with no bounds check. A >2 GiB
shuffle buffer would write a truncated length to the header while
still copying the full payload — the reader would either reject the
body (good case) or, if the truncated value happens to fit within the
remaining body length, silently mis-decode.

Add INT32_MAX guards at all four narrowing sites with clear
`arrow::Status::Invalid` returns:

  * writeLz4Body (integer-codec LZ4 fallback): lz4Len
  * compressStringDict (dict body header): dictBytes total
  * compressStringDict (per-entry header): sv.size() for each entry
  * compressStringDict (string-dict LZ4 fallback): lz4Len

Wire format unchanged on the < 2 GiB hot path; no behaviour change
for existing workloads. The decode side already reads int32 fields
with bound-checked length comparisons, so no reader-side change is
needed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant