TAC improvements#12220
Draft
luis4a0 wants to merge 7 commits into
Draft
Conversation
07e4718 to
a9ed670
Compare
Extends the type-aware shuffle compression codec with three new codecs:
- kUInt128 (HUGEINT / LongDecimal, split-lane FFOR) for 16-byte integers.
- kUInt32 (INTEGER + string-offsets, widened FFOR) for 4-byte integers.
- kStringDict (adaptive dictionary vs LZ4) for variable-length string
DATA buffers.
Additional improvements:
- Adaptive LZ4 fallback on the integer codecs: the on-wire body is the
smaller of native-FFOR vs LZ4, so the codec never inflates payloads
relative to a plain LZ4 baseline.
- kStringDict sliced-offsets handling: detect non-zero starting offsets
and fall back to LZ4 so we don't reproduce stale prefix/suffix bytes.
- kStringDict CPU regression guard for unique-string workloads: a single
deterministic probe at clamp(numRows/8, 256, 4096) bails to LZ4 when
no duplicate is seen, capping dict-build CPU on adversarial inputs.
- Decompress output-size validation for kStringDict, kFForSplit128, and
kFForWidened32 (guards against silent corruption).
- Payload.cc kStringDict path uses the authoritative numRows from the
payload header and validates the offsets buffer shape.
Gated by spark.gluten.sql.columnar.shuffle.typeAwareCompress.enabled
(default OFF); when off, this PR is a no-op identical to current main.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…leanup * core/utils/tac/ffor.hpp: 4-arg decompress64 (outputMaxValues), tail block size validation, encodeRt/decodeRt bw>64 guards, nValues==0 preload skip — defends against malformed/truncated FFOR streams. * core/utils/tac/FForCodec.cc: forward outputMaxValues to ffor::decompress64. * core/utils/tac/TypeAwareCompressCodec.cc: in decompressSplit128 and decompressWidened32, the FFOR decompress result count is a *value* count (uint64s produced), not a byte count. Compare against the expected value count rather than against the byte length. * core/tests/FForCodecTest.cc: convert TEST_F to TEST and remove the empty fixture class. gtest forbids mixing TEST and TEST_F in the same test suite, and the fixture had no state so the conversion is behaviour-preserving.
…s tests Two TypeAwareCompressCodecTest cases asserted that the kStringDict codec emits the dict strategy on (a) a constant-value column and (b) 10K rows drawn from a 3K-pool of long comment-like strings with a shared prefix. On both shapes LZ4 actually compresses better than the dict body: - Constant data is LZ4's best case (a single match reference spans the whole input, output is on the order of hundreds of bytes), whereas the dict body still pays for one dict entry + N index bytes. - Long shared-prefix strings have ~66 bytes of identical leading text that LZ77 captures very cheaply, beating the dict body's per-row 2-byte index overhead at this cardinality. The codec's actual contract is 'pick the smaller of dict body and LZ4 body', which is exactly what it does on these inputs. The tests' intent was to verify that the v3 cardinality probe (Guard 2) does NOT false-bail on duplicate-rich data; that part is still preserved implicitly (the dict body is built and compared, otherwise the codec could not have decided LZ4 was smaller). Rename both tests to *Roundtrips, document that either strategy is acceptable on these shapes, drop the wire-strategy assertion, and keep the roundtrip + compression-ratio checks (both strategies satisfy the size bounds).
… fallback for lengths shape compressStringDict reads offsets[numRows] as the data-buffer end sentinel. That is in-bounds for shape-a (Arrow standard: validity, offsets[numRows+1], data) but ONE PAST THE END for shape-b (validity, lengths[numRows], data), since the lengths buffer holds exactly numRows int32 elements. The codec's existing 'sliced' early-exit (offsets[0] != 0) masked this in the common case where shape-b rows have non-zero first-row length — but when the first row is empty/null, offsets[0] == 0 and execution falls through to offsets[numRows], producing an out-of-bounds read. Fix: detect shape-a explicitly in Payload.cc (prevSize equals (numRows+1) * sizeof(int32_t)) and only route those inputs through compressStringDict. Shape-b and any other unexpected shape fall through to the standard LZ4/ZSTD codec — which is what the codec's internal 'sliced' fallback was producing anyway in the common case, so observable compressed size on shape-b user data is unchanged. Also preserves the authoritative-numRows handling so single-row batches do not underflow to numRows == 0. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…> 2048 Guard 2 is the single deterministic probe that bails the kStringDict dict-build loop to LZ4 when no duplicate has been seen by row clamp(numRows/8, kDictBailoutMinCheckRows, kDictBailoutMaxCheckRows). The cap was originally set to 4096 to be very conservative against the v2-era birthday-paradox false-positive (a sampled mid-cardinality column that happened to yield 256 distinct draws in a row). A cap of 2048 is still safe against that false-positive: for a cardinality C, the probability of seeing zero duplicate in 2048 draws is approximately exp(-2048^2/(2*C)). At C=8K this is ~10^-114; at C=64K it is still ~10^-14; only essentially-all-unique columns ever bail. Tightening from 4096 -> 2048 halves the work done before bailout on the str_unique shape (where bailing IS the right call), without affecting any shape that legitimately picks dict (e.g. low-cardinality keys, long-shared-prefix comments). Updated test comments to reflect the new cap. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…-comments shapes The recent *Roundtrips tests (StringDictConstantColumnRoundtrips, StringDictLongCommentsRoundtrips) only validate the input/output contract: the codec succeeds, the output shrinks, and decompress is byte-exact — they intentionally accept either strategy because both dict and LZ4 satisfy that contract. This commit adds two complementary tests, StringDictConstantColumnPicksLz4 and StringDictLongCommentsPicksLz4, that PIN the deterministic strategy choice on those same shapes: * On a constant-value column LZ4 is dominated by a single 64KB-windowed match reference and produces only hundreds of bytes, beating the dict body (header + entry + 10K 1B indices ~ 10KB) by 50-100x. * On 10K rows drawn from a 3K-pool of ~80-char strings sharing a 66-char template prefix, LZ4 captures the shared prefix as a single repeated match plus per-row tails, beating the ~272KB dict body (3K entries + 10K 2B indices) by ~5x. Pinning the choice catches regressions where a future change to dict-header accounting (or to the strategy decision) silently makes the codec prefer dict on shapes where LZ4 is much smaller — which would bloat the wire form for some of the most common low-cardinality patterns. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
53060ff to
c2f3197
Compare
The TAC wire format uses 4-byte length headers for the LZ4 payload (int codec + string-dict codec), the per-dict-entry byte count, and the dict total-bytes field. The encoder previously narrowed int64 / size_t sources into those int32 fields with no bounds check. A >2 GiB shuffle buffer would write a truncated length to the header while still copying the full payload — the reader would either reject the body (good case) or, if the truncated value happens to fit within the remaining body length, silently mis-decode. Add INT32_MAX guards at all four narrowing sites with clear `arrow::Status::Invalid` returns: * writeLz4Body (integer-codec LZ4 fallback): lz4Len * compressStringDict (dict body header): dictBytes total * compressStringDict (per-entry header): sv.size() for each entry * compressStringDict (string-dict LZ4 fallback): lz4Len Wire format unchanged on the < 2 GiB hot path; no behaviour change for existing workloads. The decode side already reads int32 fields with bound-checked length comparisons, so no reader-side change is needed. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR extends the type-aware compression from #11894 with three new codecs:
Additional improvements:
TAC is gated by spark.gluten.sql.columnar.shuffle.typeAwareCompress.enabled (default OFF), as it was in the original PR.