[VL] Add lazy per-column deserialization for Columnar Table Cache #12211

Open
jackylee-ch wants to merge 1 commit into apache:main from jackylee-ch:table-cache-lazy-deserialization

jackylee-ch (Contributor) commented Jun 1, 2026

What changes

This PR makes Velox table cache write V3 per-column framed bytes by default. Lazy materialization is a base table-cache capability; spark.gluten.sql.columnar.tableCache.partitionStats.enabled now only controls the optional stats/pruning payload.

  • Removes spark.gluten.sql.columnar.tableCache.lazy.deserialization.enabled.
  • Adds V3 no-stats serialization (statsLen=0) for the default lazy path.
  • Keeps V3 with stats for partition pruning when partition stats are enabled.
  • Keeps V2 stats and legacy raw bytes as native-capability / backward-read fallback paths.
  • Routes V3 cached bytes through projected native deserialization.
  • Adds JVM/native golden, lazy serde, and GHA benchmark coverage.
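
To make the per-column framing idea concrete, here is a minimal sketch of how a framed buffer lets a reader skip columns it does not need. The layout (magic, statsLen, stats bytes, column count, per-column lengths, payloads), the magic value, and the names `write_frame` and `LazyFrame` are all assumptions for illustration; they are not Gluten's actual V3 wire format or API.

```python
import struct

MAGIC = b"HYP3"  # hypothetical magic, not Gluten's real value


def write_frame(columns, stats=b""):
    """Frame each column separately so a reader can skip unneeded ones."""
    header = MAGIC + struct.pack("<I", len(stats)) + stats
    header += struct.pack("<I", len(columns))
    header += b"".join(struct.pack("<I", len(c)) for c in columns)
    return header + b"".join(columns)


class LazyFrame:
    """Parses only the header up front; column bytes are sliced on demand."""

    def __init__(self, data):
        if data[:4] != MAGIC:
            raise ValueError("not a framed buffer")
        (stats_len,) = struct.unpack_from("<I", data, 4)
        pos = 8
        self.stats = data[pos:pos + stats_len]  # empty when statsLen == 0
        pos += stats_len
        (num_cols,) = struct.unpack_from("<I", data, pos)
        pos += 4
        lengths = struct.unpack_from("<%dI" % num_cols, data, pos)
        pos += 4 * num_cols
        self._data = data
        self._spans = []
        for n in lengths:
            self._spans.append((pos, pos + n))
            pos += n
        self.decoded = set()  # records which columns were materialized

    def column(self, i):
        start, end = self._spans[i]
        self.decoded.add(i)
        return self._data[start:end]
```

In this sketch a projection that touches one column of sixteen only slices that column's span; a no-stats frame is simply the same layout with `statsLen` set to 0.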

Performance

Four-mode benchmark: eager V2 vs lazy V3, each without and with the optional partition-stats payload.

  • Source: ColumnarTableCacheLazyDeserBenchmark, GitHub Actions Velox Backend (x86) run 26906231294 (branch head 2538fe501).
  • Environment: Linux x86_64, AMD EPYC 7763, JDK 8, JVM heap ~9.95 GiB (-XX:MaxRAMPercentage=70).
  • Dataset: 16-column wide schema, 32 partitions, 3 iterations. The benchmark auto-scales the row count to fit the runner heap across the 4 simultaneously-cached modes; the assigned hosted runner (~14 GiB) yielded 11,856,653 rows (requested 100M — a 100M-scale run needs a larger-RAM runner).
  • Modes: V2 without stats = legacy raw Presto (eager, no pruning); V2 with stats = framedSerializeWithStats (eager + partition-stats pruning); V3 without stats = per-column lazy (default); V3 with stats = per-column lazy + pruning.

Cache footprint (storage memory)

| Mode | Footprint (MiB) |
| --- | --- |
| V2 without stats | 1176.82 |
| V2 with stats | 1176.83 |
| V3 without stats | 1176.83 |
| V3 with stats | 1176.82 |

V3 per-column framing does not increase cache size vs eager V2/legacy for flat (non-dictionary) data, and the stats payload is negligible. This addresses the cache-footprint-regression concern.

Build / write — avg ms over 3 iters (lower is better)

| Mode | Avg (ms) | Best (ms) |
| --- | --- | --- |
| V2 without stats | 137310 | 136209 |
| V2 with stats | 136835 | 135598 |
| V3 without stats | 137179 | 136725 |
| V3 with stats | 138712 | 136792 |

Write time is within ~1.4% across all four modes on averages (under 1% on best times), so V3 framing and stats computation add no measurable write overhead; the phase is dominated by range generation + range-repartition shuffle.
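
The spread between the fastest and slowest mode can be checked directly from the build/write table. The snippet below is only arithmetic over the reported numbers; the helper name `spread_pct` is introduced here for illustration.

```python
# Build/write times in ms, copied from the table above.
avg_ms = {
    "V2 without stats": 137310,
    "V2 with stats": 136835,
    "V3 without stats": 137179,
    "V3 with stats": 138712,
}
best_ms = {
    "V2 without stats": 136209,
    "V2 with stats": 135598,
    "V3 without stats": 136725,
    "V3 with stats": 136792,
}


def spread_pct(times):
    """Relative gap between the slowest and fastest mode, in percent."""
    lo, hi = min(times.values()), max(times.values())
    return 100.0 * (hi - lo) / lo


avg_spread = spread_pct(avg_ms)    # roughly 1.4
best_spread = spread_pct(best_ms)  # roughly 0.9
```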

Read — avg ms over 3 iters (lower is better)

| Mode | 1/16 cols, sum(c0) | 4/16 cols, group+agg | all 16 cols | filter + 2/16 cols |
| --- | --- | --- | --- | --- |
| V2 without stats | 278 | 1274 | 5014 | 73 |
| V2 with stats | 270 | 1268 | 5002 | 74 |
| V3 without stats | 286 | 1240 | 5009 | 69 |
| V3 with stats | 251 | 1260 | 5003 | 103 |

  • All-column read is effectively identical across modes (5002–5014 ms): V3 LazyVector wrapping adds no measurable overhead when every column is materialized.
  • Projected reads (1/16, 4/16 cols): V3 is comparable-to-slightly-faster (best-time read-1col: V3 240–246 ms vs V2 257–268 ms, ~1.1x). At ~12M rows absolute read times are sub-second and dominated by aggregation compute, so the per-column lazy-decode skip stays within run-to-run noise here; the benefit is expected to widen at larger scale / with decode-bound reads.
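
For reference, the average read times can be normalized against the V2-without-stats baseline. This is plain arithmetic over the table, not new measurement. It also covers the filter column, which the bullets above do not discuss: V3 with stats averaged 103 ms there against 69–74 ms for the other modes, and three iterations are not enough to tell whether that is noise.

```python
# Average read times in ms, copied from the read table above.
queries = ["1/16 cols", "4/16 cols", "all 16 cols", "filter + 2/16 cols"]
read_avg_ms = {
    "V2 without stats": [278, 1274, 5014, 73],
    "V2 with stats": [270, 1268, 5002, 74],
    "V3 without stats": [286, 1240, 5009, 69],
    "V3 with stats": [251, 1260, 5003, 103],
}

baseline = read_avg_ms["V2 without stats"]

# Ratio < 1.0 means faster than the baseline, > 1.0 means slower.
ratios = {
    mode: [round(t / b, 2) for t, b in zip(times, baseline)]
    for mode, times in read_avg_ms.items()
}

for mode, row in ratios.items():
    print(mode, dict(zip(queries, row)))
```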

Rows were auto-scaled to ~12M to fit the assigned hosted runner. The footprint-parity and no-regression conclusions are scale-robust; re-run on a high-RAM runner (or tune table_cache_benchmark_max_ram_pct) for the 100M/32-partition significance run.

How was this patch tested?

  • ./dev/format-scala-code.sh
  • PATH="/opt/homebrew/opt/llvm@15/bin:$PATH" ./dev/format-cpp-code.sh
  • git diff --check upstream/main..HEAD
  • ruby -e 'require "yaml"; YAML.load_file(".github/workflows/velox_backend_x86.yml"); puts "yaml ok"'
  • ./.github/workflows/util/check.sh upstream/main
  • env CCACHE_DIR=/private/tmp/gluten-ccache ninja -C cpp/build velox/tests/CMakeFiles/velox_operators_test.dir/VeloxColumnarBatchSerializerTest.cc.o
  • ./build/mvn install -pl backends-velox -am -Pspark-3.5 -Pscala-2.12 -Pbackends-velox -DskipTests -Dexec.skip
  • Local benchmark runability smoke only, not used as PR performance data: Java 8, ColumnarTableCacheLazyDeserBenchmark with 1000 rows, 4 partitions, 1 iteration, phases build,read1,read4,readAll,filter.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Codex GPT-5

@github-actions github-actions Bot added the CORE (works for Gluten Core), VELOX, and DOCS labels Jun 1, 2026
@jackylee-ch jackylee-ch marked this pull request as draft June 1, 2026 04:58
github-actions Bot commented Jun 1, 2026

Run Gluten Clickhouse CI on x86

@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch from 58bd451 to d5a0502 Compare June 1, 2026 08:59
@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch from d5a0502 to 8e374db Compare June 1, 2026 09:05
@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch from 8e374db to 0f0ccd2 Compare June 1, 2026 09:08
@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch from 0f0ccd2 to 8b09d6b Compare June 1, 2026 11:21
@jackylee-ch jackylee-ch marked this pull request as ready for review June 1, 2026 14:20
jackylee-ch (Contributor, Author) commented

@yaooqinn PTAL

yaooqinn (Member) commented Jun 2, 2026

Thanks @jackylee-ch, the V3 layout is a sensible extension of the cache-stats wire format we landed in #12092 / #12196. Several things to discuss before this lands:

1. Benchmark needs to be re-run. The checked-in -results.txt is 10K rows / 4 partitions / 1 iteration on an Apple M5 Pro; Stdev=0 across the board because there's only one sample. Differences in the 1-3 ms range (e.g. "1.1X" at all-16-cols read, where lazy mode physically cannot be faster than eager) are noise. The 1.9X build result is also surprising: V3 does N serializeSingleColumn calls vs V2's single-pass batchSerialize, so the ordering legacy > V2 > V3 doesn't match the physical work done. This needs reruns on a server / GHA-equivalent runner with iter≥3 and 100M rows / 32 partitions (matching the code defaults). Please also add a cache memory footprint column: V3 per-col framing + getFlattenedRowVector() flattening Dictionary/Constant encodings could regress cache size significantly for dict-encoded payloads, and that's currently unmeasured.

2. Do we really need a new SQLConf? V3 functionally supersedes V2 (V3 frames also carry statsBlob), so this isn't a new behavioral feature — it's a wire-format upgrade. Adding a dedicated lazy.deserialization.enabled boolean commits Gluten to maintaining three cache paths (legacy / V2-stats / V3-lazy-and-stats) and a three-level fallback chain. Once we trust V3, we'd want to deprecate V2-stats, which means another deprecation cycle. Could we either (a) skip the conf and gate V3 behind partitionStats.enabled once it's stable, or (b) turn partitionStats.enabled into a string conf with off | v2 | v3 values? Configuration.md already warns "V3 is NOT backward compatible with V2 readers" + default=false — operationally nobody is going to flip this, so the conf risks being long-lived dead code.
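
Option (b) could look roughly like the sketch below: a single string-valued setting parsed into an enum. The names `PartitionStatsMode` and `parse_partition_stats_mode` are hypothetical, and mapping the legacy boolean `true` to `v3` is an assumption made here for illustration, not something the PR or the review decided.

```python
from enum import Enum


class PartitionStatsMode(Enum):
    OFF = "off"
    V2 = "v2"
    V3 = "v3"


# Assumption: legacy boolean values stay accepted so existing configs keep working.
_LEGACY_BOOLEANS = {
    "false": PartitionStatsMode.OFF,
    "true": PartitionStatsMode.V3,
}


def parse_partition_stats_mode(raw):
    """Map a raw config string to a mode, rejecting unknown values."""
    value = raw.strip().lower()
    if value in _LEGACY_BOOLEANS:
        return _LEGACY_BOOLEANS[value]
    try:
        return PartitionStatsMode(value)
    except ValueError:
        allowed = ", ".join(m.value for m in PartitionStatsMode)
        raise ValueError("expected one of: %s (got %r)" % (allowed, raw))
```

A single enum-valued setting keeps the number of cache paths visible in one place, at the cost of a migration rule for the existing boolean.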

3. Cross-language test parity vs #12196. V3 has no cpp-side byte-equal golden test; JVM-side tests synthesize their own frames via craftV3Framed. We just established the cpp-golden ↔ JVM-parser round-trip pattern in #12196 specifically because layout drift between halves is a correctness hazard. V3 needs the same: a framedSerializeWithStatsV3Golden cpp test pinning a byte-stable literal + a JVM parser round-trip over that same literal.
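
The golden pattern described here can be sketched in a few lines: one side pins the serializer output to a byte-stable literal, the other side parses that same literal. The layout and the `HYP3` magic below are hypothetical stand-ins, not the real V3 format; in the actual codebase the writer half would be the cpp test and the parser half the JVM test.

```python
import struct

# Hypothetical layout: magic | statsLen | numCols | colLen per column | payloads.
GOLDEN = bytes.fromhex(
    "48595033"  # magic "HYP3"
    "00000000"  # statsLen = 0 (no-stats frame)
    "02000000"  # numCols = 2
    "02000000"  # length of column 0
    "01000000"  # length of column 1
    "010203"    # column payloads back to back
)


def serialize(columns):
    """Writer half: must reproduce GOLDEN byte for byte."""
    out = b"HYP3" + struct.pack("<II", 0, len(columns))
    out += b"".join(struct.pack("<I", len(c)) for c in columns)
    return out + b"".join(columns)


def parse(data):
    """Parser half: must recover the same fields from GOLDEN."""
    magic, stats_len = struct.unpack_from("<4sI", data, 0)
    pos = 8 + stats_len
    (num_cols,) = struct.unpack_from("<I", data, pos)
    lengths = struct.unpack_from("<%dI" % num_cols, data, pos + 4)
    pos += 4 + 4 * num_cols
    cols = []
    for n in lengths:
        cols.append(data[pos:pos + n])
        pos += n
    return magic, stats_len, cols
```

Because both halves are checked against the same literal, a layout change on either side fails a test instead of silently drifting.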

4. Smaller items.

  • All-null column case not covered (we hit the PrestoSerde uninit-values bug during development of #12092, "[VL] Add min/max partition stats to columnar InMemoryRelation cache for partition pruning"; same risk class for the per-col path).
  • getFlattenedRowVector() side effect on Dictionary/Constant encoding not documented.
  • The // JNI pin outlives comment in deserializeV3 describes a non-issue (copies are made synchronously in step 6, the lazy loader doesn't depend on the pin) — please trim.
  • Two near-identical magic checks (parseFramedBytes byte[3] dispatch vs isV3Format 4-byte compare) — please consolidate.
  • Consider folding statsExtV3AvailableFlag and statsExtAvailableFlag into a single capability enum (Unknown | V2 | V3 | Unavailable) — two independent one-shot latches double the operational diagnosis surface.
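
The last item, folding two one-shot flags into one capability state, might be sketched as follows. `StatsExtCapability`, `CapabilityLatch`, and the probe callables are hypothetical names for illustration; the real flags live in Gluten's JVM code and may be structured differently.

```python
from enum import Enum


class StatsExtCapability(Enum):
    UNKNOWN = "unknown"
    V2 = "v2"
    V3 = "v3"
    UNAVAILABLE = "unavailable"


class CapabilityLatch:
    """Resolves native capability once and remembers the answer."""

    def __init__(self, probe_v3, probe_v2):
        # The probes stand in for "does the native library expose this entry point".
        self._probe_v3 = probe_v3
        self._probe_v2 = probe_v2
        self.state = StatsExtCapability.UNKNOWN

    def resolve(self):
        if self.state is StatsExtCapability.UNKNOWN:
            if self._probe_v3():
                self.state = StatsExtCapability.V3
            elif self._probe_v2():
                self.state = StatsExtCapability.V2
            else:
                self.state = StatsExtCapability.UNAVAILABLE
        return self.state
```

With a single state there is one value to log and inspect when diagnosing which cache path a given executor is on.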

Happy to file any of these as separate issues if it helps.

@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch from 8b09d6b to 09679ee Compare June 2, 2026 06:24
@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch from 09679ee to ab9e0f7 Compare June 2, 2026 06:30
@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch from ab9e0f7 to 144e816 Compare June 2, 2026 06:47
@github-actions github-actions Bot removed the CORE (works for Gluten Core) label Jun 2, 2026
@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch 2 times, most recently from b77f4ab to 9a0f96a Compare June 2, 2026 07:28
@github-actions github-actions Bot removed the DOCS label Jun 2, 2026
@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch from 9a0f96a to b5b1906 Compare June 2, 2026 09:01
@github-actions github-actions Bot added the INFRA label Jun 2, 2026
@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch 3 times, most recently from 2b96545 to c3cc1bd Compare June 2, 2026 15:28
@github-actions github-actions Bot added the CORE (works for Gluten Core) label Jun 2, 2026
@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch from c3cc1bd to 97a6019 Compare June 3, 2026 03:42
@github-actions github-actions Bot added the DOCS label Jun 3, 2026
@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch from 97a6019 to 9971c91 Compare June 3, 2026 03:52
@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch from 9971c91 to f576df8 Compare June 3, 2026 06:33
@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch from f576df8 to f17dc6a Compare June 3, 2026 06:51
@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch from f17dc6a to cda20eb Compare June 3, 2026 09:27
@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch 3 times, most recently from decdd0e to ab055c5 Compare June 3, 2026 14:16
Write V3 per-column cache bytes by default for Velox table cache. Partition stats now only controls the optional stats/pruning payload: stats off writes a no-stats V3 frame, stats on writes V3 with stats, and older native libraries still fall back to V2 stats or legacy bytes.

Add the V3 no-stats JNI/native serializer, JVM parsing for statsLen=0, cross-language golden coverage, and GitHub Actions benchmark execution without committing local benchmark results.

Change-Id: I2a8582f901fafd436cac1a1d16e0367e9330b336
@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch from ab055c5 to 2538fe5 Compare June 3, 2026 18:55