[VL][DELTA] Support Delta CDF scan offload #12218

Open
malinjawi wants to merge 4 commits into apache:main from malinjawi:codex/delta-cdf-offload

Contributor

@malinjawi malinjawi commented Jun 1, 2026

What changes are proposed in this pull request?

Addresses #12195.

Delta CDF reads enter Spark as CDCReader.DeltaCDFRelation, so they do not initially have the normal FileSourceScanExec + DeltaParquetFileFormat shape that Gluten's existing Delta scan offload rule recognizes.

This PR adds a Gluten Delta planner strategy, wired from the Velox Delta component, that recognizes DeltaCDFRelation, expands it through Delta's own CDF batch planning path, and rewrites the original projection/filter attributes onto the expanded logical plan. After that, the existing Delta scan offload path can plan the underlying CDF file scans as DeltaScanTransformer.
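The attribute rewrite step described above can be pictured with a small toy model. This is only a sketch under stated assumptions: `Attr`, `Ref`, `Lit`, `EqualTo`, `And`, and `AttributeRemap` are hypothetical stand-ins rather than Spark Catalyst or Gluten classes, and matching attributes by column name is an assumption for illustration, not necessarily how DeltaCDFScanStrategy pairs them.

```scala
// Toy model only: these types are NOT Spark Catalyst or Gluten classes.
final case class Attr(name: String, exprId: Long)

sealed trait Expr
final case class Ref(attr: Attr) extends Expr
final case class Lit(value: Any) extends Expr
final case class EqualTo(left: Expr, right: Expr) extends Expr
final case class And(left: Expr, right: Expr) extends Expr

object AttributeRemap {
  // Pair each attribute of the original relation with the attribute of the
  // same name in the expanded plan (name-based matching is an assumption).
  def buildMapping(original: Seq[Attr], expanded: Seq[Attr]): Map[Long, Attr] = {
    val byName = expanded.map(a => a.name -> a).toMap
    original.flatMap(o => byName.get(o.name).map(e => o.exprId -> e)).toMap
  }

  // Replace every attribute reference that has a mapping entry and leave
  // literals and unmapped references untouched.
  def rewrite(expr: Expr, mapping: Map[Long, Attr]): Expr = expr match {
    case Ref(a)        => Ref(mapping.getOrElse(a.exprId, a))
    case EqualTo(l, r) => EqualTo(rewrite(l, mapping), rewrite(r, mapping))
    case And(l, r)     => And(rewrite(l, mapping), rewrite(r, mapping))
    case lit: Lit      => lit
  }
}

object AttributeRemapExample {
  // Output of the original CDF relation and of the expanded plan: same
  // column names, different expression ids.
  val original: Seq[Attr] = Seq(Attr("id", 1L), Attr("_change_type", 2L))
  val expanded: Seq[Attr] = Seq(Attr("id", 101L), Attr("_change_type", 102L))

  val mapping: Map[Long, Attr] = AttributeRemap.buildMapping(original, expanded)

  // A filter written against the original relation's attributes...
  val filter: Expr = EqualTo(Ref(original(1)), Lit("insert"))
  // ...is rewritten so that it refers to the expanded plan's attributes.
  val rewritten: Expr = AttributeRemap.rewrite(filter, mapping)
}
```

In this toy, the filter on `_change_type` ends up pointing at the expanded plan's attribute, which is the property the real strategy needs so that the projection and filter still resolve after the relation has been replaced.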

The change is intentionally scoped to Gluten's Delta/Spark planning layer rather than Velox C++:

  • Add DeltaCDFScanStrategy for table_changes(...) and DataFrame readChangeFeed scans.
  • Add Delta-version helper shims for Delta 2.3, 2.4, 3.3, and 4.x API differences.
  • Register the planner strategy from VeloxDeltaComponent.
  • Add Delta regression coverage for insert/update/delete CDF rows, filter/projection handling, bounded version reads, DataFrame readChangeFeed, column mapping, and a startingVersion = 0 case.

One planner/test nuance: Delta CDF expansion can keep a Spark-side ExistingRDD branch for synthesized change rows, including on the tested update/delete CDF paths. The regression suite therefore compares results against vanilla Spark and asserts that the expanded CDF file scans are transformed to DeltaScanTransformer, rather than requiring the entire expanded CDF union to be globally fallback-free.
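The shape of that assertion can be sketched with a toy plan tree. Everything here is hypothetical: `Node`, `UnionNode`, `ProjectNode`, `OffloadedDeltaScan`, `VanillaFileScan`, `ExistingRddNode`, and `PlanChecks` are illustrative stand-ins, not Spark or Gluten classes, and the exact condition the regression suite checks may differ from this one.

```scala
// Toy plan tree only: none of these are Spark or Gluten classes.
sealed trait Node { def children: Seq[Node] }
final case class UnionNode(children: Seq[Node]) extends Node
final case class ProjectNode(child: Node) extends Node {
  val children: Seq[Node] = Seq(child)
}
// Stand-in for a file scan that was offloaded (DeltaScanTransformer in the PR).
final case class OffloadedDeltaScan(path: String) extends Node {
  val children: Seq[Node] = Nil
}
// Stand-in for a file scan that stayed on vanilla Spark.
final case class VanillaFileScan(path: String) extends Node {
  val children: Seq[Node] = Nil
}
// Stand-in for the Spark-side ExistingRDD branch holding synthesized rows.
final case class ExistingRddNode(label: String) extends Node {
  val children: Seq[Node] = Nil
}

object PlanChecks {
  // Pre-order traversal that collects every node the partial function matches.
  def collect[T](plan: Node)(pf: PartialFunction[Node, T]): Seq[T] =
    pf.lift(plan).toSeq ++ plan.children.flatMap(child => collect(child)(pf))

  // True when at least one file scan exists and all of them were offloaded.
  // ExistingRddNode branches are deliberately ignored by this check.
  def fileScansOffloaded(plan: Node): Boolean = {
    val offloaded = collect(plan) { case s: OffloadedDeltaScan => s }
    val vanilla = collect(plan) { case s: VanillaFileScan => s }
    offloaded.nonEmpty && vanilla.isEmpty
  }
}

object PlanChecksExample {
  // Expanded CDF union: one offloaded file scan plus a Spark-side RDD branch.
  val mixedPlan: Node = UnionNode(Seq(
    ProjectNode(OffloadedDeltaScan("cdc-files")),
    ExistingRddNode("synthesized-change-rows")))

  // Same union, but the file scan was not offloaded.
  val fallbackPlan: Node = UnionNode(Seq(
    ProjectNode(VanillaFileScan("cdc-files")),
    ExistingRddNode("synthesized-change-rows")))
}
```

Here `PlanChecks.fileScansOffloaded` accepts `mixedPlan` even though one branch stays on Spark, and rejects `fallbackPlan`. A check that required the whole plan to be fallback-free would reject both.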

How was this patch tested?

Local checks used JAVA_HOME=/opt/homebrew/opt/openjdk@17/libexec/openjdk.jdk/Contents/Home.

  • git diff --check
  • ./dev/format-scala-code.sh check
  • ./build/mvn -pl gluten-delta -am -Pspark-3.5 -Pjava-17 -Pbackends-velox -Pdelta -DskipTests test-compile
  • ./build/mvn -pl gluten-delta -am -Pspark-4.0 -Pscala-2.13 -Pjava-17 -Pbackends-velox -Pdelta -DskipTests test-compile
  • ./build/mvn -pl gluten-delta -am -Pspark-4.1 -Pscala-2.13 -Pjava-17 -Pbackends-velox -Pdelta -DskipTests test-compile

Earlier cross-version compile checks also passed before the test-expectation-only update:

  • ./build/mvn -pl gluten-delta -am -Pspark-3.3 -Pjava-17 -Pbackends-velox -Pdelta -DskipTests -Dcheckstyle.skip=true -Dscalastyle.skip=true -Dspotless.check.skip=true test-compile
  • ./build/mvn -pl gluten-delta -am -Pspark-3.4 -Pjava-17 -Pbackends-velox -Pdelta -DskipTests -Dcheckstyle.skip=true -Dscalastyle.skip=true -Dspotless.check.skip=true test-compile
  • ./build/mvn -pl gluten-delta -am -Pspark-3.5 -Pjava-17 -Pbackends-velox -Pdelta -DskipTests -Dcheckstyle.skip=true -Dscalastyle.skip=true -Dspotless.check.skip=true test-compile
  • ./build/mvn -pl gluten-delta -am -Pspark-4.0 -Pscala-2.13 -Pjava-17 -Pbackends-velox -Pdelta -DskipTests -Dcheckstyle.skip=true -Dscalastyle.skip=true -Dspotless.check.skip=true test-compile
  • ./build/mvn -pl gluten-delta -am -Pspark-4.1 -Pscala-2.13 -Pjava-17 -Pbackends-velox -Pdelta -DskipTests -Dcheckstyle.skip=true -Dscalastyle.skip=true -Dspotless.check.skip=true test-compile

Full native Velox runtime testing and benchmarking are left to CI or a native Gluten-optimized environment; this local checkout has neither cpp/build/releases/libgluten.so nor the Velox external project build.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: IBM BOB

github-actions Bot commented Jun 1, 2026

Run Gluten Clickhouse CI on x86

@malinjawi malinjawi marked this pull request as ready for review June 1, 2026 18:50