[VL][DELTA] Support Delta CDF scan offload #12218
Open
malinjawi wants to merge 4 commits into
What changes are proposed in this pull request?
Addresses #12195.
Delta CDF reads enter Spark as `CDCReader.DeltaCDFRelation`, so they do not initially have the normal `FileSourceScanExec` + `DeltaParquetFileFormat` shape that Gluten's existing Delta scan offload rule recognizes.

This PR adds a Gluten Delta planner strategy, wired from the Velox Delta component, that recognizes `DeltaCDFRelation`, expands it through Delta's own CDF batch planning path, and rewrites the original projection/filter attributes onto the expanded logical plan. After that, the existing Delta scan offload path can plan the underlying CDF file scans as `DeltaScanTransformer`.

The change is intentionally scoped to Gluten's Delta/Spark planning layer rather than Velox C++:
- `DeltaCDFScanStrategy` for `table_changes(...)` and DataFrame `readChangeFeed` scans.
- `VeloxDeltaComponent` wiring for the new strategy.
- Regression coverage for `readChangeFeed`, column mapping, and a `startingVersion = 0` case.

One planner/test nuance: Delta CDF expansion can keep a Spark-side `ExistingRDD` branch for synthesized change rows, including on the tested update/delete CDF paths. The regression suite therefore compares results against vanilla Spark and asserts that the expanded CDF file scans are transformed to `DeltaScanTransformer`, rather than requiring the entire expanded CDF union to be globally fallback-free.

How was this patch tested?
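The assertion shape from the planner/test nuance above can be sketched with a toy plan tree. This is a minimal illustration, not the regression suite itself: `PlanNode`, `NativeDeltaScan`, `VanillaFileScan`, `ExistingRows`, `Union`, and `fileScansOffloaded` are hypothetical stand-ins invented for this sketch and are not Spark, Delta, or Gluten classes. The idea is that the check looks only at file-scan nodes and tolerates a Spark-side branch for synthesized rows.

```scala
// Toy plan model. All names here are hypothetical stand-ins for illustration only.
sealed trait PlanNode { def children: Seq[PlanNode] }

// Stands in for a CDF file scan that was offloaded (DeltaScanTransformer in the PR text).
case class NativeDeltaScan(label: String) extends PlanNode {
  def children: Seq[PlanNode] = Nil
}

// Stands in for a file scan that stayed on vanilla Spark.
case class VanillaFileScan(label: String) extends PlanNode {
  def children: Seq[PlanNode] = Nil
}

// Stands in for the Spark-side ExistingRDD branch holding synthesized change rows.
case class ExistingRows(label: String) extends PlanNode {
  def children: Seq[PlanNode] = Nil
}

// Stands in for the union produced by CDF expansion.
case class Union(children: Seq[PlanNode]) extends PlanNode

object CdfPlanCheck {
  // Flatten the tree into a pre-order list of nodes.
  def collectNodes(plan: PlanNode): Seq[PlanNode] =
    plan +: plan.children.flatMap(collectNodes)

  // True when at least one file scan exists and every file scan is native.
  // Non-scan nodes such as ExistingRows are deliberately ignored.
  def fileScansOffloaded(plan: PlanNode): Boolean = {
    val nodes = collectNodes(plan)
    val native = nodes.collect { case s: NativeDeltaScan => s }
    val vanilla = nodes.collect { case s: VanillaFileScan => s }
    native.nonEmpty && vanilla.isEmpty
  }

  def main(args: Array[String]): Unit = {
    // Expanded CDF plan: offloaded file scans plus a Spark-side branch.
    val expanded = Union(Seq(
      NativeDeltaScan("cdc-files"),
      NativeDeltaScan("data-files"),
      ExistingRows("synthesized-change-rows")))
    assert(fileScansOffloaded(expanded))

    // A plan whose file scan was not offloaded fails the check.
    val notOffloaded = Union(Seq(
      VanillaFileScan("cdc-files"),
      ExistingRows("synthesized-change-rows")))
    assert(!fileScansOffloaded(notOffloaded))
  }
}
```

A stricter "no fallback anywhere" check would reject the first plan because of the `ExistingRows` branch, which is why the sketch scopes the assertion to file scans and leaves result correctness to a separate comparison against vanilla Spark.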
Local checks used `JAVA_HOME=/opt/homebrew/opt/openjdk@17/libexec/openjdk.jdk/Contents/Home`.

- `git diff --check`
- `./dev/format-scala-code.sh check`
- `./build/mvn -pl gluten-delta -am -Pspark-3.5 -Pjava-17 -Pbackends-velox -Pdelta -DskipTests test-compile`
- `./build/mvn -pl gluten-delta -am -Pspark-4.0 -Pscala-2.13 -Pjava-17 -Pbackends-velox -Pdelta -DskipTests test-compile`
- `./build/mvn -pl gluten-delta -am -Pspark-4.1 -Pscala-2.13 -Pjava-17 -Pbackends-velox -Pdelta -DskipTests test-compile`

Earlier cross-version compile checks also passed before the test-expectation-only update:

- `./build/mvn -pl gluten-delta -am -Pspark-3.3 -Pjava-17 -Pbackends-velox -Pdelta -DskipTests -Dcheckstyle.skip=true -Dscalastyle.skip=true -Dspotless.check.skip=true test-compile`
- `./build/mvn -pl gluten-delta -am -Pspark-3.4 -Pjava-17 -Pbackends-velox -Pdelta -DskipTests -Dcheckstyle.skip=true -Dscalastyle.skip=true -Dspotless.check.skip=true test-compile`
- `./build/mvn -pl gluten-delta -am -Pspark-3.5 -Pjava-17 -Pbackends-velox -Pdelta -DskipTests -Dcheckstyle.skip=true -Dscalastyle.skip=true -Dspotless.check.skip=true test-compile`
- `./build/mvn -pl gluten-delta -am -Pspark-4.0 -Pscala-2.13 -Pjava-17 -Pbackends-velox -Pdelta -DskipTests -Dcheckstyle.skip=true -Dscalastyle.skip=true -Dspotless.check.skip=true test-compile`
- `./build/mvn -pl gluten-delta -am -Pspark-4.1 -Pscala-2.13 -Pjava-17 -Pbackends-velox -Pdelta -DskipTests -Dcheckstyle.skip=true -Dscalastyle.skip=true -Dspotless.check.skip=true test-compile`

Full native Velox runtime and benchmarking are left to CI / a native Gluten-optimized environment; this local checkout does not have `cpp/build/releases/libgluten.so` and the Velox external project build available.

Was this patch authored or co-authored using generative AI tooling?
Generated-by: IBM BOB