[GLUTEN-10215][VL] Delta write: Fix native partitioned layout accounting #12016

Open: malinjawi wants to merge 1 commit into apache:main from malinjawi:feature/delta-native-write-layout-accounting
malinjawi (Contributor) commented Apr 30, 2026

What changes were proposed in this pull request?

This patch fixes native Delta partitioned write layout accounting in the Velox backend.

The change:

  • Writes each native partition stripe as its own accounting unit.
  • Enforces maxRecordsPerFile within native partition stripes by slicing columnar batches when needed.
  • Updates recordsInFile by the actual written chunk row count instead of the original input batch row count.
  • Adds Delta 4.0 coverage for optimized partitioned native writes with stats enabled.
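
The slicing and accounting steps listed above can be illustrated with a minimal sketch. This is not the code from this PR: `StripeSlicer`, `Chunk`, and `plan` are hypothetical names, and a partition stripe is modeled only by its row count rather than by a Velox columnar batch. The sketch assumes the usual Spark convention that a non-positive maxRecordsPerFile means no limit.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: plans how one partition stripe is split across files.
final class StripeSlicer {

    // One planned write: rows [offset, offset + length) of the stripe go to file fileIndex.
    static final class Chunk {
        final int fileIndex;
        final int offset;
        final int length;

        Chunk(int fileIndex, int offset, int length) {
            this.fileIndex = fileIndex;
            this.offset = offset;
            this.length = length;
        }
    }

    // recordsInFile is the number of rows already written to the current file.
    // Assumption: maxRecordsPerFile <= 0 disables the limit.
    static List<Chunk> plan(int stripeRows, long recordsInFile, long maxRecordsPerFile) {
        List<Chunk> chunks = new ArrayList<>();
        boolean limited = maxRecordsPerFile > 0;
        int fileIndex = 0;
        int offset = 0;
        while (offset < stripeRows) {
            // Roll over to a new file once the current one is full.
            if (limited && recordsInFile >= maxRecordsPerFile) {
                fileIndex++;
                recordsInFile = 0;
            }
            long capacity = limited ? maxRecordsPerFile - recordsInFile : Long.MAX_VALUE;
            int length = (int) Math.min(capacity, (long) (stripeRows - offset));
            chunks.add(new Chunk(fileIndex, offset, length));
            // Account by the rows actually written in this chunk,
            // not by the row count of the whole input stripe.
            recordsInFile += length;
            offset += length;
        }
        return chunks;
    }
}
```

Under these assumptions, `StripeSlicer.plan(10, 0, 4)` yields three chunks of 4, 4, and 2 rows for files 0, 1, and 2, so no file exceeds the limit.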

The same writer accounting fix is applied to Delta 3.3 and Delta 4.0 sources.

The partition-column split-output contract has been moved out of this PR and into a follow-up branch: malinjawi:feature/delta-native-write-partition-column-output.

Why are the changes needed?

The existing native partitioned writer splits incoming Velox batches by partition, but it accounts for file layout at the level of the original batch. As a result, optimized partitioned writes can violate maxRecordsPerFile when a single native partition stripe holds more rows than the file limit.

Does this PR introduce any user-facing change?

No public API change. This improves correctness/layout behavior for native Delta writes.

How was this patch tested?

Built locally and ran:

JAVA_HOME=/Library/Java/JavaVirtualMachines/zulu-17.jdk/Contents/Home \
./dev/run-scala-test.sh --force \
  -Pjava-17,spark-4.0,scala-2.13,backends-velox,hadoop-3.3,spark-ut,delta \
  -pl backends-velox \
  -s org.apache.spark.sql.delta.DeltaNativeWriteSuite \
  -t "native delta optimized partitioned write should collect stats and honor file layout"

Result: 1 test passed, 0 failures.

Also ran Spark 3.5 backends-velox test compilation and git diff --check.

Related issue: #10215

zhztheplayer (Member) commented

@malinjawi

> Enforces maxRecordsPerFile within native partition stripes by slicing columnar batches when needed.
>
> Preserves partition columns in split output only when Delta's write contract includes partition columns in dataColumns.

Thanks. Can we separate the above 2 parts into 2 PRs?

malinjawi (Contributor, Author) commented May 11, 2026

Thanks @zhztheplayer! Split done.

I force-pushed this PR down to the native partition-stripe layout accounting fix only, and rebased it on current main, so the DeltaNativeWriteSuite conflict is gone. The partition-column split-output contract is moved to draft follow-up PR #12069.

This PR now keeps only:

  • native partition stripe accounting
  • maxRecordsPerFile slicing within native partition stripes
  • the optimized partitioned write layout/stat coverage

malinjawi force-pushed the feature/delta-native-write-layout-accounting branch from a8a4db2 to 5e1262c on May 11, 2026 10:44