[VL][Delta] Add DV scan info extraction utility #12197

Merged
zhztheplayer merged 3 commits into apache:main from malinjawi:split/delta-dv-scan-info-utils-pr on Jun 3, 2026

@malinjawi
Contributor

What changes are proposed in this pull request?

This PR is the next split from the Delta deletion-vector (DV) scan stack, following the native reader support already merged in #12040 and before the full JVM scan handoff work from #12131.

It adds a focused Scala utility layer that extracts the essential DV scan information from Spark/Delta PartitionedFile metadata without changing scan offload behavior yet.

Main changes:

  • add DeltaDeletionVectorScanInfo for Delta 3.3 and Delta 4.0 source sets
  • extract per-file DV scan info from PartitionedFile metadata:
    • row-index filter type
    • deletion-vector descriptor and cardinality
    • serialized DV bitmap payload bytes
    • normalized non-DV metadata columns
  • keep the utility independent from Substrait, Velox native split conversion, and scan offload behavior
  • add focused Delta 3.3 and Delta 4.0 tests for DV extraction, keep-all/no-DV extraction, and invalid partial DV metadata
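
The extraction described in the list above can be sketched roughly as follows. This is a simplified, self-contained model and not the PR's code: the metadata keys, the `FilterKind` values, the `DvScanInfo` shape, and the `extract` function are all hypothetical stand-ins, and a plain `Map[String, Any]` takes the place of the `PartitionedFile` metadata. It only illustrates the three cases the tests cover: full DV metadata, no DV metadata (keep all rows), and invalid partial DV metadata.

```scala
// Illustrative model only: every name here is an assumption, not the PR's API.
object DvScanInfoSketch {
  // Stand-in for the row-index filter type carried in the file metadata.
  sealed trait FilterKind
  case object KeepAll extends FilterKind
  case object DropMarkedRows extends FilterKind

  final case class DvScanInfo(
      hasDeletionVector: Boolean,
      filterKind: FilterKind,
      cardinality: Long,
      serializedDv: Array[Byte],
      otherMetadata: Map[String, Any])

  // Hypothetical metadata keys; the real keys are defined by Delta.
  val FilterKindKey = "dv_filter_kind"
  val CardinalityKey = "dv_cardinality"
  val PayloadKey = "dv_payload"
  private val dvKeys = Set(FilterKindKey, CardinalityKey, PayloadKey)

  def extract(metadata: Map[String, Any]): Either[String, DvScanInfo] = {
    val present = dvKeys.filter(metadata.contains)
    // Everything that is not DV-related is passed through as "other" metadata.
    val other = metadata -- dvKeys
    if (present.isEmpty) {
      // No DV metadata at all: keep every row.
      Right(DvScanInfo(
        hasDeletionVector = false,
        filterKind = KeepAll,
        cardinality = 0L,
        serializedDv = Array.emptyByteArray,
        otherMetadata = other))
    } else if (present.size < dvKeys.size) {
      // Only some DV keys are set: reject instead of guessing the rest.
      val missing = (dvKeys -- present).toSeq.sorted.mkString(", ")
      Left(s"partial DV metadata, missing: $missing")
    } else {
      (metadata(FilterKindKey), metadata(CardinalityKey), metadata(PayloadKey)) match {
        case (kind: FilterKind, cardinality: Long, payload: Array[Byte]) =>
          Right(DvScanInfo(
            hasDeletionVector = true,
            filterKind = kind,
            cardinality = cardinality,
            serializedDv = payload,
            otherMetadata = other))
        case _ =>
          Left("DV metadata has unexpected value types")
      }
    }
  }
}
```

The sketch reports invalid input as a `Left` value purely for illustration; how the actual utility signals partial DV metadata is not shown in this description.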

This PR is intentionally utility-only:

  • no Substrait proto changes
  • no native/C++ changes
  • no Delta scan rule replacement
  • no end-to-end scan offload behavior change yet

Those pieces stay in follow-up PRs after this API is reviewed.

How was this patch tested?

Validation run:

  • JAVA_HOME=$(/usr/libexec/java_home -v 17) ./build/mvn test-compile -pl backends-velox -am -Pjava-17,spark-3.5,backends-velox,hadoop-3.3,spark-ut,delta -DskipTests
  • JAVA_HOME=$(/usr/libexec/java_home -v 17) ./build/mvn test-compile -pl backends-velox -am -Pjava-17,spark-4.0,scala-2.13,backends-velox,hadoop-3.3,spark-ut,delta -DskipTests
  • git diff --check

I also attempted to run the focused suite with dev/run-scala-test.sh, but the local runner failed during classpath compilation while switching profiles, before the suite itself executed. The module-level Spark 3.5 and Spark 4.0 test-compile checks above pass.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: IBM BOB

Member

@zhztheplayer zhztheplayer left a comment

Thanks for the work.

Comment on lines +45 to +46
    descriptor: Option[DeletionVectorDescriptor],
    serializedDeletionVector: Array[Byte]) {

Member

As we are passing a serialized deletion vector to C++, why do we need to preserve the original DeletionVectorDescriptor alongside?

Contributor Author

Good point, @zhztheplayer. I removed DeletionVectorDescriptor from the exported scan info and now keep it only locally while materializing the serialized DV payload.
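
A minimal sketch of that shape, under stated assumptions: `Descriptor` is a reduced stand-in for Delta's DeletionVectorDescriptor, and `ExportedInfo`, `toExported`, and the `load` callback are hypothetical names invented for illustration. The point is only that the descriptor is consumed inside the function and never appears in the returned value.

```scala
// Illustrative only: these types and names are assumptions, not the PR's code.
object LocalDescriptorSketch {
  // Reduced stand-in for Delta's DeletionVectorDescriptor.
  final case class Descriptor(location: String, cardinality: Long)

  // What leaves the utility: no descriptor, only what the native side needs.
  final case class ExportedInfo(cardinality: Long, serializedDeletionVector: Array[Byte])

  // `load` is a hypothetical callback that turns a descriptor into payload bytes.
  def toExported(
      descriptor: Option[Descriptor],
      load: Descriptor => Array[Byte]): ExportedInfo =
    descriptor match {
      // The descriptor is used here and then dropped.
      case Some(d) => ExportedInfo(d.cardinality, load(d))
      // No descriptor: nothing to load, empty payload.
      case None => ExportedInfo(0L, Array.emptyByteArray)
    }
}
```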

Member

@zhztheplayer zhztheplayer left a comment

+1 % a minor nit.

Comment on lines +43 to +47
    final case class DeletionVectorInfo(
        rowIndexFilterType: RowIndexFilterType,
        hasDeletionVector: Boolean,
        cardinality: Long,
        serializedDeletionVector: Array[Byte])

Member

nit: can we reorder to place hasDeletionVector as the first field?

Contributor Author

Thanks @zhztheplayer! Will reorder hasDeletionVector first in both Delta 3.3 and Delta 4.0.
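
For reference, the reordered class would look roughly like the sketch below. The field names and types come from the snippet under review; `RowIndexFilterType` is replaced by a local stand-in and `KeepAll` is a hypothetical value so that the example compiles on its own. Call sites that construct the class with named arguments are unaffected by the reorder, while positional call sites need updating.

```scala
// Sketch of the reordered shape; RowIndexFilterType and KeepAll are local stand-ins.
object ReorderedInfoSketch {
  sealed trait RowIndexFilterType
  case object KeepAll extends RowIndexFilterType // hypothetical value

  final case class DeletionVectorInfo(
      hasDeletionVector: Boolean, // now the first field
      rowIndexFilterType: RowIndexFilterType,
      cardinality: Long,
      serializedDeletionVector: Array[Byte])

  // Named arguments can appear in any order, so this call compiles
  // both before and after the reorder.
  val noDv: DeletionVectorInfo = DeletionVectorInfo(
    rowIndexFilterType = KeepAll,
    hasDeletionVector = false,
    cardinality = 0L,
    serializedDeletionVector = Array.emptyByteArray)
}
```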

@zhztheplayer zhztheplayer merged commit 97f06b4 into apache:main Jun 3, 2026
66 of 68 checks passed