Description
Is your feature request related to a problem or challenge?
Currently, iceberg-rust doesn't provide a way to see the changes between two snapshots. In Spark, through the Iceberg Java implementation, this is done using create_changelog_view. This is very useful for doing change data capture on top of Iceberg tables.
Describe the solution you'd like
The output of Spark's create_changelog_view, in default mode, shows each row's user-defined columns, with the addition of 3 metadata columns (_change_type, _change_ordinal, _commit_snapshot_id).
The Java code does this incrementally, meaning that only the data between the optional start and end timestamps (or commit IDs) is processed. Here are some references:
openChangelogScanTask in https://github.com/apache/iceberg/blob/efbfb7ef9addeb33e72208c927936e50b92d3357/spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/ChangelogRowReader.java
doPlanFiles in https://github.com/apache/iceberg/blob/6ec3de390d3fa6e797c6975b1eaaea41719db0fe/core/src/main/java/org/apache/iceberg/BaseIncrementalChangelogScan.java
BaseAddedRowsScanTask and BaseDeletedDataFileScanTask. BaseDeletedRowsScanTask is unused, which means that, for the changelog scan, Spark doesn't support row-level deletes, only copy-on-write deletes. It would be good if Rust supported row-level deletes as well; I see no particular reason why this wasn't supported in Spark.
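To make the proposal more concrete, here is a minimal sketch of the incremental idea for the copy-on-write case only. It does not use any iceberg-rust API: Snapshot, DataFile, ChangelogRow and changelog are hypothetical toy types and names invented for illustration, a "file" is just an in-memory list of rows, and the ordinal is assumed to be the 0-based position of the snapshot within the scanned range. It also makes no attempt to drop carry-over rows or to handle row-level deletes.

```rust
/// Mirrors the `_change_type` metadata column.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ChangeType {
    Insert,
    Delete,
}

/// Toy stand-in for a data file: just the rows it holds, as (id, name).
#[derive(Debug, Clone)]
pub struct DataFile {
    pub rows: Vec<(i64, String)>,
}

/// Toy stand-in for a snapshot: the files it added and the files it removed.
#[derive(Debug, Clone)]
pub struct Snapshot {
    pub snapshot_id: i64,
    pub added_files: Vec<DataFile>,
    pub deleted_files: Vec<DataFile>,
}

/// One changelog row: user columns plus the three metadata columns.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ChangelogRow {
    pub id: i64,
    pub name: String,
    pub change_type: ChangeType,
    pub change_ordinal: i32,
    pub commit_snapshot_id: i64,
}

/// Emits the changes committed after `start_exclusive` (or from the beginning
/// of `history` if `None`) up to and including `end_inclusive`.
/// Assumes `history` is ordered oldest to newest.
/// Returns `None` if a snapshot ID is unknown or the range is reversed.
pub fn changelog(
    history: &[Snapshot],
    start_exclusive: Option<i64>,
    end_inclusive: i64,
) -> Option<Vec<ChangelogRow>> {
    let pos = |id: i64| history.iter().position(|s| s.snapshot_id == id);
    let first = match start_exclusive {
        Some(id) => pos(id)? + 1,
        None => 0,
    };
    let last = pos(end_inclusive)?;

    let mut out = Vec::new();
    // Only snapshots inside the requested range are visited.
    for (ordinal, snap) in history.get(first..=last)?.iter().enumerate() {
        // Rows of removed files become DELETEs, rows of added files INSERTs.
        let changes = [
            (ChangeType::Delete, &snap.deleted_files),
            (ChangeType::Insert, &snap.added_files),
        ];
        for (change_type, files) in changes {
            for (id, name) in files.iter().flat_map(|f| f.rows.iter()) {
                out.push(ChangelogRow {
                    id: *id,
                    name: name.clone(),
                    change_type,
                    change_ordinal: ordinal as i32,
                    commit_snapshot_id: snap.snapshot_id,
                });
            }
        }
    }
    Some(out)
}
```

In this sketch, a copy-on-write rewrite of a file shows up as a DELETE for every row of the old file and an INSERT for every row of the new one, so unchanged rows appear twice. A real implementation would plan scan tasks from manifests instead of holding rows in memory.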
create_changelog_view has several options. We probably don't have to support them all in Rust immediately and can add them over time.
Willingness to contribute
I would be willing to contribute to this feature with guidance from the Iceberg Rust community