
Support (incremental) changelog scan for Change Data Capture use-cases #1636

@vustef


Is your feature request related to a problem or challenge?

Currently, iceberg-rust doesn't provide a way to see the changes between two snapshots. In Spark, which relies on the Iceberg Java implementation, this is done with the create_changelog_view procedure. This is very useful for change data capture (CDC) on top of Iceberg tables.

Describe the solution you'd like

The output for Spark's create_changelog_view, in default mode, is something like this:

[Image: example create_changelog_view output]

where each row shows its user-defined columns, plus three metadata columns (_change_type, _change_ordinal, _commit_snapshot_id).
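To make the row shape concrete, here is a minimal sketch of how such a row could be modeled in Rust. The names `ChangeType` and `ChangelogRow` are hypothetical and are not part of iceberg-rust today. The four change type strings follow the values Iceberg's Java implementation uses for `_change_type`; as far as I understand, only INSERT and DELETE show up in the default mode, and the update variants appear only when updates are computed.

```rust
/// Kind of change a changelog row represents.
/// Hypothetical type; the string values mirror what Spark writes into `_change_type`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ChangeType {
    Insert,
    Delete,
    UpdateBefore,
    UpdateAfter,
}

impl ChangeType {
    /// Returns the value as it would appear in the `_change_type` column.
    pub fn as_str(self) -> &'static str {
        match self {
            ChangeType::Insert => "INSERT",
            ChangeType::Delete => "DELETE",
            ChangeType::UpdateBefore => "UPDATE_BEFORE",
            ChangeType::UpdateAfter => "UPDATE_AFTER",
        }
    }
}

/// Hypothetical row wrapper: the user-defined columns (`data`) plus the
/// three metadata columns described above.
#[derive(Debug, Clone, PartialEq)]
pub struct ChangelogRow<T> {
    /// User-defined columns of the table.
    pub data: T,
    /// `_change_type`
    pub change_type: ChangeType,
    /// `_change_ordinal`: order of the change within the scanned range.
    pub change_ordinal: i32,
    /// `_commit_snapshot_id`: snapshot in which the change was committed.
    pub commit_snapshot_id: i64,
}
```

In a real implementation the metadata columns would more likely be appended to the Arrow record batches returned by the scan than wrapped in a generic struct; the struct is only meant to spell out the three columns and their types.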

The Java code does this incrementally, meaning that only the data between the optional start and end timestamps (or snapshot IDs) is processed. Here are some references:
openChangelogScanTask in https://github.com/apache/iceberg/blob/efbfb7ef9addeb33e72208c927936e50b92d3357/spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/ChangelogRowReader.java
doPlanFiles in https://github.com/apache/iceberg/blob/6ec3de390d3fa6e797c6975b1eaaea41719db0fe/core/src/main/java/org/apache/iceberg/BaseIncrementalChangelogScan.java
BaseAddedRowsScanTask and BaseDeletedDataFileScanTask. BaseDeletedRowsScanTask is unused, which means that for the changelog scan Spark supports only copy-on-write style deletes, not row-level deletes. It would be good if Rust supported row-level deletes as well; I see no particular reason why Spark doesn't.
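The incremental planning described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the Java algorithm and not an existing iceberg-rust API: `SnapshotSummary`, `ChangelogScanTask` and `plan_changelog` are hypothetical names, each snapshot is assumed to list its added and deleted data files directly (real snapshots reference manifests), and row-level delete files are left out. The start snapshot is treated as exclusive and the end snapshot as inclusive, and ordinals start at 0 for the first snapshot in the range.

```rust
/// Hypothetical, simplified view of a snapshot: only the data files it
/// added and removed. Real Iceberg snapshots point at manifest lists.
#[derive(Debug, Clone)]
pub struct SnapshotSummary {
    pub snapshot_id: i64,
    pub added_data_files: Vec<String>,
    pub deleted_data_files: Vec<String>,
}

/// Hypothetical task types, named after the two Java tasks that are in use.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ChangelogScanTask {
    /// All rows in `file` are emitted as inserts.
    AddedRows { file: String, change_ordinal: i32, commit_snapshot_id: i64 },
    /// All rows in `file` are emitted as deletes (copy-on-write style).
    DeletedDataFile { file: String, change_ordinal: i32, commit_snapshot_id: i64 },
}

/// Plans changelog tasks for the snapshots after `from_exclusive` up to and
/// including `to_inclusive`. `history` must be ordered oldest to newest.
/// Returns `None` if a snapshot ID is unknown or the range is inverted.
pub fn plan_changelog(
    history: &[SnapshotSummary],
    from_exclusive: Option<i64>,
    to_inclusive: i64,
) -> Option<Vec<ChangelogScanTask>> {
    let position = |id: i64| history.iter().position(|s| s.snapshot_id == id);
    let start = match from_exclusive {
        Some(id) => position(id)? + 1,
        None => 0,
    };
    let end = position(to_inclusive)?;

    // Only snapshots inside the range are visited; older ones are never read.
    let range = history.get(start..=end)?;

    let mut tasks = Vec::new();
    for (ordinal, snapshot) in range.iter().enumerate() {
        let change_ordinal = ordinal as i32;
        // Emitting deletes before inserts within a snapshot is an arbitrary
        // choice made for this sketch.
        for file in &snapshot.deleted_data_files {
            tasks.push(ChangelogScanTask::DeletedDataFile {
                file: file.clone(),
                change_ordinal,
                commit_snapshot_id: snapshot.snapshot_id,
            });
        }
        for file in &snapshot.added_data_files {
            tasks.push(ChangelogScanTask::AddedRows {
                file: file.clone(),
                change_ordinal,
                commit_snapshot_id: snapshot.snapshot_id,
            });
        }
    }
    Some(tasks)
}
```

Supporting row-level deletes would presumably mean adding a third task variant that pairs a data file with the delete files committed in that snapshot, which is the part BaseDeletedRowsScanTask was meant to cover.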

create_changelog_view has several options. We probably don't have to support all of them in Rust immediately, and can add them over time.

Willingness to contribute

I would be willing to contribute to this feature with guidance from the Iceberg Rust community


Labels: enhancement (New feature or request)
