feat(core): Add incremental scan for appends and positional deletes #3
license-eye has checked 363 files.
| Valid | Invalid | Ignored | Fixed |
|---|---|---|---|
| 296 | 2 | 65 | 0 |
Click to see the invalid file list
- crates/playground/Cargo.toml
- crates/playground/src/main.rs
Use this command to fix any missing license headers
```bash
docker run -it --rm -v $(pwd):/github/workspace apache/skywalking-eyes header fix
```
license-eye has checked 366 files.
| Valid | Invalid | Ignored | Fixed |
|---|---|---|---|
| 299 | 2 | 65 | 0 |
Click to see the invalid file list
- crates/playground/Cargo.toml
- crates/playground/src/main.rs
Use this command to fix any missing license headers
```bash
docker run -it --rm -v $(pwd):/github/workspace apache/skywalking-eyes header fix
```
license-eye has checked 367 files.
| Valid | Invalid | Ignored | Fixed |
|---|---|---|---|
| 299 | 3 | 65 | 0 |
Click to see the invalid file list
- crates/iceberg/src/scan/incremental/tests.rs
- crates/playground/Cargo.toml
- crates/playground/src/main.rs
Use this command to fix any missing license headers
```bash
docker run -it --rm -v $(pwd):/github/workspace apache/skywalking-eyes header fix
```
…eberg-rust into gb/incremental-bootstrap
Thanks Gerald, this looks good to me; just a few last follow-ups.
    })?
    .clone();

    // TODO: What properties do we need to verify about the snapshots? What about
Let's remove this comment and track it separately elsewhere (Jira or a Google Doc with open questions).
Co-authored-by: Vukasin Stefanovic <[email protected]>
…eberg-rust into gb/incremental-bootstrap
* WIP, initial draft of incremental scan
* .
* .
* cargo fmt
* Implement unzipped stream
* Remove printlns
* Add API method for unzipped stream
* .
* Remove comment
* Rename var
* Add import
* Measure time
* Fix typo
* Undo some changes
* Change type name
* Add comment header
* Fail when encountering equality deletes
* Add comments
* Add some preliminary tests
* Format
* Remove playground
* Add more tests
* Clippy
* .
* .
* Adapt tests
* .
* Add test
* Add tests
* Add tests
* Format
* Add test
* Format
* .
* Rm newline
* Rename trait function
* Reuse schema
* .
* remove clone
* Add test for adding file_path column
* Make `from_snapshot` mandatory
* Error out if incremental scan encounters neither Append nor Delete
* .
* Add materialized variant of add_file_path_column
* .
* Allow dead code
* Some PR comments
* .
* More PR comments
* .
* Add comments
* Avoid cloning
* Add reference to PR
* Some PR comments
* .
* format
* Allow overwrite operation for now
* Fix file_path column
* Add overwrite test
* Unwrap delete vector
* .
* Add assertion
* Avoid cloning the mutex guard
* Abort when encountering a deleted delete file
* Adjust comment
* Update crates/iceberg/src/arrow/reader.rs
  Co-authored-by: Vukasin Stefanovic <[email protected]>
* Add check
* Update crates/iceberg/src/scan/incremental/mod.rs
---------
Co-authored-by: Vukasin Stefanovic <[email protected]>
FFI and Julia bindings for the incremental scan changes introduced in RelationalAI/iceberg-rust#3. I refactored the file structure a bit for both the Rust and the Julia code. The Rust code reuses some common parts through macros; for Julia I didn't bother (mostly for fear that the interaction between macros and ccall would be a rabbit hole with little benefit). Note that I had to use a struct instead of a const alias for `ScanRef`: with the additional scan type we now have method overloads, and const aliases would all resolve to the same type, turning those overloads into overwrites. There is also new test data and a new test that exercises positional deletes and inserts.
---------
Co-authored-by: Gerald Berger <[email protected]>
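The alias-versus-struct point above has a rough Rust analogue, sketched here for illustration only. The original change is in Julia, where the second definition silently overwrites the first; in Rust the same mistake is a compile error instead. `IncrementalScanRef` and the `Describe` trait are hypothetical names and do not come from the bindings.

```rust
use std::ffi::c_void;

// With plain aliases both names denote the same type, so a second impl
// would be rejected as a conflicting implementation:
//
//     type ScanRef = *mut c_void;
//     type IncrementalScanRef = *mut c_void;
//     impl Describe for ScanRef { ... }
//     impl Describe for IncrementalScanRef { ... } // error[E0119]

/// Hypothetical opaque handle to a full table scan.
#[derive(Clone, Copy)]
struct ScanRef(*mut c_void);

/// Hypothetical opaque handle to an incremental scan.
#[derive(Clone, Copy)]
struct IncrementalScanRef(*mut c_void);

/// Hypothetical trait standing in for the overloaded methods.
trait Describe {
    fn describe(&self) -> &'static str;
}

// Distinct wrapper types give each handle its own implementation.
impl Describe for ScanRef {
    fn describe(&self) -> &'static str {
        "full scan"
    }
}

impl Describe for IncrementalScanRef {
    fn describe(&self) -> &'static str {
        "incremental scan"
    }
}
```

Wrapping the raw pointer in a distinct type costs nothing at runtime and keeps dispatch unambiguous, which is presumably the same motivation as in the Julia code.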
Closes RAI-43289.
Closes RAI-43292.
Incremental Scan Implementation
Summary
This PR introduces Incremental Scan functionality to the Iceberg Rust implementation, enabling efficient querying of changes between table snapshots. Incremental scans return the net changes (appends and deletes) between two snapshots, which is essential for incremental data processing workflows, change data capture (CDC), and efficient data pipeline operations.
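As a rough illustration of what "net changes" can mean, the sketch below folds a sequence of appends and positional deletes into the rows that survive and the rows that were removed from pre-existing files. It is a simplified model under stated assumptions, not the PR's implementation: `Change`, `NetChanges`, and `net_changes` are hypothetical names, and treating a row that is appended and deleted within the range as cancelling out is an assumption about the intended semantics.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical, simplified model of a change recorded by a snapshot.
#[derive(Debug, Clone)]
enum Change {
    /// A data file with `rows` rows was appended.
    Append { file: String, rows: u64 },
    /// Positional delete: row `pos` of `file` was removed.
    PositionalDelete { file: String, pos: u64 },
}

/// Net effect of all changes in the scanned snapshot range.
#[derive(Debug, Default, PartialEq)]
struct NetChanges {
    /// Surviving row positions of files appended inside the range.
    appended: BTreeMap<String, BTreeSet<u64>>,
    /// Deleted positions in files that already existed before the range.
    deleted: BTreeMap<String, BTreeSet<u64>>,
}

/// Folds the changes, in snapshot order, into their net effect.
fn net_changes(changes: &[Change]) -> NetChanges {
    let mut net = NetChanges::default();
    for change in changes {
        match change {
            Change::Append { file, rows } => {
                // Every row of a newly appended file starts out live.
                net.appended.insert(file.clone(), (0..*rows).collect());
            }
            Change::PositionalDelete { file, pos } => {
                if let Some(live) = net.appended.get_mut(file) {
                    // Appended and deleted within the range: cancels out.
                    live.remove(pos);
                } else {
                    // The file predates the range: report a net delete.
                    net.deleted.entry(file.clone()).or_default().insert(*pos);
                }
            }
        }
    }
    net
}
```

For example, appending a three-row `a.parquet`, deleting its row 1, and deleting row 7 of an older `old.parquet` leaves rows 0 and 2 of `a.parquet` as net appends and position 7 of `old.parquet` as a net delete.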
Key Features
Incremental Scan API
- `IncrementalScan` builder with fluent API for configuring scans between snapshots
- `.select()` for efficient data retrieval
- `.with_batch_size()` for memory optimization
File Path Tracking
- `_file` column to all delete record batches containing the source parquet file path
- `RunEndEncoded` arrays for memory-efficient file path storage in non-empty batches
- […] `-2048`)
Net Change Computation
Implementation Details
Core Components
IncrementalScanBuilder (scan/incremental/mod.rs)
- `ArrowReader` infrastructure
Streaming Implementation (arrow/incremental.rs)
- `StreamsInto` trait with `.stream()` method for converting scan tasks to Arrow record streams
File Path Column Addition (arrow/reader.rs)
- `add_file_path_column()` function adds `_file` column to record batches
Restrictions
Testing
Comprehensive test suite added in scan/incremental/tests.rs:
Test Fixture
- `IncrementalTestFixture` - Helper for creating test tables with controlled snapshots
- `Add` operations with custom file names and data
- `Delete` operations with position and file tracking
- `verify_incremental_scan()` for asserting expected results
Test Coverage
- `test_incremental_fixture_simple` - Basic append and delete operations
- `test_incremental_fixture_complex` - Multiple snapshots with overlapping operations
- `test_incremental_scan_edge_cases` - Edge cases across 7 snapshots and 3 data files
- `test_incremental_scan_builder_options` - Builder API functionality
- […] `.select()`)
- `test_add_file_path_column` - Unit tests for file path column addition
All tests passing: ✅ 4 incremental scan tests, ✅ 3 file path column tests
API Example