
[branch-0.9] Cherry pick feat(reader): Add read_with_metrics() for scan I/O metrics (#2349) #15

Open
toutane wants to merge 1 commit into branch-0.9 from branch-0.9-cherry-pick-6

Conversation


toutane commented May 5, 2026

Cherry pick: apache#2349

CI is disabled for this fork, so testing was performed with `make check` and `make test` in the provided testing environment.

## Which issue does this PR close?

- Closes #.

## What changes are included in this PR?

Add always-on per-scan I/O metrics to `ArrowReader`.

**Motivation:** Downstream engines need per-scan byte counts for their
UIs. For example, DataFusion Comet uses this to populate `bytes_scanned`
on its Iceberg scan operator, which flows through to Spark UI via
`TaskMetrics.inputMetrics.setBytesRead()`. This must be per-scan, not
global. Concurrent scans against the same `FileIO` need independent
counters. The approach matches DataFusion's pattern of wrapping
`AsyncFileReader` with a counting layer and is storage-backend agnostic.

**`ArrowReader::read()` now returns `ScanResult`**
- `ScanResult` wraps the record batch stream and `ScanMetrics`.
Accessors: `stream()`, `metrics()`, `into_parts()`.
- Metrics are always collected; the cost is one `fetch_add(Relaxed)` per
  I/O request, which is negligible.
- Counter is created fresh per `read()` call, so cloned readers get
independent metrics.
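
A minimal usage sketch of the shape described above; the exact signatures (and whether `read()` still consumes the reader) are assumptions, not the merged code:

```rust
use futures::TryStreamExt;
use iceberg::Result;
use iceberg::arrow::ArrowReader;
use iceberg::scan::FileScanTaskStream;

// Hypothetical driver: consume a scan and report its I/O via the
// accessors listed above.
async fn count_scan_bytes(reader: ArrowReader, tasks: FileScanTaskStream) -> Result<u64> {
    let result = reader.read(tasks)?; // now returns ScanResult, not a bare stream
    let (stream, metrics) = result.into_parts();
    assert_eq!(metrics.bytes_read(), 0); // the stream is lazy: no I/O yet
    stream.try_collect::<Vec<_>>().await?; // drive the scan to completion
    Ok(metrics.bytes_read()) // total bytes fetched for this scan
}
```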

**New file: `crates/iceberg/src/arrow/scan_metrics.rs`**
- `CountingFileRead<F: FileRead>`: generic wrapper that increments a
shared `AtomicU64` on each `read()`.
- `ScanMetrics`: public handle exposing `bytes_read()`.
- `ScanResult`: public struct returned by `ArrowReader::read()`.
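
A plausible shape for the wrapper, assuming the `async_trait`-style `FileRead` trait that `iceberg::io` exposes (`read(&self, range) -> Result<Bytes>`); this is a sketch, not the file's contents:

```rust
use std::ops::Range;
use std::sync::Arc;
use std::sync::atomic::{AtomicU64, Ordering};

use bytes::Bytes;
use iceberg::Result;
use iceberg::io::FileRead;

// Generic counting wrapper: delegates the read, then bumps a shared counter.
struct CountingFileRead<F: FileRead> {
    inner: F,
    bytes_read: Arc<AtomicU64>,
}

#[async_trait::async_trait]
impl<F: FileRead> FileRead for CountingFileRead<F> {
    async fn read(&self, range: Range<u64>) -> Result<Bytes> {
        let bytes = self.inner.read(range).await?;
        // One relaxed atomic add per I/O request, as noted above.
        self.bytes_read.fetch_add(bytes.len() as u64, Ordering::Relaxed);
        Ok(bytes)
    }
}
```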

**`FileRead` blanket impl for `Box<dyn FileRead>`**
- Enables generic `CountingFileRead<F>` to wrap the boxed reader
returned by `FileIO::reader()`.
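
The delegation itself is a one-liner; a sketch under the same trait assumptions as above:

```rust
// Forward to the boxed reader so CountingFileRead<Box<dyn FileRead>>
// can wrap whatever FileIO::reader() hands back.
#[async_trait::async_trait]
impl FileRead for Box<dyn FileRead> {
    async fn read(&self, range: Range<u64>) -> Result<Bytes> {
        self.as_ref().read(range).await
    }
}
```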

**Single `open_parquet_file` with counting**
- All Parquet opens (data files and delete files) go through the same
`open_parquet_file` wrapped with `CountingFileRead`, so `bytes_read`
reflects total scan I/O.
- `build_parquet_reader()`: shared internals for reader construction and
metadata loading.
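
An illustrative open path under the assumptions above; the helper name comes from the PR, but the body and the exact `FileIO` calls are mine:

```rust
// Hypothetical unified open: both data and delete files take this path,
// so the shared counter accumulates total scan I/O.
async fn open_parquet_file(
    file_io: &FileIO,
    path: &str,
    bytes_read: Arc<AtomicU64>,
) -> Result<CountingFileRead<Box<dyn FileRead>>> {
    let input = file_io.new_input(path)?;
    let reader: Box<dyn FileRead> = Box::new(input.reader().await?);
    Ok(CountingFileRead { inner: reader, bytes_read })
}
```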

**`FileScanTaskReader` struct (refactor)**
- Extracted `process_file_scan_task`'s parameters into a `Clone` struct
with a `process(self, task)` method, resolving a
`clippy::too_many_arguments` violation. Struct and impl are co-located.
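
The refactor pattern, sketched with assumed field names (the PR's actual fields will track `ArrowReader`'s configuration):

```rust
use std::sync::Arc;
use std::sync::atomic::AtomicU64;

use iceberg::Result;
use iceberg::io::FileIO;
use iceberg::scan::{ArrowRecordBatchStream, FileScanTask};

// Assumed field set: one struct carries what used to be a long
// parameter list, so call sites clone it once per task.
#[derive(Clone)]
struct FileScanTaskReader {
    file_io: FileIO,
    batch_size: Option<usize>,
    row_group_filtering_enabled: bool,
    bytes_read: Arc<AtomicU64>,
}

impl FileScanTaskReader {
    // Taking `self` by value keeps the signature at two inputs,
    // which is what resolves clippy::too_many_arguments.
    async fn process(self, task: FileScanTask) -> Result<ArrowRecordBatchStream> {
        todo!("open a counting Parquet reader via open_parquet_file and stream batches")
    }
}
```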

**Re-exports**
- `ScanMetrics` and `ScanResult` re-exported from `iceberg::arrow` and
`iceberg::scan`.
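
After the re-export, either path should resolve to the same types:

```rust
use iceberg::arrow::{ScanMetrics, ScanResult};
// equivalently: use iceberg::scan::{ScanMetrics, ScanResult};
```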

## Are these changes tested?

`test_scan_metrics_bytes_read` in `reader.rs`: asserts `bytes_read() ==
0` before stream consumption (the stream is lazy) and `bytes_read() > 0`
after. `test_scan_metrics_includes_delete_file_bytes`: reads the same
data file with and without a positional delete file and asserts
`bytes_read` is strictly greater when deletes are present. All existing
reader and scan tests pass (updated to use `ScanResult::stream()`).
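
A condensed sketch of the first test's laziness assertion; the fixture (`build_test_reader_and_tasks` here) is a hypothetical stand-in for the real test harness:

```rust
use futures::TryStreamExt;

#[tokio::test]
async fn scan_metrics_bytes_read_sketch() {
    // Hypothetical fixture: yields an ArrowReader plus a FileScanTaskStream
    // over a small Parquet data file.
    let (reader, tasks) = build_test_reader_and_tasks();
    let (stream, metrics) = reader.read(tasks).unwrap().into_parts();
    assert_eq!(metrics.bytes_read(), 0); // lazy: nothing fetched yet
    let _batches: Vec<_> = stream.try_collect().await.unwrap();
    assert!(metrics.bytes_read() > 0); // I/O counted once consumed
}
```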

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: blackmwk <liurenjie1024@outlook.com>
(cherry picked from commit 1ad4bfd)
