Commit d339391
authored
perf(add_files): stream manifest entries for duplicate-files check (#3287)
Fixes #3286.
## The hot path today
```python
if check_duplicate_files:
import pyarrow.compute as pc
expr = pc.field("file_path").isin(file_paths)
referenced_files = [file["file_path"] for file in self._table.inspect.data_files().filter(expr).to_pylist()]
```
What this actually does, per call:
1. `inspect.data_files()` — for every manifest in the current snapshot,
calls `_get_files_from_manifest` (`pyiceberg/table/inspect.py:548`),
which for each `ManifestEntry` builds a Python dict with ~17 fields. The
expensive ones:
- `readable_metrics` — for **every column** in the table schema, decodes
lower/upper bound bytes via `from_bytes` and packs the result into a
per-column dict. This is the single biggest cost on wide tables.
- `partition` — decodes the partition struct into a name → value dict.
- `column_sizes`, `value_counts`, `null_value_counts`,
`nan_value_counts`, `lower_bounds`, `upper_bounds` — each materialized
as a Python dict per file.
2. The dicts are batched into a `pyarrow.Table` per manifest.
3. `pa.concat_tables` glues all manifests' Tables together.
4. `.filter(expr)` applies an Arrow-compute `isin` over the concatenated
Table.
5. `.to_pylist()` converts back to Python dicts.
6. The list comprehension throws away 16 of the 17 columns and keeps
only `file_path`.
For a backfill that calls `add_files` once per day on a growing table,
per-call cost is O(snapshot file count); cumulative cost is O(N²). After
~15 daily commits on a wide-schema table, dup-check time dominates: each
call takes ~10–15 minutes vs seconds early on.
The workaround in #2132 / docs PR #2249 is
`check_duplicate_files=False`, which trades away the idempotency
guarantee that re-running a partial-failure resume is safe.
## Benchmark — before vs after
`tests/benchmark/bench_add_files_dup_check.py` (added in this PR) runs
10 sequential `add_files(check_duplicate_files=True)` calls on an
`InMemoryCatalog` table with a 30-column schema, 200 small parquet files
per call. Measures wall-clock and `tracemalloc` peak per call. Run on
macOS arm64 / Python 3.11.
**Before (upstream `main`):**
```
batch wall_s tracemalloc_peak_MB cumulative_files
0 1.05 5.5 200
1 1.00 9.2 400
2 1.06 12.7 600
3 1.13 18.5 800
4 1.18 20.2 1000
5 1.26 23.4 1200
6 1.32 30.2 1400
7 1.39 34.0 1600
8 1.46 29.9 1800
9 1.51 39.8 2000
```
Wall climbs ~44%; tracemalloc peak grows ~7.2×.
**After (this PR):**
```
batch wall_s tracemalloc_peak_MB cumulative_files
0 1.05 5.5 200
1 1.56 5.6 400
2 0.96 6.2 600
3 0.97 6.3 800
4 0.98 6.5 1000
5 1.00 6.8 1200
6 1.00 6.6 1400
7 1.03 8.2 1600
8 1.04 7.2 1800
9 1.07 6.9 2000
```
Wall flat at ~1s; tracemalloc peak flat at ~6–8 MB. The growth
disappears because the dup-check no longer materializes per-file dicts /
pyarrow Tables / readable_metrics — it just does set containment on
`file_path` while streaming manifest entries.
This is a 10-batch run on a small, narrow workload. Real backfills with
wider schemas (more columns × more row groups), more files per batch,
and many more batches see the constant factor amplify; the production
workload that motivated this PR was hitting ~10–15 minutes per call
after 15 commits.
## What this PR does
Replace the materialize-then-filter with a streaming scan that reuses
the existing `_open_manifest` helper
(`pyiceberg/table/__init__.py:1918`) — the canonical "open a manifest,
fetch entries with `discard_deleted=True`, apply data-file predicates"
pattern already used by `DataScan.scan_plan_helper` (line 2050). Delete
manifests are skipped at the top level (same shape as
`_min_sequence_number`).
The loop body becomes a `set` containment check on
`data_file.file_path`, scheduled via `executor.map` and flattened with
`chain.from_iterable` — same idiom as the existing scan path.
The same approach Spark's `add_files` action takes: predicate-based
against the new paths only, no pre-scan of all data files.
## What this is and isn't
- **Is**: a constant-factor reduction. The Avro decode of manifest
entries is unchanged (still happens via `fetch_manifest_entry`), but
everything downstream of the read — `readable_metrics` computation,
partition decode, per-file dict construction, pyarrow Table
construction, `concat_tables`, `filter`, `to_pylist` — is gone. That
post-processing was the bulk of the time, not the Avro read.
- **Isn't**: an asymptotic fix. Per-call cost is still O(snapshot file
count) for the manifest entry reads; cumulative backfill cost is still
O(N²). Truly eliminating the linear scan would need `file_path`
lower/upper bounds at the `ManifestFile` level so most manifests can be
pruned without opening — that's a spec extension and a follow-up.
## Compatibility / behavior preservation
Audited the change for any behavioral divergence from the old
`inspect.data_files().filter(...)` path:
- **Public API**: `add_files` signature and exception message unchanged.
Existing integration tests at
`tests/integration/test_add_files.py:test_add_files_that_referenced_by_current_snapshot{,_with_check_duplicate_files_true,_with_check_duplicate_files_false}`
exercise the dup-check contract and assert the exact error string — both
preserved verbatim.
- **Callers**: only `Table.add_files`
(`pyiceberg/table/__init__.py:1491`). No subclass overrides exist (e.g.
`CreateTableTransaction` doesn't redefine it).
`Transaction.upsert`/`append`/`overwrite`, `_FastAppendFiles`,
`MergingSnapshotProducer` don't share the dup-check path.
- **File set scanned**: `inspect.data_files()` filtered per-entry on
`DataFileContent.DATA`; new code filters at `ManifestContent.DATA`.
These are theoretically distinct but produce identical sets per the
Iceberg spec — delete entries cannot live in DATA manifests.
- **`discard_deleted`**: both paths use `True` (`fetch_manifest_entry`
defaults to `True`; `_open_manifest` passes it explicitly).
- **Snapshot scope**: both paths use `current_snapshot()` —
`inspect.data_files()` via `_get_snapshot(None)`, new code directly via
`self.table_metadata.current_snapshot()`.
- **Empty `file_paths`**: same result (empty list) and same exceptions
either way. Slight efficiency regression in this edge case — the new
code still walks data manifests where the old code short-circuited via
`pc.field("file_path").isin([])`. Not user-visible; can be optimized in
a follow-up if anyone cares.
- **Side effects**: both paths are read-only; no manifest cache state
mutation, no transaction state changes.
- **Concurrency**: both submit to the shared
`ExecutorFactory.get_or_create()` thread pool.
- **Branch parameter**: `add_files` accepts a `branch` argument, but the
dup-check has always run against `current_snapshot()` (i.e. main)
regardless. This is a **pre-existing inconsistency**, not introduced by
this PR. Preserved exactly to keep this change behavior-preserving.
## Refs
- Issue: #3286
- Related: #2132 (closed as docs), #2133 (parallelization)1 parent b31b762 commit d339391
2 files changed
Lines changed: 155 additions & 5 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
940 | 940 | | |
941 | 941 | | |
942 | 942 | | |
| 943 | + | |
| 944 | + | |
| 945 | + | |
| 946 | + | |
| 947 | + | |
| 948 | + | |
| 949 | + | |
| 950 | + | |
| 951 | + | |
| 952 | + | |
| 953 | + | |
| 954 | + | |
| 955 | + | |
| 956 | + | |
| 957 | + | |
| 958 | + | |
| 959 | + | |
| 960 | + | |
| 961 | + | |
| 962 | + | |
| 963 | + | |
| 964 | + | |
943 | 965 | | |
944 | 966 | | |
945 | 967 | | |
| |||
962 | 984 | | |
963 | 985 | | |
964 | 986 | | |
965 | | - | |
966 | | - | |
967 | | - | |
968 | | - | |
969 | | - | |
| 987 | + | |
970 | 988 | | |
971 | 989 | | |
972 | 990 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
| 1 | + | |
| 2 | + | |
| 3 | + | |
| 4 | + | |
| 5 | + | |
| 6 | + | |
| 7 | + | |
| 8 | + | |
| 9 | + | |
| 10 | + | |
| 11 | + | |
| 12 | + | |
| 13 | + | |
| 14 | + | |
| 15 | + | |
| 16 | + | |
| 17 | + | |
| 18 | + | |
| 19 | + | |
| 20 | + | |
| 21 | + | |
| 22 | + | |
| 23 | + | |
| 24 | + | |
| 25 | + | |
| 26 | + | |
| 27 | + | |
| 28 | + | |
| 29 | + | |
| 30 | + | |
| 31 | + | |
| 32 | + | |
| 33 | + | |
| 34 | + | |
| 35 | + | |
| 36 | + | |
| 37 | + | |
| 38 | + | |
| 39 | + | |
| 40 | + | |
| 41 | + | |
| 42 | + | |
| 43 | + | |
| 44 | + | |
| 45 | + | |
| 46 | + | |
| 47 | + | |
| 48 | + | |
| 49 | + | |
| 50 | + | |
| 51 | + | |
| 52 | + | |
| 53 | + | |
| 54 | + | |
| 55 | + | |
| 56 | + | |
| 57 | + | |
| 58 | + | |
| 59 | + | |
| 60 | + | |
| 61 | + | |
| 62 | + | |
| 63 | + | |
| 64 | + | |
| 65 | + | |
| 66 | + | |
| 67 | + | |
| 68 | + | |
| 69 | + | |
| 70 | + | |
| 71 | + | |
| 72 | + | |
| 73 | + | |
| 74 | + | |
| 75 | + | |
| 76 | + | |
| 77 | + | |
| 78 | + | |
| 79 | + | |
| 80 | + | |
| 81 | + | |
| 82 | + | |
| 83 | + | |
| 84 | + | |
| 85 | + | |
| 86 | + | |
| 87 | + | |
| 88 | + | |
| 89 | + | |
| 90 | + | |
| 91 | + | |
| 92 | + | |
| 93 | + | |
| 94 | + | |
| 95 | + | |
| 96 | + | |
| 97 | + | |
| 98 | + | |
| 99 | + | |
| 100 | + | |
| 101 | + | |
| 102 | + | |
| 103 | + | |
| 104 | + | |
| 105 | + | |
| 106 | + | |
| 107 | + | |
| 108 | + | |
| 109 | + | |
| 110 | + | |
| 111 | + | |
| 112 | + | |
| 113 | + | |
| 114 | + | |
| 115 | + | |
| 116 | + | |
| 117 | + | |
| 118 | + | |
| 119 | + | |
| 120 | + | |
| 121 | + | |
| 122 | + | |
| 123 | + | |
| 124 | + | |
| 125 | + | |
| 126 | + | |
| 127 | + | |
| 128 | + | |
| 129 | + | |
| 130 | + | |
| 131 | + | |
| 132 | + | |
0 commit comments