Add delete file index to pyiceberg and support equality delete reads #2255

geruh · 2025-07-29T03:01:27Z

Summary

This work was primarily done by @rutb327 while I provided guidance!

This PR adds equality delete read support to PyIceberg by implementing the delete file indexing system that matches delete files to data files, mimicking the behavior found in Iceberg Core. With this implementation we are able to index files and now read equality deletes during table scans.

Design details

Delete File Index

The new DeleteFileIndex class centralizes handling of all delete file types: positional deletes, equality deletes, and deletion vectors. It organizes deletes by type (equality vs. positional), partition (using PartitionMap for spec-aware grouping), and path (for path-specific positional deletes). This enables efficient lookup during table scans, reducing unnecessary delete file processing.
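As a rough illustration of the grouping described above, here is a minimal sketch. It is not the PR's actual code: `DeleteFile`, `ToyDeleteFileIndex`, and all field names are hypothetical stand-ins, and it leaves out the sequence-number filtering that Iceberg also applies when matching deletes to data files.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

PartitionKey = Tuple[int, Tuple]  # (spec_id, partition values)


@dataclass(frozen=True)
class DeleteFile:
    """Hypothetical, simplified stand-in for a delete file entry."""
    path: str
    kind: str  # "equality" or "position"
    spec_id: int
    partition: Tuple
    referenced_data_file: Optional[str] = None


@dataclass
class ToyDeleteFileIndex:
    """Toy index: deletes grouped by type, then by partition or data file path."""
    eq_by_partition: Dict[PartitionKey, List[DeleteFile]] = field(
        default_factory=lambda: defaultdict(list))
    pos_by_partition: Dict[PartitionKey, List[DeleteFile]] = field(
        default_factory=lambda: defaultdict(list))
    pos_by_path: Dict[str, List[DeleteFile]] = field(
        default_factory=lambda: defaultdict(list))

    def add(self, f: DeleteFile) -> None:
        key = (f.spec_id, f.partition)
        if f.kind == "equality":
            self.eq_by_partition[key].append(f)
        elif f.referenced_data_file is not None:
            # Path-specific positional delete: applies to one data file only.
            self.pos_by_path[f.referenced_data_file].append(f)
        else:
            self.pos_by_partition[key].append(f)

    def for_data_file(self, data_path: str, spec_id: int, partition: Tuple) -> List[DeleteFile]:
        # Only deletes in the same partition, or pinned to this path, are returned.
        key = (spec_id, partition)
        return (self.eq_by_partition.get(key, [])
                + self.pos_by_partition.get(key, [])
                + self.pos_by_path.get(data_path, []))


# Usage with made-up file names and partition values.
idx = ToyDeleteFileIndex()
idx.add(DeleteFile("eq-1.parquet", "equality", 0, ("2025-07-29",)))
idx.add(DeleteFile("pos-1.parquet", "position", 0, ("2025-07-30",)))
idx.add(DeleteFile("pos-2.parquet", "position", 0, ("2025-07-29",),
                   referenced_data_file="data-a.parquet"))
matches = [f.path for f in idx.for_data_file("data-a.parquet", 0, ("2025-07-29",))]
```

In this sketch a scan task asks the index only for the deletes relevant to its data file, so `pos-1.parquet` (a different partition) is never opened for `data-a.parquet`.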

Equality Delete support

Equality delete files are loaded as PyArrow Tables together with their equality IDs from the schema. Tables that share the same set of equality IDs are grouped together, which reduces the number of anti-join operations.
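The grouping idea can be sketched in plain Python. This is an assumption-laden illustration, not the PR's code: the PR works on PyArrow Tables and field IDs, whereas this sketch uses lists of dicts and column names, and the helper names `group_delete_rows` and `apply_equality_deletes` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Row = Dict[str, object]


def group_delete_rows(
    delete_files: Iterable[Tuple[Iterable[str], List[Row]]],
) -> Dict[Tuple[str, ...], Set[Tuple]]:
    """Merge delete rows from all files that share the same set of equality columns."""
    grouped: Dict[Tuple[str, ...], Set[Tuple]] = defaultdict(set)
    for columns, rows in delete_files:
        key = tuple(sorted(columns))  # same column set -> same group
        for row in rows:
            grouped[key].add(tuple(row[c] for c in key))
    return grouped


def apply_equality_deletes(data_rows: List[Row], grouped: Dict[Tuple[str, ...], Set[Tuple]]) -> List[Row]:
    """Run one anti-join pass per distinct column set instead of one per delete file."""
    kept = data_rows
    for columns, deleted in grouped.items():
        kept = [r for r in kept if tuple(r[c] for c in columns) not in deleted]
    return kept


# Two delete files, both keyed on "id", removing id=2 and id=5.
data = [{"id": i} for i in range(1, 7)]
delete_files = [(["id"], [{"id": 2}]), (["id"], [{"id": 5}])]
grouped = group_delete_rows(delete_files)
kept_ids = [r["id"] for r in apply_equality_deletes(data, grouped)]
```

Both delete files collapse into a single group here, so the data rows are filtered once rather than twice.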

Testing

Added tests ported from the Iceberg Core DeleteFileIndex test suite, plus some tests with dummy files. Also did some manual testing with a Flink setup.

table_eq with only equality deletes on id=2, id=5
+---+-------+
| id|   data|
+---+-------+
|  1|  Alice|
|  3|Charlie|
|  4|  David|
|  6|  Frank|
+---+-------+

table_eq_pos with equality deletes and positional delete at position 3
+---+-----+
| id| data|
+---+-----+
|  1|Alice|
|  4|David|
|  6|Frank|
+---+-----+

Are there any user-facing changes?

Yes, PyIceberg can now read tables with equality deletes.

gabeiglio · 2025-07-31T13:35:33Z

I noticed that this PR addresses the same issue/feature as the one I was working on here. However, your implementation is more complete (it supports reading equality deletes and deletion vectors), so I think it makes sense to move forward with this one instead. (cc: @sungwy, since you reviewed my PR)

kevinjqliu · 2025-07-31T19:45:30Z

Oops, sorry @gabeiglio, I was searching for positional deletes in GitHub search and I didn't see that you were already working on it in that PR. Looks like there are some parts of that PR that are still super useful to get merged, like the validation.

gabeiglio · 2025-07-31T21:38:35Z

Yeah, exactly. I should have been clearer in my message: my DeleteFileIndex implementation was scope creep to achieve the validation. So now that PR can be only for the validation instead of partition maps, delete file index, etc. :) @kevinjqliu

pyiceberg/io/pyarrow.py

sungwy

Hi @geruh - thanks for working on this PR, and sorry for the delayed review.

I've added some review feedback. Let me know your thoughts!

rutb327 · 2025-08-14T21:15:03Z

@sungwy Thanks a lot! I have done the suggested changes, could you take another look at it?

sungwy

Hi @rutb327 thank you for continuing to work on the PR!

I've added a few more suggestions after spending more time reading your implementation and the test suite. Hope you find them helpful!

pyiceberg/table/delete_file_index.py

tests/table/test_delete_file_index.py

jayceslesar · 2025-09-22T21:06:20Z

pyiceberg/table/delete_file_index.py

+        if self.dv:
+            if not self.dv_sorted:
+                self.dv_values = sorted(self.dv.values(), key=lambda x: x[1])
+                self.dv_sorted = True


Instead of needing to track whether we are sorted here is there a better data structure that just is sorted that we could use?

We went with the lazy sort pattern here to follow the Java implementation. This allows us to add each file in O(1) and then sort once in O(n log n). We could technically use SortedList from sortedcontainers for roughly the same performance. WDYT?
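For readers unfamiliar with the lazy sort pattern mentioned in this reply, a minimal standalone sketch follows. The class name `LazySortedFiles` and the choice of a `(sequence_number, path)` tuple as the stored item are assumptions made for illustration; the snippet under review sorts `self.dv.values()` on its own key.

```python
from typing import List, Tuple


class LazySortedFiles:
    """Append in O(1); sort once, on first read after a change."""

    def __init__(self) -> None:
        self._items: List[Tuple[int, str]] = []
        self._sorted = True  # an empty list is trivially sorted

    def add(self, sequence_number: int, path: str) -> None:
        self._items.append((sequence_number, path))
        self._sorted = False  # invalidate; defer the sort until someone reads

    def items(self) -> List[Tuple[int, str]]:
        if not self._sorted:
            # One O(n log n) sort covers every add since the last read.
            self._items.sort(key=lambda item: item[0])
            self._sorted = True
        return list(self._items)


files = LazySortedFiles()
files.add(3, "c.puffin")
files.add(1, "a.puffin")
files.add(2, "b.puffin")
ordered = files.items()
```

The trade-off against an always-sorted container is that each insertion stays cheap, at the cost of carrying the extra `_sorted` flag the reviewer asked about.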

jayceslesar · 2025-09-23T17:13:24Z

Also I will test, but will this DeleteFileIndex class properly map deletes to data files? In the current implementation pyiceberg maps dangling delete files

KazuhitoT · 2025-11-20T11:55:00Z

Hi, I hope this PR can be included in 0.11.0.

In the current implementation pyiceberg maps dangling delete files

It seems that position delete files with referenced_data_file set (and no bounds) might not be covered by this logic.
I tried a small change in my fork here:
KazuhitoT@b0965d0

If this approach makes sense, would you consider adding a similar change to this PR?

Co-authored-by: Sung Yun <[email protected]>