I've created a proof of concept demonstrating that PyIceberg already supports writing equality delete files via transactions, even though the read path is not yet implemented.
What I Discovered
PyIceberg's internals already include the pieces needed on the write side:
- The DataFileContent.EQUALITY_DELETES enum value
- The equality_ids field on DataFile
- Snapshot tracking for equality deletes
- Manifest serialization
with table.transaction() as txn:
    update_snapshot = txn.update_snapshot()
    with update_snapshot.fast_append() as append_files:
        append_files.append_data_file(delete_file)  # Works for delete files!
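The `delete_file` passed in above must be a DataFile describing an equality delete. Exact constructor signatures vary across PyIceberg versions, so here is a minimal stand-in sketch using plain dataclasses rather than the real API; the field names mirror the metadata fields listed above, the enum values follow the Iceberg spec's content codes, and the path and field IDs are purely illustrative.

```python
from dataclasses import dataclass, field
from enum import Enum


class DataFileContent(Enum):
    # Content codes as defined by the Iceberg table spec
    DATA = 0
    POSITION_DELETES = 1
    EQUALITY_DELETES = 2


@dataclass
class DeleteFileSketch:
    """Simplified stand-in for the fields an equality delete DataFile carries."""
    file_path: str
    content: DataFileContent
    record_count: int
    file_size_in_bytes: int
    # Field IDs of the columns that define equality: a table row is deleted
    # when its values on these columns match any row in this file.
    equality_ids: list = field(default_factory=list)


# Hypothetical example values, not taken from the proof of concept
delete_file = DeleteFileSketch(
    file_path="s3://bucket/warehouse/db/tbl/data/eq-delete-00001.parquet",
    content=DataFileContent.EQUALITY_DELETES,
    record_count=10,
    file_size_in_bytes=1024,
    equality_ids=[1],  # e.g. the field ID of an "id" column
)
```

In the real write path, the actual pyiceberg DataFile carries many more statistics fields; the point here is only which fields mark a file as an equality delete.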
Files Created
- Single equality delete file
- Multiple delete files with different equality_ids

Scenarios covered by the tests:
- Basic usage (single column)
- Composite keys (multiple columns)
- Multiple delete files in one transaction
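For context on what these scenarios mean at read time: an engine applying an equality delete drops every table row whose values on the equality_ids columns match a row in the delete file. A minimal sketch of that matching rule, with illustrative column names standing in for field IDs:

```python
def apply_equality_deletes(rows, delete_rows, key_columns):
    """Drop every row whose values on key_columns match a row in delete_rows.

    rows and delete_rows are lists of dicts; key_columns are column names
    standing in for the field IDs that equality_ids would reference.
    """
    deleted_keys = {tuple(d[c] for c in key_columns) for d in delete_rows}
    return [r for r in rows
            if tuple(r[c] for c in key_columns) not in deleted_keys]


data = [
    {"id": 1, "region": "eu", "v": 10},
    {"id": 2, "region": "us", "v": 20},
    {"id": 2, "region": "eu", "v": 30},
]

# Single-column key: removes every row with id == 2
single = apply_equality_deletes(data, [{"id": 2}], ["id"])

# Composite key: removes only the (id=2, region="us") row
composite = apply_equality_deletes(
    data, [{"id": 2, "region": "us"}], ["id", "region"]
)
```

This is only the logical semantics; PyIceberg does not yet implement this on the read path, which is why another engine is needed to query the table.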
Test Results
All tests pass:
test_add_equality_delete.py::test_add_equality_delete_file_via_transaction PASSED
test_add_equality_delete.py::test_add_multiple_equality_delete_files_with_different_equality_ids PASSED
====== 2 passed in 1.06s ======
Key Takeaways
The write path is production-ready. Users who generate equality delete files externally can add them to PyIceberg tables now, though they'll need other tools (like Spark) to read those tables.