Iceberg-rust Delete support #735
Comments
I'd like to help with this. I will send a PR about Strict projection later.

I would also like to look into this; I will probably be working on the Strict Metrics Evaluator.
liurenjie1024 pushed a commit that referenced this issue on Feb 26, 2025: part of #735. Added `StrictMetricsEvaluator`. Co-authored-by: Fokko Driesprong <[email protected]>
liurenjie1024 added a commit that referenced this issue on Apr 15, 2025: This PR is part of #735. The implementation refers to pyiceberg; most of the tests were migrated from https://github.com/apache/iceberg-python/blob/main/tests/test_transforms.py#L997. Co-authored-by: ZENOTME <[email protected]>, Fokko Driesprong <[email protected]>, and Renjie Liu <[email protected]>
For the deletes, we need a broader discussion on where the responsibilities lie between iceberg-rust and the query engine.
On the read side, Tasks are passed to the query engine. I like this nice and clean boundary between the engine and the library, and I would love to arrive at a similar API for deletes. As on the read path, the library would come up with a set of tasks that are passed back to the query engine, which writes out the files and returns the DataFiles with all their statistics.
The current focus of #700 is adding DataFiles, which is reasonable for engines to take control over. As a next step, we need to add delete operations. Here it gets more complicated: the delete may be performed purely on Iceberg metadata (e.g. dropping a partition), but it may also require certain Parquet files to be rewritten. In that case, the old DataFile is dropped, and one or more DataFiles are added once the engine has rewritten the Parquet files, excluding the rows that need to be dropped.
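The split between metadata-only deletes and file rewrites could be sketched as a plan the library hands back to the engine. A minimal, hypothetical sketch (the names `DeleteTask` and `plan_delete` are illustrative, not iceberg-rust APIs):

```rust
// Hypothetical sketch: a library-produced delete plan that distinguishes
// metadata-only deletes from deletes that require rewriting data files.
#[derive(Debug)]
enum DeleteTask {
    /// Every row in the file matches the predicate: drop it from metadata only.
    DropFile { path: String },
    /// Only some rows match: the engine must rewrite the file without them.
    RewriteFile { path: String },
}

/// For each (path, fully_matched) pair, decide how the delete is carried out.
fn plan_delete(files: &[(&str, bool)]) -> Vec<DeleteTask> {
    files
        .iter()
        .map(|(path, fully_matched)| {
            if *fully_matched {
                DeleteTask::DropFile { path: path.to_string() }
            } else {
                DeleteTask::RewriteFile { path: path.to_string() }
            }
        })
        .collect()
}

fn main() {
    let plan = plan_delete(&[("a.parquet", true), ("b.parquet", false)]);
    println!("{:?}", plan);
}
```

The engine would execute the `RewriteFile` tasks and return the replacement DataFiles, mirroring how scan Tasks flow on the read path.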
When doing a delete, the following steps are taken:
As you might notice from the above, this is pretty similar to the read path, except that we need to invert the evaluators. For the read path, we check for `ROWS_MIGHT_MATCH` to include a file in the query plan. For the delete use case, we need to determine the opposite, namely `ROWS_CANNOT_MATCH`. Therefore we need to extend the evaluators. Once this is ready, we can incorporate it into the write path, and also easily add the update operation (append + delete).
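The inversion can be illustrated on a single `col > literal` predicate over per-file min/max stats. This is a hedged toy sketch, not the iceberg-rust evaluator API; all function and struct names here are invented for illustration:

```rust
// Per-column lower/upper bounds as stored in file-level metrics (illustrative).
struct ColumnStats {
    min: i64,
    max: i64,
}

/// Read path (inclusive evaluator): could ANY row in the file match `col > lit`?
fn rows_might_match_gt(stats: &ColumnStats, lit: i64) -> bool {
    stats.max > lit
}

/// Delete path: is it certain that NO row in the file matches `col > lit`?
/// This is the negation of the inclusive check above.
fn rows_cannot_match_gt(stats: &ColumnStats, lit: i64) -> bool {
    !rows_might_match_gt(stats, lit)
}

/// Strict variant: do ALL rows in the file match `col > lit`?
/// If so, the whole file can be dropped with a metadata-only operation.
fn rows_must_match_gt(stats: &ColumnStats, lit: i64) -> bool {
    stats.min > lit
}

fn main() {
    let stats = ColumnStats { min: 5, max: 10 };
    assert!(rows_might_match_gt(&stats, 7));   // max 10 > 7: some rows may match
    assert!(rows_cannot_match_gt(&stats, 10)); // max 10 is not > 10: no row matches
    assert!(rows_must_match_gt(&stats, 4));    // min 5 > 4: every row matches
    assert!(!rows_must_match_gt(&stats, 5));   // a row with value 5 does not match
    println!("evaluator sketch ok");
}
```

Files where `rows_cannot_match` holds are untouched by the delete; files where the strict check holds become metadata-only drops; everything else is handed to the engine for rewriting.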