-
Notifications
You must be signed in to change notification settings - Fork 270
feat: validation_history
and ancestors_between
#1935
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
validation_history
and ancestors_between
validation_history
and ancestors_between
validation_history
and ancestors_between
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @jayceslesar thanks for working on this PR! I think having these helper functions will be wonderful in setting up the rest of the subtasks for success. I've left a few comments around the nullability of the snapshots, and the required behavior of ancestors_between
function
pyiceberg/table/snapshots.py
Outdated
|
||
|
||
def ancestors_between( | ||
current_snapshot: Optional[Snapshot], oldest_snapshot: Snapshot, table_metadata: TableMetadata |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't looks correct: the current_snapshot
is a required field oldest_snapshot
could be None and in that case, we also should provide the entire list of ancestors of the current_snapshot
I think the implementation that's in progress on this PR is closer to what we'd want: https://github.com/apache/iceberg-python/pull/533/files
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should now be resolved, thank you for the link to that
validation_history
and ancestors_between
validation_history
and ancestors_between
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this @jayceslesar
pyiceberg/table/update/validate.py
Outdated
def validation_history( | ||
table: Table, | ||
from_snapshot: Snapshot, | ||
to_snapshot: Optional[Snapshot], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think of making the API a bit more strict and simplify code along the way. I don't think this needs to be Optional
:
to_snapshot: Optional[Snapshot], | |
to_snapshot: Snapshot, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Fokko I think the existing behavior of the Spark/Java API is for to check all snapshots if the from_snapshot
is Optional.
Should we mirror that behavior in PyIceberg, or just require that the validate_from_snapshot_id
be specified if isolation_level
is not None?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My initial thought was that we could always widen the API and make it optional.
Looking at it more closely, only the SparkWriteConf
is optional, we only use the value if is != null
:
- https://github.com/apache/iceberg/blob/d7f8e5400519f84cd2af82d7769713b24e916809/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkWrite.java#L317-L321
- https://github.com/apache/iceberg/blob/d7f8e5400519f84cd2af82d7769713b24e916809/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkWrite.java#L361-L366
pyiceberg/table/update/validate.py
Outdated
for snapshot in ancestors_between(from_snapshot, to_snapshot, table.metadata): | ||
last_snapshot = snapshot | ||
summary = snapshot.summary | ||
if summary is None or summary.operation not in matching_operations: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks dangerous to me, I don't think we want to skip over this if the summary is None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the right thing here is to assume that it is an Overwrite
operation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is a case where summary would ever be None
? I can't seem to find any checks against that in the codebase except for https://github.com/apache/iceberg-python/blob/main/pyiceberg/table/snapshots.py#L367 which seems to be checking against an empty dict (well mapping) and should also be checking the or is None
case imo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It now defaults to Operation.OVERWRITE
if summary is determined to be None
, but still wondering about above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
summary
can be None
if the Table format version is v1
according to the Iceberg Spec: https://iceberg.apache.org/spec/#snapshots
Instead of setting it as Operation.OVERWRITE
I think an alternative approach is to just throw an exception if summary
field is None with a helpful error message that explains that validation_history
cannot be generated in the absence of summary fields. Wdyt @Fokko @jayceslesar ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me, implemented and added to the existing test to assure that we raise an error when that happens. Planning on cleaning up tests in the next MR when I will likely need to make some fixtures & parameterize
pyiceberg/table/update/validate.py
Outdated
List of manifest files and set of snapshots matching conditions | ||
""" | ||
manifests_files: list[ManifestFile] = [] | ||
snapshots: set[Snapshot] = set() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Java just returns ints
, we can return Snapshot
's as well, but with the set
we lose the order.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great, I am happy to remove this being a set, I was just attempting to copy what made sense from java
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This now returns a unique list, happy to modify that to allow duplicates if wanted
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, or return Set[int]
, similar to Java :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
__hash__
dunder on Snapshot
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nah, will just return the Set[int]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
__hash__
dunder on Snapshot ?
I like that as well, but let's do that in a separate PR 👍
Now we're back at a set, we can remove the in
check below:
if snapshot not in snapshots:
snapshots.add(snapshot.snapshot_id)
…uired in `validation_history`
…s determined to be `None`
pyiceberg/table/update/validate.py
Outdated
if last_snapshot is None or last_snapshot.snapshot_id == from_snapshot.snapshot_id: | ||
raise ValidationException("No matching snapshot found.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can leave it like this for now, but it seems odd to throw here, since there is no history to validate, which in turn is valid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possibly me just not translating this java correctly at all https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java#L894
Co-authored-by: Fokko Driesprong <[email protected]>
Co-authored-by: Fokko Driesprong <[email protected]>
pyiceberg/table/update/validate.py
Outdated
] | ||
) | ||
|
||
if last_snapshot is None or last_snapshot.snapshot_id == from_snapshot.snapshot_id: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if last_snapshot is None or last_snapshot.snapshot_id == from_snapshot.snapshot_id: | |
if last_snapshot is not None and last_snapshot.snapshot_id != from_snapshot.snapshot_id: |
`if not (last_snapshot is not None and last_snapshot.snapshot_id != from_snapshot.snapshot_id):`
The ValidationCheck
in java throws an Exception with the provided error message if the boolean condition in doesn't hold. So we'd actually be throwing in the inverse of this condition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made this change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, changing this to be correct also caught an issue with ancestors_between
that is not fixed!
tests/table/test_validate.py
Outdated
with patch("pyiceberg.table.update.validate.ancestors_between", return_value=[snapshot_with_no_summary]): | ||
with pytest.raises(ValidationException): | ||
validation_history( | ||
table_v2_with_extensive_snapshots, | ||
newest_snapshot, | ||
oldest_snapshot, | ||
{Operation.APPEND}, | ||
ManifestContent.DATA, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be best to create a separate test function that tests this failure case, like test_validation_history_fails_on_snapshot_with_no_summary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
tests/table/test_validate.py
Outdated
from pyiceberg.table.update.validate import validation_history | ||
|
||
|
||
def test_validation_history(table_v2_with_extensive_snapshots: Table) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we add another test which tests the expected failure when from_snapshot
value doesn't match a snapshot in the history of the table?
Hi @jayceslesar ! Thank you for pressing on with this PR! This looks almost good to merge. I share the same concerns with @Fokko regarding the validation check - I fear that it's currently not checking the right condition. Let's get that fix in, and merge this PR in soon! 🥳 |
Rationale for this change
Adds
validation_history
that will be used in support of #819Also adds
ancestors_between
.Based off of the java implementation but we will likely want changes that feel more like python :). Already took a few liberties there
Are these changes tested?
Added tests
References:
java implementation of
validation_history
:https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java#L859
java implementation of
ancestors_between
:https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/util/SnapshotUtil.java#L213