-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Protocol: add restriction on path appearing at more once in a snapshot list of add
files
#4058
Conversation
add restriction on path appearing at more once in a snapshot list of `add` files per discussion at delta-io#4021
Protocol: add restriction on a path appearing more than once in any snapshot list of `add` actions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bart-samwel @ryan-johnson-databricks @tdas Could you take a look. Having conditions across commit and snapshot is kind of tricky.
Co-authored-by: Lars Kroll <[email protected]>
@@ -488,6 +488,8 @@ That means specifically that for any commit… | |||
- it is **legal** for the same `path` to occur in an `add` action and a `remove` action, but with two different `dvId`s. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, this requirement from L485 above seems incorrect?
for simplicity it is required that there is at most one file action of the same type for any
path
(regardless ofdvId
).
Wouldn't a "normal" remove/add pair that updates the dvId
violate that requirement?
The requirement might make sense for action reconciliation... but this is the spec for commit file content. Also, I believe CDC actions are also considered file actions, and it's legal to have an add and a cdc for the same path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It says of the same type, and an add/remove counts as two types, so it's not a violation? Same for add and cdc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right 🤦 -- reading comprehension fail on my part, sorry!
Is the gate process broken? why should anything fail in checks if only a markdown file is changed? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is OK. FWIW, in practice it will be very expensive to sanity check this correctness in actual clients. They will have to guarantee this "by construction", i.e., you can add a file that you just created yourself with a random UUID in the name and you can be sure that it won't be a duplicate, and you can add a file with a DV if you remove the same file with a different DV. And I think it's OK to rely on logic to get this right. I think the main point of the change is that clients are allowed to assume that this holds.
@@ -488,6 +488,8 @@ That means specifically that for any commit… | |||
- it is **legal** for the same `path` to occur in an `add` action and a `remove` action, but with two different `dvId`s. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It says of the same type, and an add/remove counts as two types, so it's not a violation? Same for add and cdc.
Exactly the point, thanks. |
@@ -827,8 +827,8 @@ A given snapshot of the table can be computed by replaying the events committed | |||
- A single `metaData` action | |||
- A collection of `txn` actions with unique `appId`s | |||
- A collection of `domainMetadata` actions with unique `domain`s. | |||
- A collection of `add` actions with unique `(path, deletionVector.uniqueId)` keys. | |||
- A collection of `remove` actions with unique `(path, deletionVector.uniqueId)` keys. The intersection of the primary keys in the `add` collection and `remove` collection must be empty. That means a logical file cannot exist in both the `remove` and `add` collections at the same time; however, the same *data file* can exist with *different* DVs in the `remove` collection, as logically they represent different content. The `remove` actions act as _tombstones_, and only exist for the benefit of the VACUUM command. Snapshot reads only return `add` actions on the read path. In any given snapshot, it is illegal for the same path to occur more than once in `add` actions. | |||
- A collection of `add` actions with unique path keys, corresponding to the newest (path, deletionVector.uniqueId) pair encountered for each path. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just realized that the second half of this sentence I suggested is redundant with L839 below. Maybe it's still helpful here, to distinguish clearly between the path-only part of the rules vs. normal path+DV pairs?
add restriction on path appearing at more once in a snapshot list of
add
files per discussion at 4021