Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Protocol: add restriction on path appearing at more once in a snapshot list of add files #4058

Merged
merged 6 commits into from
Jan 21, 2025

Conversation

zeevm
Copy link
Contributor

@zeevm zeevm commented Jan 15, 2025

add restriction on path appearing at more once in a snapshot list of add files per discussion at 4021

zeevm added 2 commits January 16, 2025 00:14
add restriction on path appearing at more once in a snapshot list of `add` files per discussion at delta-io#4021
Protocol: add restriction on a path appearing more than once in any snapshot list of `add` actions.
@zeevm
Copy link
Contributor Author

zeevm commented Jan 15, 2025

@larsk-db

Copy link
Contributor

@larsk-db larsk-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bart-samwel @ryan-johnson-databricks @tdas Could you take a look. Having conditions across commit and snapshot is kind of tricky.

PROTOCOL.md Outdated Show resolved Hide resolved
Co-authored-by: Lars Kroll <[email protected]>
PROTOCOL.md Show resolved Hide resolved
@@ -488,6 +488,8 @@ That means specifically that for any commit…
- it is **legal** for the same `path` to occur in an `add` action and a `remove` action, but with two different `dvId`s.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, this requirement from L485 above seems incorrect?

for simplicity it is required that there is at most one file action of the same type for any path (regardless of dvId).

Wouldn't a "normal" remove/add pair that updates the dvId violate that requirement?

The requirement might make sense for action reconciliation... but this is the spec for commit file content. Also, I believe CDC actions are also considered file actions, and it's legal to have an add and a cdc for the same path?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It says of the same type, and an add/remove counts as two types, so it's not a violation? Same for add and cdc.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right 🤦 -- reading comprehension fail on my part, sorry!

PROTOCOL.md Outdated Show resolved Hide resolved
@zeevm
Copy link
Contributor Author

zeevm commented Jan 16, 2025

Is the gate process broken? why should anything fail in checks if only a markdown file is changed?

@zeevm
Copy link
Contributor Author

zeevm commented Jan 17, 2025

@larsk-db @scovich
What's the plan going forward with this change?

Copy link
Collaborator

@bart-samwel bart-samwel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is OK. FWIW, in practice it will be very expensive to sanity check this correctness in actual clients. They will have to guarantee this "by construction", i.e., you can add a file that you just created yourself with a random UUID in the name and you can be sure that it won't be a duplicate, and you can add a file with a DV if you remove the same file with a different DV. And I think it's OK to rely on logic to get this right. I think the main point of the change is that clients are allowed to assume that this holds.

@@ -488,6 +488,8 @@ That means specifically that for any commit…
- it is **legal** for the same `path` to occur in an `add` action and a `remove` action, but with two different `dvId`s.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It says of the same type, and an add/remove counts as two types, so it's not a violation? Same for add and cdc.

@zeevm
Copy link
Contributor Author

zeevm commented Jan 18, 2025

I think the main point of the change is that clients are allowed to assume that this holds.

Exactly the point, thanks.

@@ -827,8 +827,8 @@ A given snapshot of the table can be computed by replaying the events committed
- A single `metaData` action
- A collection of `txn` actions with unique `appId`s
- A collection of `domainMetadata` actions with unique `domain`s.
- A collection of `add` actions with unique `(path, deletionVector.uniqueId)` keys.
- A collection of `remove` actions with unique `(path, deletionVector.uniqueId)` keys. The intersection of the primary keys in the `add` collection and `remove` collection must be empty. That means a logical file cannot exist in both the `remove` and `add` collections at the same time; however, the same *data file* can exist with *different* DVs in the `remove` collection, as logically they represent different content. The `remove` actions act as _tombstones_, and only exist for the benefit of the VACUUM command. Snapshot reads only return `add` actions on the read path. In any given snapshot, it is illegal for the same path to occur more than once in `add` actions.
- A collection of `add` actions with unique path keys, corresponding to the newest (path, deletionVector.uniqueId) pair encountered for each path.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realized that the second half of this sentence I suggested is redundant with L839 below. Maybe it's still helpful here, to distinguish clearly between the path-only part of the rules vs. normal path+DV pairs?

@vkorukanti vkorukanti merged commit 7600e8a into delta-io:master Jan 21, 2025
14 of 19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants