Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Protocol: add restriction on path appearing at more once in a snapshot list of add files #4058

Merged
merged 6 commits into from
Jan 21, 2025
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 3 additions & 1 deletion PROTOCOL.md
Original file line number Diff line number Diff line change
Expand Up @@ -488,6 +488,8 @@ That means specifically that for any commit…
- it is **legal** for the same `path` to occur in an `add` action and a `remove` action, but with two different `dvId`s.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, this requirement from L485 above seems incorrect?

for simplicity it is required that there is at most one file action of the same type for any path (regardless of dvId).

Wouldn't a "normal" remove/add pair that updates the dvId violate that requirement?

The requirement might make sense for action reconciliation... but this is the spec for commit file content. Also, I believe CDC actions are also considered file actions, and it's legal to have an add and a cdc for the same path?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It says of the same type, and an add/remove counts as two types, so it's not a violation? Same for add and cdc.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right 🤦 -- reading comprehension fail on my part, sorry!

- it is **legal** for the same `path` to be added and/or removed and also occur in a `cdc` action.
- it is **illegal** for the same `path` to be occur twice with different `dvId`s within each set of `add` or `remove` actions.
- it is **illegal** for a `path` to occur in an `add` action that already occurs with a different `dvId` in the list of `add` actions from the snapshot of the version immediately preceeding the commit, unless the commit also contains a remove for the later combination.
- it is **legal** to commit an existing `path` and `dvId` combination again (this allows metadata updates).
zeevm marked this conversation as resolved.
Show resolved Hide resolved

The `dataChange` flag on either an `add` or a `remove` can be set to `false` to indicate that an action when combined with other actions in the same atomic version only rearranges existing data or adds new statistics.
For example, streaming queries that are tailing the transaction log can use this flag to skip actions that would not affect the final results.
Expand Down Expand Up @@ -825,7 +827,7 @@ A given snapshot of the table can be computed by replaying the events committed
- A single `metaData` action
- A collection of `txn` actions with unique `appId`s
- A collection of `domainMetadata` actions with unique `domain`s.
- A collection of `add` actions with unique `(path, deletionVector.uniqueId)` keys.
- A collection of `add` actions with unique path keys, corresponding to the newest (path, deletionVector.uniqueId) pair encountered for each path.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realized that the second half of this sentence I suggested is redundant with L839 below. Maybe it's still helpful here, to distinguish clearly between the path-only part of the rules vs. normal path+DV pairs?

- A collection of `remove` actions with unique `(path, deletionVector.uniqueId)` keys. The intersection of the primary keys in the `add` collection and `remove` collection must be empty. That means a logical file cannot exist in both the `remove` and `add` collections at the same time; however, the same *data file* can exist with *different* DVs in the `remove` collection, as logically they represent different content. The `remove` actions act as _tombstones_, and only exist for the benefit of the VACUUM command. Snapshot reads only return `add` actions on the read path.

To achieve the requirements above, related actions from different delta files need to be reconciled with each other:
Expand Down
Loading