Refactor Metadata in Transaction #1903
Conversation
Force-pushed from 8642065 to 8fe2ecd
```python
@property
def table_metadata(self) -> TableMetadata:
    return update_table_metadata(self._table.metadata, self._updates)
```
Curious about the performance implications of this change for when metadata gets large. With it, and assuming no autocommit, it seems that each use of `table_metadata` will start with the original metadata of the table, not the current one of the transaction (which is what was done before), and apply all updates to it, not just the most recent one, copying it every time via `model_copy`?

I think what was here before, `self.table_metadata = update_table_metadata(self.table_metadata, updates)`, applied only the necessary updates within `_apply`, storing the result in a field along the way so that just the current transactional metadata was continually updated. Because of that, repeated use of `table_metadata`, whether from PyIceberg code or user code, seemed cheap before, but maybe no longer.
(Probably missing something, though, because I don't fully follow the behaviour change this causes.)
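For illustration, here is a minimal sketch of the two approaches being compared. It is simplified: the real `update_table_metadata` applies typed update objects immutably via Pydantic, while here updates are modeled as plain callables so the sketch is self-contained and runnable.

```python
# Stand-in for the real function, which copies metadata via Pydantic's
# model_copy for each typed update; here we just fold callables left to right.
def update_table_metadata(base, updates):
    for update in updates:
        base = update(base)
    return base


class TransactionBefore:
    """Old behaviour: cache the updated metadata on the transaction."""

    def __init__(self, table):
        self.table_metadata = table.metadata

    def _apply(self, updates):
        # Applies only the new batch of updates, on top of the cached copy.
        self.table_metadata = update_table_metadata(self.table_metadata, updates)


class TransactionAfter:
    """New behaviour: keep only the update list; recompute on every read."""

    def __init__(self, table):
        self._table = table
        self._updates = ()

    def _apply(self, updates):
        self._updates += tuple(updates)

    @property
    def table_metadata(self):
        # Replays all accumulated updates against the table's original
        # metadata each time the property is accessed.
        return update_table_metadata(self._table.metadata, self._updates)
```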
Yes, your assessment is correct. The main issue that it tackles is that we remove the `Metadata` state on the `Transaction` layer. When we start implementing optimistic concurrency, before applying the commit, we can refresh the underlying table when we do a retry.

I think the code will still have pretty decent performance, since it uses Pydantic under the hood, which delegates everything to its Rust layer, and the `singledispatch` logic is also pretty performant.
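For context, the `singledispatch` pattern referred to here looks roughly like the following. This is a generic sketch with made-up update types (`SetCurrentSchemaUpdate`, `SetPropertiesUpdate` and a bare `Metadata` record are illustrative); the actual PyIceberg registrations and signatures differ.

```python
from dataclasses import dataclass, replace
from functools import singledispatch


@dataclass(frozen=True)
class Metadata:
    schema_id: int
    properties: dict


@dataclass(frozen=True)
class SetCurrentSchemaUpdate:
    schema_id: int


@dataclass(frozen=True)
class SetPropertiesUpdate:
    updates: dict


@singledispatch
def apply_update(update, base: Metadata) -> Metadata:
    # Fallback for update types without a registered handler.
    raise NotImplementedError(f"Unsupported update: {update}")


@apply_update.register
def _(update: SetCurrentSchemaUpdate, base: Metadata) -> Metadata:
    return replace(base, schema_id=update.schema_id)


@apply_update.register
def _(update: SetPropertiesUpdate, base: Metadata) -> Metadata:
    return replace(base, properties={**base.properties, **update.updates})


base = Metadata(schema_id=0, properties={})
print(apply_update(SetPropertiesUpdate({"owner": "fokko"}), base))
```

Dispatch happens once per update on the type of the first argument, so applying a list of updates stays a flat loop rather than a chain of `isinstance` checks.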
Thanks for clarifying and updating the test - sounds good!
I think this is a great idea @Fokko and I agree that the performance impact should be minimal. 💯
Awesome, thanks for the review @sungwy and @smaheshwar-pltr! Let's move this forward 👍
Rationale for this change
Today, we have a copy of the `TableMetadata` on the `Table` and the `Transaction`. This PR changes that logic to re-use the one on the table and apply the changes to the one on the `Transaction`.

This also allows us to stack changes, for example, to first change a schema and then write data with the new schema right away.

Also a prerequisite for #1772
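As a sketch of the stacking this enables, assuming PyIceberg's public transaction API (the catalog name, table identifier, and column are illustrative, and in practice the appended Arrow table must match the table's full schema):

```python
import pyarrow as pa

from pyiceberg.catalog import load_catalog
from pyiceberg.types import StringType

catalog = load_catalog("default")          # illustrative catalog
table = catalog.load_table("db.events")    # illustrative table

with table.transaction() as txn:
    # First change the schema...
    with txn.update_schema() as update:
        update.add_column("category", StringType())
    # ...then write data that already uses the new schema, within the
    # same transaction: txn.table_metadata reflects the pending update.
    txn.append(pa.table({"category": ["a", "b"]}))
```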
Are these changes tested?
Includes a new test :)
Are there any user-facing changes?