Skip to content

Refactor Metadata in Transaction #1903

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Apr 18, 2025
Merged

Conversation

Fokko
Copy link
Contributor

@Fokko Fokko commented Apr 9, 2025

Rationale for this change

Today, we have a copy of the TableMetadata on the Table and the Transaction. This PR changes that logic to re-use the one on the table, and add the changes to the one on the Transaction.

This also allows us to stack changes, for example, to first change a schema, and then write data with the new schema right away.

Also a prerequisite for #1772

Are these changes tested?

Includes a new test :)

Are there any user-facing changes?

Today, we have a copy of the `TableMetadata` on the `Table` and
the `Transaction`. This PR changes that logic to re-use the one
on the table, and add the changes to the one on the `Transaction`.

This also allows us to stack changes, for example, to first change
a schema, and then write data with the new schema right away.

Also a prerequisite for apache#1772
@Fokko Fokko force-pushed the fd-refactor-transaction branch from 8642065 to 8fe2ecd Compare April 9, 2025 21:55
Comment on lines +262 to +264
@property
def table_metadata(self) -> TableMetadata:
return update_table_metadata(self._table.metadata, self._updates)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious about performance implications of this change for when metadata gets large - with it, and assuming no autocommit, it seems that that each use of table_metadata will start with the original metadata of the table, not the current one of the transaction (which is what was done before), and apply all updates to it, not just the most recent one, copying it every time via model_copy?

I think what was here before, self.table_metadata = update_table_metadata(self.table_metadata, updates), only applies just the necessary updates within _apply, and stores results in a field along the way to continually update just the current transactional metadata. Because of that, continually using table_metadata, either via PyIceberg code or user code seemed cheap before but maybe no longer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Probably missing something though because I don't follow the change in behaviour caused by this change)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, your assessment is correct. The main issue that it tackles is that we remove the Metadata state on the Transaction layer. When we start implementing optimistic concurrency, before applying the commit, we could refresh the underlying table when we do a retry.

I think the code will have pretty decent performance since it will use Pydantic under the hood which delegates everything to their Rust layer, and also the singledispatch logic is also pretty performant.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarifying and updating the test - sounds good!

Copy link
Collaborator

@sungwy sungwy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a great idea @Fokko and I agree that the performance impact should be minimal. 💯

@Fokko Fokko merged commit 068ee5d into apache:main Apr 18, 2025
7 checks passed
@Fokko Fokko deleted the fd-refactor-transaction branch April 18, 2025 10:19
@Fokko
Copy link
Contributor Author

Fokko commented Apr 18, 2025

Awesome, thanks for the review @sungwy and @smaheshwar-pltr Let's move this forward 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants