[Feature] - Allow Schema Overwrite #1726


Open
mariotaddeucci opened this issue Feb 26, 2025 · 2 comments

Comments

@mariotaddeucci

Feature Request / Improvement

It would be beneficial to introduce an option in pyiceberg to completely overwrite the schema as part of the table update process. By allowing users to fully replace the schema through a simple API, we can leverage the existing schema evolution mechanism to offer a more complete solution for schema management, specifically when overwriting a table.

Proposed Changes:

  • Introduce a new method that allows users to replace the entire table schema.

I am very interested in working on this feature if it is approved. Please let me know your thoughts and any potential guidelines or additional requirements for this contribution.

@smaheshwar-pltr
Contributor

smaheshwar-pltr commented Feb 26, 2025

+1 - I'd also like schema replacement to be supported like this, unless I'm mistaken and it already is.

Without #433, a transactional "schema-replace-then-table-overwrite" gets us close to non-cumbersome table replacement (i.e. without individually changing columns), though there are still differences vs Spark of course, like the snapshot log not being cleared. Edit: on second thought, maybe this isn't a strong argument and this should be a request for overwrite instead.

@Fokko
Contributor

Fokko commented Mar 2, 2025

Hey @mariotaddeucci, can you share more about the context around this functionality? Also, based on what I see in the implementation in #1727, I'm reluctant to add this.

In the PR, if you have a table with data and you only replace the schema using the newly added API, it will return rows with all null values (since the existing data is projected onto the unrelated new schema). If you want an API similar to Union by Name, where you update the fields based on the changes, then this makes sense to me. For example:

from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, StringType, FloatType, DoubleType, LongType

catalog = load_catalog()
initial_schema = Schema(
    NestedField(1, "city_name", StringType(), required=False),
    NestedField(2, "latitude", FloatType(), required=False),
    NestedField(3, "longitude", FloatType(), required=False),
)

table = catalog.create_table("default.locations", initial_schema)
new_schema = Schema(
    NestedField(1, "city", StringType(), required=False),
    NestedField(2, "lat", DoubleType(), required=False),
    NestedField(3, "long", DoubleType(), required=False),
    NestedField(4, "population", LongType(), required=False),
)
with table.update_schema() as update:
    update.overwrite(new_schema)

# The big difference here is that if the field ID already exists, then we update the name/type/doc etc.
assert new_schema == table.schema()
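The field-ID-matching semantics described above can be sketched in plain Python, independent of pyiceberg. Everything below (the `Field` class, the `schema_diff_by_id` helper) is a hypothetical illustration of the idea, not pyiceberg's actual API: fields are matched by ID, matching IDs are updated in place, unknown IDs are added, and missing IDs are dropped.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical stand-in for pyiceberg's NestedField, for illustration only.
@dataclass(frozen=True)
class Field:
    field_id: int
    name: str
    type: str
    doc: Optional[str] = None

def schema_diff_by_id(existing: List[Field], new: List[Field]) -> List[Tuple]:
    """Compute the evolution steps implied by an ID-based overwrite:
    matching IDs are updated (rename / type change), unknown IDs are
    added, and IDs absent from the new schema are deleted."""
    existing_by_id = {f.field_id: f for f in existing}
    new_ids = {f.field_id for f in new}
    ops = []
    for f in new:
        old = existing_by_id.get(f.field_id)
        if old is None:
            ops.append(("add", f.name, f.type))
            continue
        if old.name != f.name:
            ops.append(("rename", old.name, f.name))
        if old.type != f.type:
            ops.append(("update", f.name, f.type))
    for f in existing:
        if f.field_id not in new_ids:
            ops.append(("delete", f.name))
    return ops

# The locations example from the comment above, with types as plain strings:
existing = [
    Field(1, "city_name", "string"),
    Field(2, "latitude", "float"),
    Field(3, "longitude", "float"),
]
new = [
    Field(1, "city", "string"),
    Field(2, "lat", "double"),
    Field(3, "long", "double"),
    Field(4, "population", "long"),
]
ops = schema_diff_by_id(existing, new)
```

On the locations example, this yields renames for fields 1-3, type widenings for fields 2-3, an add for field 4, and no deletes, which is exactly what the ID-based `overwrite` would apply via the existing schema-evolution mechanism.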

Code-wise, this would require a post-order traversal using a visitor-with-partner. A lot of inspiration can be taken from the UnionByName:

def union_by_name(self, new_schema: Union[Schema, "pa.Schema"]) -> UpdateSchema:
    from pyiceberg.catalog import Catalog

    visit_with_partner(
        Catalog._convert_schema_if_needed(new_schema),
        -1,
        _UnionByNameVisitor(update_schema=self, existing_schema=self._schema, case_sensitive=self._case_sensitive),  # type: ignore
        PartnerIdByNameAccessor(partner_schema=self._schema, case_sensitive=self._case_sensitive),
    )
    return self
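To make the post-order traversal concrete, here is a minimal, self-contained sketch of the visitor-with-partner idea in plain Python. The `Field`/`Struct` classes and the `record` visitor are hypothetical illustrations, not pyiceberg's actual `_UnionByNameVisitor` or `PartnerIdByNameAccessor`: nested children are visited before their parent (post-order), and each field in the new schema is paired with its partner in the existing schema by field ID.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical schema model, for illustration only.
@dataclass(frozen=True)
class Struct:
    fields: Tuple  # tuple of Field

@dataclass(frozen=True)
class Field:
    field_id: int
    name: str
    type: object  # a primitive type name (str) or a nested Struct

def visit_with_partner(new_field: Field, partner: Optional[Field], visitor) -> None:
    """Post-order traversal: children are visited before their parent,
    and each field in the new schema is paired with its partner
    (matched by field ID) in the existing schema, if any."""
    if isinstance(new_field.type, Struct):
        partner_children = {}
        if partner is not None and isinstance(partner.type, Struct):
            partner_children = {c.field_id: c for c in partner.type.fields}
        for child in new_field.type.fields:
            visit_with_partner(child, partner_children.get(child.field_id), visitor)
    visitor(new_field, partner)

ops = []

def record(new_field: Field, partner: Optional[Field]) -> None:
    """Visitor that records the schema-evolution step for one field."""
    if partner is None:
        ops.append(("add", new_field.name))
        return
    if partner.name != new_field.name:
        ops.append(("rename", partner.name, new_field.name))
    if not isinstance(new_field.type, Struct) and partner.type != new_field.type:
        ops.append(("update", new_field.name, new_field.type))

existing = Field(1, "location", Struct((Field(2, "latitude", "float"),)))
new = Field(1, "coords", Struct((Field(2, "lat", "double"), Field(3, "alt", "double"))))
visit_with_partner(new, existing, record)
```

Because the traversal is post-order, the rename of the nested `latitude` field is recorded before the rename of its parent struct, which is the ordering the real visitor would rely on when emitting updates.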
