[Feature] - Allow Schema Overwrite #1726


Open
mariotaddeucci opened this issue Feb 26, 2025 · 2 comments

Comments

@mariotaddeucci

Feature Request / Improvement

It would be beneficial to introduce an option in pyiceberg to completely overwrite the schema as part of the table update process. By allowing users to fully replace the schema through a simple API, we can leverage the existing schema evolution mechanism to offer a more complete solution for schema management, specifically when overwriting a table.

Proposed Changes:

  • Introduce a new method that allows users to replace the entire table schema.

I am very interested in working on this feature if it is approved. Please let me know your thoughts and any potential guidelines or additional requirements for this contribution.

@smaheshwar-pltr
Contributor

smaheshwar-pltr commented Feb 26, 2025

+1 - I'd also like schema replacement to be supported like this, unless I'm mistaken and it already is.

Without #433, a transactional "schema-replace-then-table-overwrite" gets us close to non-cumbersome table replacement (i.e. without individually changing columns), though there are still differences vs Spark of course, like the snapshot log not being cleared. Edit: on second thought, maybe this isn't a strong argument and this should be a request for overwrite instead.

@Fokko
Contributor

Fokko commented Mar 2, 2025

Hey @mariotaddeucci, can you share more about the context around this functionality? Also, based on what I see in the implementation in #1727, I'm reluctant to add this.

In the PR, if you have a table with data and you only replace the schema using the newly added API, it will return rows with all null values (since the existing data is projected onto the unrelated new schema). If you want an API similar to Union by Name, where you update the fields based on the changes, then this makes sense to me. For example:

from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, StringType, FloatType, DoubleType, LongType

catalog = load_catalog()
initial_schema = Schema(
    NestedField(1, "city_name", StringType(), required=False),
    NestedField(2, "latitude", FloatType(), required=False),
    NestedField(3, "longitude", FloatType(), required=False),
)

table = catalog.create_table("default.locations", initial_schema)
new_schema = Schema(
    NestedField(1, "city", StringType(), required=False),
    NestedField(2, "lat", DoubleType(), required=False),
    NestedField(3, "long", DoubleType(), required=False),
    NestedField(4, "population", LongType(), required=False),
)
with table.update_schema() as update:
    update.overwrite(new_schema)

# The big difference here is that if the field ID already exists, then we update the name/type/doc etc.
assert new_schema == table.schema()
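The field-ID-matching semantics described above can be sketched in plain Python, independent of pyiceberg. Everything below (the `Field` class, the `schema_diff_by_id` helper) is a hypothetical illustration of the idea, not pyiceberg's actual API: fields are matched by ID, matching IDs are updated in place, unknown IDs are added, and missing IDs are dropped.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical stand-in for pyiceberg's NestedField, for illustration only.
@dataclass(frozen=True)
class Field:
    field_id: int
    name: str
    type: str
    doc: Optional[str] = None

def schema_diff_by_id(existing: List[Field], new: List[Field]) -> List[Tuple]:
    """Compute the evolution steps implied by an ID-based overwrite:
    matching IDs are updated (rename / type change), unknown IDs are
    added, and IDs absent from the new schema are deleted."""
    existing_by_id = {f.field_id: f for f in existing}
    new_ids = {f.field_id for f in new}
    ops = []
    for f in new:
        old = existing_by_id.get(f.field_id)
        if old is None:
            ops.append(("add", f.name, f.type))
            continue
        if old.name != f.name:
            ops.append(("rename", old.name, f.name))
        if old.type != f.type:
            ops.append(("update", f.name, f.type))
    for f in existing:
        if f.field_id not in new_ids:
            ops.append(("delete", f.name))
    return ops

# The locations example from the comment above, with types as plain strings:
existing = [
    Field(1, "city_name", "string"),
    Field(2, "latitude", "float"),
    Field(3, "longitude", "float"),
]
new = [
    Field(1, "city", "string"),
    Field(2, "lat", "double"),
    Field(3, "long", "double"),
    Field(4, "population", "long"),
]
ops = schema_diff_by_id(existing, new)
```

On the locations example, this yields renames for fields 1-3, type widenings for fields 2-3, an add for field 4, and no deletes, which is exactly what the ID-based `overwrite` would apply via the existing schema-evolution mechanism.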

Code-wise, this would require a post-order traversal using a visitor-with-partner. A lot of inspiration can be taken from the UnionByName:

def union_by_name(self, new_schema: Union[Schema, "pa.Schema"]) -> UpdateSchema:
    from pyiceberg.catalog import Catalog

    visit_with_partner(
        Catalog._convert_schema_if_needed(new_schema),
        -1,
        _UnionByNameVisitor(update_schema=self, existing_schema=self._schema, case_sensitive=self._case_sensitive),  # type: ignore
        PartnerIdByNameAccessor(partner_schema=self._schema, case_sensitive=self._case_sensitive),
    )
    return self
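To make the post-order traversal concrete, here is a minimal, self-contained sketch of the visitor-with-partner idea in plain Python. The `Field`/`Struct` classes and the `record` visitor are hypothetical illustrations, not pyiceberg's actual `_UnionByNameVisitor` or `PartnerIdByNameAccessor`: nested children are visited before their parent (post-order), and each field in the new schema is paired with its partner in the existing schema by field ID.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical schema model, for illustration only.
@dataclass(frozen=True)
class Struct:
    fields: Tuple  # tuple of Field

@dataclass(frozen=True)
class Field:
    field_id: int
    name: str
    type: object  # a primitive type name (str) or a nested Struct

def visit_with_partner(new_field: Field, partner: Optional[Field], visitor) -> None:
    """Post-order traversal: children are visited before their parent,
    and each field in the new schema is paired with its partner
    (matched by field ID) in the existing schema, if any."""
    if isinstance(new_field.type, Struct):
        partner_children = {}
        if partner is not None and isinstance(partner.type, Struct):
            partner_children = {c.field_id: c for c in partner.type.fields}
        for child in new_field.type.fields:
            visit_with_partner(child, partner_children.get(child.field_id), visitor)
    visitor(new_field, partner)

ops = []

def record(new_field: Field, partner: Optional[Field]) -> None:
    """Visitor that records the schema-evolution step for one field."""
    if partner is None:
        ops.append(("add", new_field.name))
        return
    if partner.name != new_field.name:
        ops.append(("rename", partner.name, new_field.name))
    if not isinstance(new_field.type, Struct) and partner.type != new_field.type:
        ops.append(("update", new_field.name, new_field.type))

existing = Field(1, "location", Struct((Field(2, "latitude", "float"),)))
new = Field(1, "coords", Struct((Field(2, "lat", "double"), Field(3, "alt", "double"))))
visit_with_partner(new, existing, record)
```

Because the traversal is post-order, the rename of the nested `latitude` field is recorded before the rename of its parent struct, which is the ordering the real visitor would rely on when emitting updates.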
