Upsert with NULL value not working #1835

Benjamin-Lemaire · 2025-03-24T09:37:13Z

Apache Iceberg version

None

Please describe the bug 🐞

The upsert function does not work when one of the two fields used for comparison is NULL. The diff_expr variable needs to be enhanced to handle NULL values in the fields being compared, as the expression 'value' != NULL will never evaluate to true.

Here is how I fixed it locally from module upsert_util.py

Willingness to contribute

I can contribute a fix for this bug independently
I would be willing to contribute a fix for this bug with guidance from the Iceberg community
I cannot contribute a fix for this bug at this time

The text was updated successfully, but these errors were encountered:

Fokko · 2025-03-24T11:54:43Z

@Benjamin-Lemaire I don't fully understand the issue? Is this about having a null-value in the column that you try to deduplicate on? Could you maybe write a test-case to reproduce it? That would be very helpful

Benjamin-Lemaire · 2025-03-24T13:43:50Z

Sure! If you execute this code below on your own AWS account using Glue and S3, it creates an Iceberg table using the Glue catalog and adds one row where the column bar is initially NULL. The second UPSERT operation is supposed to update that NULL to 7, but it doesn't work because the filter sent to the table to identify the difference is: 7 != NULL. This condition will never be true since, in SQL, comparisons involving NULLs do not yield true or false but instead return unknown.

import boto3
import pyarrow
import pandas
from pyiceberg.catalog import load_catalog

schema = pyarrow.schema([
    ('foo', pyarrow.string()),
    ('bar', pyarrow.int32()),  # 'bar' is nullable
    ('baz', pyarrow.bool_())
])

database = "--> YOUR GLUE DATABASE <--"
table = "mytable"
bucket = "--> YOUR S3 BUCKET <--"

# Initialize S3 client
s3 = boto3.client('s3', region_name='us-east-1')

# Load Catalog
catalog = load_catalog("default", **{"type": "glue"})
table_location = f"s3://{bucket}/{database}/{table}"

# Create table in Glue Catalog
table = catalog.create_table_if_not_exists(
    identifier=(database,table),
    schema=schema,
    location=table_location
)

# Upsert Data with NULL
data_with_null = [
    {"foo": "apple", "bar": None, "baz": False}, # Ensuring nullable fields
]
data = pyarrow.Table.from_pandas(pandas.DataFrame(data_with_null),schema=schema)
table.upsert(data,join_cols=["foo"])
    
# Upsert Data without NULL
data_without_null = [
    {"foo": "apple", "bar": 7, "baz": False}, 
]
data = pyarrow.Table.from_pandas(pandas.DataFrame(data_without_null),schema=schema)
table.upsert(data,join_cols=["foo"])

Benjamin-Lemaire · 2025-03-26T16:59:27Z

@Fokko , is it more clear with the example ?

kevinjqliu · 2025-03-31T17:34:53Z

Thanks for reporting the issue @Benjamin-Lemaire #1861 should resolve this and will be included in the upcoming 0.9.1 release

# Rationale for this change Closes #1835 Original implementation, `!=` ([not_equal](https://arrow.apache.org/docs/python/generated/pyarrow.compute.not_equal.html#pyarrow.compute.not_equal)) does not account for `null` values. It emits `null` when either sides are `null` The new implementation, first checks for `not_equal`. And on null values, returns `true` only if both sides are `null` Similar to apache/iceberg-rust#1045 # Are these changes tested? Yes # Are there any user-facing changes? No

kevinjqliu mentioned this issue Mar 30, 2025

fix upsert with null values #1861

Merged

kevinjqliu closed this as completed in #1861 Mar 31, 2025

kevinjqliu closed this as completed in d69a191 Mar 31, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Upsert with NULL value not working #1835

Upsert with NULL value not working #1835

Benjamin-Lemaire commented Mar 24, 2025

Fokko commented Mar 24, 2025

Benjamin-Lemaire commented Mar 24, 2025 •

edited

Loading

Benjamin-Lemaire commented Mar 26, 2025

kevinjqliu commented Mar 31, 2025

Upsert with NULL value not working #1835

Upsert with NULL value not working #1835

Comments

Benjamin-Lemaire commented Mar 24, 2025

Apache Iceberg version

Please describe the bug 🐞

Willingness to contribute

Fokko commented Mar 24, 2025

Benjamin-Lemaire commented Mar 24, 2025 • edited Loading

Benjamin-Lemaire commented Mar 26, 2025

kevinjqliu commented Mar 31, 2025

Benjamin-Lemaire commented Mar 24, 2025 •

edited

Loading