Skip to content

Upsert with NULL value not working #1835

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
3 tasks
Benjamin-Lemaire opened this issue Mar 24, 2025 · 4 comments · Fixed by #1861
Closed
3 tasks

Upsert with NULL value not working #1835

Benjamin-Lemaire opened this issue Mar 24, 2025 · 4 comments · Fixed by #1861

Comments

@Benjamin-Lemaire
Copy link

Apache Iceberg version

None

Please describe the bug 🐞

The upsert function does not work when one of the two fields used for comparison is NULL. The diff_expr variable needs to be enhanced to handle NULL values in the fields being compared, as the expression 'value' != NULL will never evaluate to true.

Here is how I fixed it locally from module upsert_util.py

Image

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
@Fokko
Copy link
Contributor

Fokko commented Mar 24, 2025

@Benjamin-Lemaire I don't fully understand the issue? Is this about having a null-value in the column that you try to deduplicate on? Could you maybe write a test-case to reproduce it? That would be very helpful

@Benjamin-Lemaire
Copy link
Author

Benjamin-Lemaire commented Mar 24, 2025

Sure! If you execute this code below on your own AWS account using Glue and S3, it creates an Iceberg table using the Glue catalog and adds one row where the column bar is initially NULL. The second UPSERT operation is supposed to update that NULL to 7, but it doesn't work because the filter sent to the table to identify the difference is: 7 != NULL. This condition will never be true since, in SQL, comparisons involving NULLs do not yield true or false but instead return unknown.

import boto3
import pyarrow
import pandas
from pyiceberg.catalog import load_catalog

schema = pyarrow.schema([
    ('foo', pyarrow.string()),
    ('bar', pyarrow.int32()),  # 'bar' is nullable
    ('baz', pyarrow.bool_())
])

database = "--> YOUR GLUE DATABASE <--"
table = "mytable"
bucket = "--> YOUR S3 BUCKET <--"

# Initialize S3 client
s3 = boto3.client('s3', region_name='us-east-1')

# Load Catalog
catalog = load_catalog("default", **{"type": "glue"})
table_location = f"s3://{bucket}/{database}/{table}"

# Create table in Glue Catalog
table = catalog.create_table_if_not_exists(
    identifier=(database,table),
    schema=schema,
    location=table_location
)

# Upsert Data with NULL
data_with_null = [
    {"foo": "apple", "bar": None, "baz": False}, # Ensuring nullable fields
]
data = pyarrow.Table.from_pandas(pandas.DataFrame(data_with_null),schema=schema)
table.upsert(data,join_cols=["foo"])
    
# Upsert Data without NULL
data_without_null = [
    {"foo": "apple", "bar": 7, "baz": False}, 
]
data = pyarrow.Table.from_pandas(pandas.DataFrame(data_without_null),schema=schema)
table.upsert(data,join_cols=["foo"])

@Benjamin-Lemaire
Copy link
Author

@Fokko , is it more clear with the example ?

@kevinjqliu
Copy link
Contributor

Thanks for reporting the issue @Benjamin-Lemaire #1861 should resolve this and will be included in the upcoming 0.9.1 release

Fokko pushed a commit that referenced this issue Apr 17, 2025
<!--
Thanks for opening a pull request!
-->

<!-- In the case this PR will resolve an issue, please replace
${GITHUB_ISSUE_ID} below with the actual Github issue id. -->
<!-- Closes #${GITHUB_ISSUE_ID} -->

# Rationale for this change
Closes #1835

Original implementation, `!=`
([not_equal](https://arrow.apache.org/docs/python/generated/pyarrow.compute.not_equal.html#pyarrow.compute.not_equal))
does not account for `null` values. It emits `null` when either sides
are `null`
The new implementation, first checks for `not_equal`. And on null
values, returns `true` only if both sides are `null`

Similar to apache/iceberg-rust#1045

# Are these changes tested?
Yes

# Are there any user-facing changes?
No

<!-- In the case of user-facing changes, please add the changelog label.
-->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants