-
Notifications
You must be signed in to change notification settings - Fork 274
Upsert with NULL value not working #1835
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
@Benjamin-Lemaire I don't fully understand the issue? Is this about having a null-value in the column that you try to deduplicate on? Could you maybe write a test-case to reproduce it? That would be very helpful |
Sure! If you execute this code below on your own AWS account using Glue and S3, it creates an Iceberg table using the Glue catalog and adds one row where the column bar is initially NULL. The second UPSERT operation is supposed to update that NULL to 7, but it doesn't work because the filter sent to the table to identify the difference is: 7 != NULL. This condition will never be true since, in SQL, comparisons involving NULLs do not yield true or false but instead return unknown. import boto3
import pyarrow
import pandas
from pyiceberg.catalog import load_catalog
schema = pyarrow.schema([
('foo', pyarrow.string()),
('bar', pyarrow.int32()), # 'bar' is nullable
('baz', pyarrow.bool_())
])
database = "--> YOUR GLUE DATABASE <--"
table = "mytable"
bucket = "--> YOUR S3 BUCKET <--"
# Initialize S3 client
s3 = boto3.client('s3', region_name='us-east-1')
# Load Catalog
catalog = load_catalog("default", **{"type": "glue"})
table_location = f"s3://{bucket}/{database}/{table}"
# Create table in Glue Catalog
table = catalog.create_table_if_not_exists(
identifier=(database,table),
schema=schema,
location=table_location
)
# Upsert Data with NULL
data_with_null = [
{"foo": "apple", "bar": None, "baz": False}, # Ensuring nullable fields
]
data = pyarrow.Table.from_pandas(pandas.DataFrame(data_with_null),schema=schema)
table.upsert(data,join_cols=["foo"])
# Upsert Data without NULL
data_without_null = [
{"foo": "apple", "bar": 7, "baz": False},
]
data = pyarrow.Table.from_pandas(pandas.DataFrame(data_without_null),schema=schema)
table.upsert(data,join_cols=["foo"]) |
@Fokko , is it more clear with the example ? |
Thanks for reporting the issue @Benjamin-Lemaire #1861 should resolve this and will be included in the upcoming 0.9.1 release |
<!-- Thanks for opening a pull request! --> <!-- In the case this PR will resolve an issue, please replace ${GITHUB_ISSUE_ID} below with the actual Github issue id. --> <!-- Closes #${GITHUB_ISSUE_ID} --> # Rationale for this change Closes #1835 Original implementation, `!=` ([not_equal](https://arrow.apache.org/docs/python/generated/pyarrow.compute.not_equal.html#pyarrow.compute.not_equal)) does not account for `null` values. It emits `null` when either sides are `null` The new implementation, first checks for `not_equal`. And on null values, returns `true` only if both sides are `null` Similar to apache/iceberg-rust#1045 # Are these changes tested? Yes # Are there any user-facing changes? No <!-- In the case of user-facing changes, please add the changelog label. -->
Apache Iceberg version
None
Please describe the bug 🐞
The upsert function does not work when one of the two fields used for comparison is NULL. The diff_expr variable needs to be enhanced to handle NULL values in the fields being compared, as the expression 'value' != NULL will never evaluate to true.
Here is how I fixed it locally from module
upsert_util.py
Willingness to contribute
The text was updated successfully, but these errors were encountered: