-
Notifications
You must be signed in to change notification settings - Fork 272
add scan tests with null values #1865
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
||
def test_scan_complements(catalog: InMemoryCatalog, arrow_table_with_null: pa.Table) -> None: | ||
from pyiceberg.expressions.visitors import bind | ||
from pyiceberg.io.pyarrow import _expression_to_complementary_pyarrow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_expression_to_complementary_pyarrow
explicitly calls out null handling
iceberg-python/pyiceberg/io/pyarrow.py
Lines 883 to 886 in d69a191
def _expression_to_complementary_pyarrow(expr: BooleanExpression) -> pc.Expression: | |
"""Complementary filter conversion function of expression_to_pyarrow. | |
Could not use expression_to_pyarrow(Not(expr)) to achieve this complementary effect because ~ in pyarrow.compute.Expression does not handle null. |
# this should be 2 | ||
assert len(table.scan(row_filter="string != 'a'").to_arrow()) == 1 | ||
assert len(table.scan(row_filter=NotEqualTo(term="string", literal=("a"))).to_arrow()) == 1 | ||
assert len(table.scan(row_filter=Not(EqualTo(term="string", literal=("a")))).to_arrow()) == 1 | ||
assert len(table.scan().to_arrow().filter(pc.field("string") != "a")) == 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Fokko i think this might be a bug in how we handle nulls right now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ugh, this gets really messy. for example, if string != 'a'
should match null, what about ~(string != 'a')
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this looks incorrect to me. ~(string != 'a')
should be rewritten to string = 'a'
, this happens in rewrite_not
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another options is:
table.scan().to_arrow().filter(pc.coalesce(pc.field("string") != "a", pc.scalar(False))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another option is to remove the not
's in the beginning of the ScanPlan
.
Rationale for this change
Test scan with null values
Similar to apache/iceberg-rust#1045
Are these changes tested?
Are there any user-facing changes?