Skip to content

add scan tests with null values #1865

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

kevinjqliu
Copy link
Contributor

Rationale for this change

Test scan with null values

Similar to apache/iceberg-rust#1045

Are these changes tested?

Are there any user-facing changes?


def test_scan_complements(catalog: InMemoryCatalog, arrow_table_with_null: pa.Table) -> None:
from pyiceberg.expressions.visitors import bind
from pyiceberg.io.pyarrow import _expression_to_complementary_pyarrow
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_expression_to_complementary_pyarrow explicitly calls out null handling

def _expression_to_complementary_pyarrow(expr: BooleanExpression) -> pc.Expression:
"""Complementary filter conversion function of expression_to_pyarrow.
Could not use expression_to_pyarrow(Not(expr)) to achieve this complementary effect because ~ in pyarrow.compute.Expression does not handle null.

Comment on lines +2349 to +2353
# this should be 2
assert len(table.scan(row_filter="string != 'a'").to_arrow()) == 1
assert len(table.scan(row_filter=NotEqualTo(term="string", literal=("a"))).to_arrow()) == 1
assert len(table.scan(row_filter=Not(EqualTo(term="string", literal=("a")))).to_arrow()) == 1
assert len(table.scan().to_arrow().filter(pc.field("string") != "a")) == 1
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Fokko i think this might be a bug in how we handle nulls right now

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ugh, this gets really messy. for example, if string != 'a' should match null, what about ~(string != 'a')?

Copy link
Contributor

@Fokko Fokko Mar 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this looks incorrect to me. ~(string != 'a') should be rewritten to string = 'a', this happens in rewrite_not.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another options is:

table.scan().to_arrow().filter(pc.coalesce(pc.field("string") != "a", pc.scalar(False))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option is to remove the not's in the beginning of the ScanPlan.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants