feat: bloom filter pushdown #2398

Open
xanderbailey wants to merge 6 commits into apache:main from xanderbailey:xb/bloom_filter_pushdown

@xanderbailey
Contributor

Which issue does this PR close?

  • Closes #.

What changes are included in this PR?

Adds bloom filter pushdown for equality predicates during Parquet reads. When enabled, the reader loads bloom filters from row group column chunks and uses them to skip row groups that definitely don't contain the queried values.
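As a rough illustration of the pruning idea described above, here is a minimal sketch using a toy bloom filter built on the standard library hasher. It is not Parquet's split-block bloom filter and it is not the code in this PR; `ToyBloom` and `select_row_groups` are hypothetical names introduced only for this example.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy bloom filter for illustration only (NOT Parquet's split-block format).
/// Assumes `num_bits > 0`.
struct ToyBloom {
    bits: Vec<bool>,
    num_hashes: u64,
}

impl ToyBloom {
    fn new(num_bits: usize, num_hashes: u64) -> Self {
        Self { bits: vec![false; num_bits], num_hashes }
    }

    /// Derive a bit position by hashing the pair (seed, value).
    fn position<T: Hash>(&self, value: &T, seed: u64) -> usize {
        let mut hasher = DefaultHasher::new();
        seed.hash(&mut hasher);
        value.hash(&mut hasher);
        (hasher.finish() % self.bits.len() as u64) as usize
    }

    fn insert<T: Hash>(&mut self, value: &T) {
        for seed in 0..self.num_hashes {
            let pos = self.position(value, seed);
            self.bits[pos] = true;
        }
    }

    /// `false` means the value is definitely absent;
    /// `true` means it may be present (false positives are possible).
    fn might_contain<T: Hash>(&self, value: &T) -> bool {
        (0..self.num_hashes).all(|seed| self.bits[self.position(value, seed)])
    }
}

/// Return the indices of row groups that still have to be read for
/// `column = value`. A row group without a bloom filter (`None`) is kept,
/// because nothing can be proven about it.
fn select_row_groups(filters: &[Option<ToyBloom>], value: i64) -> Vec<usize> {
    let mut keep = Vec::new();
    for (idx, filter) in filters.iter().enumerate() {
        let may_match = match filter {
            Some(bloom) => bloom.might_contain(&value),
            None => true,
        };
        if may_match {
            keep.push(idx);
        }
    }
    keep
}
```

The important property is the asymmetry: a negative answer is safe to act on, while a positive answer only means the row group cannot be skipped.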

Key points:

  • New bloom_filter_enabled option on TableScanBuilder and ArrowReaderBuilder (off by default since it requires extra I/O per column per row group)
  • Only loads bloom filters for columns referenced in eq or in predicates; range predicates and other operators are ignored
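To make the second point concrete, a sketch of how the set of columns worth loading bloom filters for could be derived from a predicate tree. The `Predicate` enum and `bloom_filter_columns` below are hypothetical stand-ins for illustration, not the crate's `BoundPredicate` or the PR's implementation.

```rust
use std::collections::BTreeSet;

/// Hypothetical, simplified predicate tree (not the iceberg-rust types).
#[allow(dead_code)]
enum Predicate {
    Eq { column: String, value: i64 },
    In { column: String, values: Vec<i64> },
    LessThan { column: String, value: i64 },
    And(Box<Predicate>, Box<Predicate>),
    Or(Box<Predicate>, Box<Predicate>),
}

/// Walk the tree and record only columns used in `eq` or `in` predicates.
fn collect(pred: &Predicate, out: &mut BTreeSet<String>) {
    match pred {
        Predicate::Eq { column, .. } | Predicate::In { column, .. } => {
            out.insert(column.clone());
        }
        // A bloom filter cannot answer range questions, so skip the column.
        Predicate::LessThan { .. } => {}
        Predicate::And(left, right) | Predicate::Or(left, right) => {
            collect(left, out);
            collect(right, out);
        }
    }
}

/// Columns whose bloom filters are worth the extra I/O for this predicate.
fn bloom_filter_columns(pred: &Predicate) -> BTreeSet<String> {
    let mut out = BTreeSet::new();
    collect(pred, &mut out);
    out
}
```

Restricting the set this way keeps the extra I/O proportional to the columns that can actually benefit.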

Are these changes tested?

  • Unit tests covering the bloom filter evaluator: eq/in present/absent, AND/OR/NOT logic, all decimal physical types (INT32, INT64, FIXED_LEN_BYTE_ARRAY), negative values, missing bloom filters, etc.
  • Integration tests writing multi-row-group Parquet files with bloom filters enabled and verifying end-to-end row group pruning
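The AND/OR/NOT cases listed above can be sketched as a conservative "might match" evaluation. This is an assumption about how such an evaluator could behave, not a copy of the PR's code: `Pred`, `RowGroupFilters`, and `row_group_might_match` are hypothetical names, and an exact `HashSet` stands in for a real bloom filter so the behaviour is deterministic.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical predicate tree for illustration.
enum Pred {
    Eq(&'static str, i64),
    In(&'static str, Vec<i64>),
    /// Any operator a bloom filter cannot answer (e.g. a range comparison).
    Range(&'static str),
    And(Box<Pred>, Box<Pred>),
    Or(Box<Pred>, Box<Pred>),
    Not(Box<Pred>),
}

/// Per-column membership structure for one row group.
/// A real bloom filter may also return false positives; this stand-in does not.
type RowGroupFilters = HashMap<&'static str, HashSet<i64>>;

/// A column with no filter can never be used to prune, so it returns `true`.
fn might_contain(filters: &RowGroupFilters, column: &str, value: i64) -> bool {
    match filters.get(column) {
        Some(set) => set.contains(&value),
        None => true,
    }
}

/// `false` means the row group can be skipped; `true` means it must be read.
fn row_group_might_match(pred: &Pred, filters: &RowGroupFilters) -> bool {
    match pred {
        Pred::Eq(col, v) => might_contain(filters, col, *v),
        Pred::In(col, vs) => vs.iter().any(|v| might_contain(filters, col, *v)),
        Pred::Range(_) => true,
        Pred::And(l, r) => {
            row_group_might_match(l, filters) && row_group_might_match(r, filters)
        }
        Pred::Or(l, r) => {
            row_group_might_match(l, filters) || row_group_might_match(r, filters)
        }
        // A bloom filter can prove absence but never presence, so a negated
        // predicate cannot be used to prune in this sketch.
        Pred::Not(_) => true,
    }
}
```

Treating NOT and unsupported operators as "might match" keeps the pruning sound: a row group is only skipped when the filter proves that no row can satisfy the predicate.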

/// against them to filter out row groups that definitely don't match.
async fn filter_row_groups_by_bloom_filter(
    predicate: &crate::expr::BoundPredicate,
    builder: &mut ParquetRecordBatchStreamBuilder<ArrowFileReader>,
Contributor Author

The mut reference is needed because get_row_group_column_bloom_filter requires it.

CTTY pushed a commit that referenced this pull request May 6, 2026
…-endian (#2397)

## Which issue does this PR close?


- Closes #.

Found this whilst working on #2398.

## What changes are included in this PR?

The [spec](https://iceberg.apache.org/spec/#binary-single-value-serialization)
says `Int128` and `UInt128` are big-endian, not little-endian, and we are
indeed already using big-endian
[here](https://github.com/apache/iceberg-rust/blob/c1538de36dd53e491299b62ad89286f2db496bc7/crates/iceberg/src/arrow/schema.rs#L761),
for example. I think only the doc string needs correcting.
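
For readers unfamiliar with the distinction, a small standard-library sketch of what big-endian versus little-endian means for a 128-bit integer. The helper names `int128_to_be_bytes` and `int128_from_be_bytes` are hypothetical and only wrap `i128::to_be_bytes` and `i128::from_be_bytes`; they are not functions from the crate.

```rust
/// Serialize a 128-bit integer with the most significant byte first.
fn int128_to_be_bytes(value: i128) -> [u8; 16] {
    value.to_be_bytes()
}

/// Inverse of `int128_to_be_bytes`.
fn int128_from_be_bytes(bytes: [u8; 16]) -> i128 {
    i128::from_be_bytes(bytes)
}

/// Show where the low-order byte ends up under each byte order.
fn low_byte_positions(value: i128) -> (u8, u8) {
    let be = value.to_be_bytes();
    let le = value.to_le_bytes();
    // Big-endian keeps the low-order byte last; little-endian keeps it first.
    (be[15], le[0])
}
```

Both orders carry the same information, so a mismatch between the documented and the actual order only shows up when another reader or writer relies on the documentation.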

## Are these changes tested?
