@ForeverAngry
Contributor

Closes #2649

Rationale for this change

Add support for using bloom filters in the read path of PyIceberg.

Are these changes tested?

Yes.

Are there any user-facing changes?

I don't think so.

Contributor

@Fokko Fokko left a comment

Thanks @ForeverAngry for working on this. Since this changes the specification, we have to go through an Iceberg improvement proposal to ensure that there is consensus across different implementations.
As part of the change process, my main question would be: what's the added value on top of the bloom filters that are embedded in the Parquet files?

Comment on lines +293 to +299
    NestedField(
        field_id=146,
        name="bloom_filter_bytes",
        field_type=MapType(key_id=147, key_type=IntegerType(), value_id=148, value_type=BinaryType()),
        required=False,
        doc="Map of column id to bloom filter",
    ),
Contributor

We cannot just add a field; this requires a spec change: https://iceberg.apache.org/contribute/#apache-iceberg-improvement-proposals

@ForeverAngry
Contributor Author

@Fokko I was kinda thinking that when I submitted it. But it was already done, so I figured I'd just send it to see if it sparked any interest.

That's good information, though; sometimes I forget about the governance structures that exist for these projects.

@ForeverAngry
Contributor Author

> Thanks @ForeverAngry for working on this. Since this changes the specification, we have to go through an Iceberg improvement proposal to ensure that there is consensus across different implementations. As part of the change process, my main question would be: what's the added value on top of the bloom filters that are embedded in the Parquet files?

I guess, to me, the main benefit would be the ability to do file-level pruning before opening any files.

As a result, this would also come with some secondary benefits like being able to do row group-level pruning within a Parquet file after opening it.
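
To make the file-level pruning idea a bit more concrete, here is a rough sketch of what I mean. It assumes each data file entry carries a map of column id to serialized bloom filter bits, similar in spirit to the `bloom_filter_bytes` field in this PR. `ToyBloom`, `DataFileEntry`, and `prune_files` are hypothetical names made up for illustration; they are not PyIceberg APIs, and the toy filter is not the Parquet split-block bloom filter format.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional


class ToyBloom:
    """Tiny bloom filter for illustration only (NOT Parquet's split-block format)."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3,
                 bits: Optional[bytes] = None) -> None:
        # Either start empty or rebuild the filter from serialized bytes.
        self.bits = bytearray(bits) if bits is not None else bytearray(num_bits // 8)
        self.num_bits = len(self.bits) * 8
        self.num_hashes = num_hashes

    def _positions(self, value: bytes) -> Iterator[int]:
        # Derive num_hashes bit positions by salting SHA-256 with a seed byte.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + value).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value: bytes) -> None:
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value: bytes) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(value))


@dataclass
class DataFileEntry:
    """Hypothetical stand-in for a manifest entry with per-column bloom filters."""
    path: str
    bloom_filter_bytes: Dict[int, bytes] = field(default_factory=dict)


def prune_files(entries: List[DataFileEntry], column_id: int, value: bytes) -> List[str]:
    """Return paths that may contain `value`, without opening any data file."""
    kept = []
    for entry in entries:
        raw = entry.bloom_filter_bytes.get(column_id)
        # No filter for this column: the file cannot be ruled out, so keep it.
        if raw is None or ToyBloom(bits=raw).might_contain(value):
            kept.append(entry.path)
    return kept


bloom_a = ToyBloom()
bloom_a.add(b"alice")

entries = [
    DataFileEntry("a.parquet", {1: bytes(bloom_a.bits)}),  # filter contains "alice"
    DataFileEntry("empty.parquet", {1: bytes(128)}),        # all-zero filter
    DataFileEntry("c.parquet"),                             # no filter recorded
]
candidates = prune_files(entries, column_id=1, value=b"alice")
print(candidates)  # ['a.parquet', 'c.parquet']
```

The point is that `empty.parquet` is skipped purely from metadata, while a file without a filter still has to be read. A bloom filter can return false positives, so a kept file is only a candidate, never a guaranteed match.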

Development

Successfully merging this pull request may close these issues.

Add read support for parquet bloom filters