GH-49927: [Python][Parquet] Expose bloom_filter_offset and bloom_filter_length to Python in column chunk metadata#49926

Open
haziqishere wants to merge 4 commits into apache:main from haziqishere:fix/display-bloom_filter_offset-in-ColumnChunkMetaData

@haziqishere commented May 5, 2026

Rationale for this change

The ColumnChunkMetaData.to_dict() method omits bloom_filter_offset and bloom_filter_length even when a bloom filter has been written to the Parquet file. As a result, users cannot verify programmatically through the Python metadata API that a bloom filter is present; the only workaround is to compare file sizes.

What changes are included in this PR?

  1. python/pyarrow/includes/libparquet.pxd: Declare bloom_filter_offset() and bloom_filter_length() (both optional[int64_t]) on CColumnChunkMetaData. This exposes the existing C++ methods to Cython.
  2. python/pyarrow/_parquet.pyx: Add bloom_filter_offset and bloom_filter_length properties to ColumnChunkMetaData (each returns an int when set, None otherwise). Add both fields to to_dict() and __repr__.
  3. python/pyarrow/tests/parquet/test_metadata.py: Add test_bloom_filter_offset_in_metadata verifying that columns with a bloom filter expose non-None integer values and that to_dict() contains the keys, while columns without a bloom filter return None.
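
The behaviour described in item 2 can be sketched in plain Python. This is not the Cython code from the PR: `ColumnChunkSketch` is a hypothetical stand-in class that only mirrors the described contract (an int when set, None otherwise, with both keys always present in `to_dict()`).

```python
from typing import Optional


class ColumnChunkSketch:
    """Hypothetical pure-Python stand-in for ColumnChunkMetaData (illustration only)."""

    def __init__(self, path_in_schema: str,
                 bloom_filter_offset: Optional[int] = None,
                 bloom_filter_length: Optional[int] = None):
        self.path_in_schema = path_in_schema
        self._bloom_filter_offset = bloom_filter_offset
        self._bloom_filter_length = bloom_filter_length

    @property
    def bloom_filter_offset(self) -> Optional[int]:
        # Mirrors an optional C++ value: an int when set, None otherwise.
        return self._bloom_filter_offset

    @property
    def bloom_filter_length(self) -> Optional[int]:
        return self._bloom_filter_length

    def to_dict(self) -> dict:
        # Both keys are always present, even when the values are None.
        return {
            "path_in_schema": self.path_in_schema,
            "bloom_filter_offset": self.bloom_filter_offset,
            "bloom_filter_length": self.bloom_filter_length,
        }

    def __repr__(self) -> str:
        return (
            f"<ColumnChunkSketch path_in_schema={self.path_in_schema!r} "
            f"bloom_filter_offset={self.bloom_filter_offset} "
            f"bloom_filter_length={self.bloom_filter_length}>"
        )


with_filter = ColumnChunkSketch("a", 10699, 1040)
without_filter = ColumnChunkSketch("b")
print(with_filter)
print(without_filter.to_dict())
```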

Are these changes tested?

Yes. test_bloom_filter_offset_in_metadata in test_metadata.py covers:

  • Column with bloom filter: bloom_filter_offset and bloom_filter_length are non-None integers
  • Column without bloom filter: both return None
  • Both keys present in to_dict() output
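
The three checks above can be sketched as a small helper over to_dict()-style mappings. The helper name `check_bloom_filter_fields` is hypothetical and this is not the test code from the PR; the sample values are taken from the output reported in this description.

```python
def check_bloom_filter_fields(column_dict, expect_filter):
    """Assert the bloom filter keys of a to_dict()-style mapping behave as described."""
    for key in ("bloom_filter_offset", "bloom_filter_length"):
        # Both keys must be present whether or not a filter was written.
        assert key in column_dict, f"missing key: {key}"
        value = column_dict[key]
        if expect_filter:
            # bool is a subclass of int, so exclude it explicitly.
            assert isinstance(value, int) and not isinstance(value, bool), (
                f"{key} should be an int, got {value!r}"
            )
        else:
            assert value is None, f"{key} should be None, got {value!r}"
    return True


# Abbreviated mappings using the values reported for col_a and col_b.
col_a = {"path_in_schema": "a", "bloom_filter_offset": 10699, "bloom_filter_length": 1040}
col_b = {"path_in_schema": "b", "bloom_filter_offset": None, "bloom_filter_length": None}

print(check_bloom_filter_fields(col_a, expect_filter=True))   # True
print(check_bloom_filter_fields(col_b, expect_filter=False))  # True
```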

Here is a closer look at the output of the test logic:

col_a bloom_filter_offset: 10699
col_a bloom_filter_length: 1040
col_b bloom_filter_offset: None
col_b bloom_filter_length: None

col_a to_dict(): {'file_offset': 0, 'file_path': '', 'physical_type': 'BYTE_ARRAY', 'num_values': 1000, 'path_in_schema': 'a', 'is_stats_set': True, 'statistics': {'has_min_max': True, 'min': 'id_0', 'max': 'id_999', 'null_count': 0, 'distinct_count': None, 'num_values': 1000, 'physical_type': 'BYTE_ARRAY'}, 'geo_statistics': None, 'compression': 'SNAPPY', 'encodings': ('PLAIN', 'RLE', 'RLE_DICTIONARY'), 'has_dictionary_page': True, 'dictionary_page_offset': 4, 'data_page_offset': 4035, 'total_compressed_size': 5336, 'total_uncompressed_size': 11208, 'bloom_filter_offset': 10699, 'bloom_filter_length': 1040}
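
As a possible follow-up use of these two values, the sketch below reads the raw bloom filter bytes from a file. It assumes the Parquet format's definitions: the offset is counted from the beginning of the file and the length covers the serialized header plus the filter data. The function name `read_bloom_filter_bytes` is hypothetical, and the demo runs on a synthetic file rather than a real Parquet file. Decoding the header is out of scope here.

```python
import os
import tempfile


def read_bloom_filter_bytes(path, offset, length):
    """Return the raw bytes in [offset, offset + length), or None if no filter is reported."""
    if offset is None or length is None:
        return None
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read(length)
    if len(data) != length:
        raise ValueError("file is shorter than offset + length")
    return data


# Synthetic demo file: 100 padding bytes, a 5-byte marker, then more padding.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x00" * 100 + b"BLOOM" + b"\x00" * 20)
    demo_path = tmp.name

demo_bytes = read_bloom_filter_bytes(demo_path, 100, 5)
os.remove(demo_path)
print(demo_bytes)  # b'BLOOM'
```

For the col_a output above, the corresponding call would use offset 10699 and length 1040 on the written Parquet file.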

@haziqishere requested review from AlenkaF, raulcd and rok as code owners May 5, 2026 16:31
@github-actions bot added the "awaiting review" label May 5, 2026
github-actions Bot commented May 5, 2026

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

@haziqishere changed the title from "Fix/display bloom filter offset in column chunk metadata" to "GH-49927: [C++] [Parquet] Fix/display bloom filter offset in column chunk metadata" May 5, 2026
github-actions Bot commented May 5, 2026

⚠️ GitHub issue #49927 has been automatically assigned in GitHub to PR creator.

@raulcd (Member) left a comment

This seems reasonable to me. Could you update the title to something like the following, to make it clear that this is about exposing this to Python and not about a fix in C++?
GH-49927: [Python][Parquet] Expose bloom_filter_offset and bloom_filter_length to Python in column chunk metadata

Comment thread python/pyarrow/tests/parquet/test_metadata.py Outdated
@github-actions bot added the "awaiting changes" label and removed the "awaiting review" label May 6, 2026
@haziqishere changed the title from "GH-49927: [C++] [Parquet] Fix/display bloom filter offset in column chunk metadata" to "GH-49927: [Python][Parquet] Expose bloom_filter_offset and bloom_filter_length to Python in column chunk metadata" May 6, 2026
@github-actions bot added the "awaiting change review" label and removed the "awaiting changes" label May 6, 2026
@haziqishere (Author) commented

@raulcd thanks for the review and the feedback. I've updated the items accordingly 😄

@raulcd (Member) left a comment

Exposing this to PyArrow looks good to me.
@mapleFU @pitrou do you want to take a look?

@github-actions bot added the "awaiting merge" label and removed the "awaiting change review" label May 6, 2026