feat: Support Parquet writer options #1123
Which issue does this PR close?
N/A.
Rationale for this change
Supporting all Parquet writer options gives us more flexibility when creating data directly from datafusion-python. For consistency, this PR supports all writer options defined by ParquetOptions in datafusion, using the same defaults: https://github.com/apache/datafusion/blob/555fc2e24dd669e44ac23a9a1d8406f4ac58a9ed/datafusion/common/src/config.rs#L423

What changes are included in this PR?
write_parquet now accepts all writer options, including column-specific options. (Since pyarrow does not expose page-level information, some options could not be tested directly, such as enabling bloom filters; an external tool confirmed that this option works. For this specific case, there is a test that compares file sizes.)

Are there any user-facing changes?
The main difference relates to the existing compression field, which now takes a str, as datafusion does, instead of a custom enum. The main advantage is that future compression algorithms will not require updating the Python-side code. Additionally, the default compression was changed from zstd(4) to zstd(3), matching datafusion's default.
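Because compression is now a plain str, specs like "zstd(3)" can be forwarded to datafusion as-is. For illustration only, here is a hypothetical, stdlib-only sketch of how such a spec can be split into a codec name and an optional level (the real parsing happens inside datafusion; `parse_compression` is an invented name, not part of this PR):

```python
import re


def parse_compression(spec: str):
    """Split e.g. 'zstd(3)' into ('zstd', 3); a bare 'snappy' has no level."""
    m = re.fullmatch(r"(\w+)(?:\((\d+)\))?", spec.strip())
    if m is None:
        raise ValueError(f"invalid compression spec: {spec!r}")
    codec, level = m.group(1).lower(), m.group(2)
    return codec, int(level) if level is not None else None


print(parse_compression("zstd(3)"))  # ('zstd', 3)
print(parse_compression("snappy"))   # ('snappy', None)
```

Passing the string through unchanged is what makes the feature forward-compatible: a codec added on the Rust side works from Python without touching this binding.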