Skip to content

Support multi-threaded writing of Parquet files with modular encryption #7818

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

rok
Copy link
Member

@rok rok commented Jun 29, 2025

Closes #7359.

Rationale for this change

This is to enable concurrent column writing with encryption downstream (e.g. with datafusion). See #7359 for more.
See https://github.com/apache/arrow-rs/pull/7111/files#r2015196618

What changes are included in this PR?

  • ArrowRowGroupWriterFactory.create_row_group_writer now passes a FileEncryptor to ArrowColumnWriterFactory that can later be used to write columns concurrently.
  • Several methods and properties are now pub.
  • Minor change to how encryption tests read test data.

Are these changes tested?

Yes.

Are there any user-facing changes?

ArrowRowGroupWriterFactory, ArrowRowGroupWriter, ArrowRowGroupWriterFactory::new, ArrowRowGroupWriterFactory.create_row_group_writer , .. are now pub. Shouldn't be breaking but perhaps needs review.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jun 29, 2025
@rok rok force-pushed the multi-threaded_encrypted_writing branch from ee38fb5 to cd5196f Compare June 29, 2025 00:03
@rok rok requested a review from adamreeve June 29, 2025 00:04
@rok rok force-pushed the multi-threaded_encrypted_writing branch from cd5196f to c2ac2d9 Compare June 29, 2025 10:41
Copy link
Contributor

@adamreeve adamreeve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I like the approach of exposing the ArrowRowGroupWriterFactory to handle passing through the file encryption properties. We should probably also update the documentation on ArrowColumnWriter to encourage users to use this API, and/or add documentation to get_column_writers to encourage users to use this approach for compatibility with encryption.

There seem to be a lot of extra things that have been exposed publicly in this PR that don't need to be though, which makes it hard to follow what changes are actually needed. We should minimize how much is public so it's easier for users to follow documentation and see which APIs they should be using, and so that internal details can be changed later.

@rok rok force-pushed the multi-threaded_encrypted_writing branch 2 times, most recently from 86ddf45 to b35d078 Compare June 30, 2025 10:37
@rok rok force-pushed the multi-threaded_encrypted_writing branch from b35d078 to 2d42ca0 Compare June 30, 2025 10:42
@rok
Copy link
Member Author

rok commented Jun 30, 2025

Thanks for the quick review @adamreeve !
I've moved schemas and writer properties to ArrowRowGroupWriterFactory as you suggested.
Test also now uses tokio channels.

@rok rok force-pushed the multi-threaded_encrypted_writing branch from bc293de to 4009f42 Compare June 30, 2025 12:34
@rok rok force-pushed the multi-threaded_encrypted_writing branch from 4009f42 to b57a5d4 Compare June 30, 2025 12:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support multi-threaded writing of Parquet files with modular encryption
2 participants