Catalog file storage documentation & optimization #2086

Open
dodofarm opened this issue Dec 4, 2024 · 0 comments
Labels
enhancement New feature or request


dodofarm commented Dec 4, 2024

Feature Request

As discussed on Discord, it would be beneficial to document the optimal way of splitting the various types of data across .parquet files in a catalog.

Right now write_data just creates one big file for all the instruments passed to the method. According to the Discord discussion this is not optimal, and we should at least split the instruments across separate parquet files.

We should document the optimal way of splitting the data for a single instrument across multiple parquet files, based on various criteria. For example, if tick-level data for one instrument is written to the catalog, it should be split into multiple parquet files for that instrument, with no file exceeding 10MB (this is just an arbitrary value used for illustration). Another example: when storing 1s bars, each parquet file should hold at most 24 hours of data.
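To make the two criteria above concrete, here is a minimal sketch of how a split could be planned: one group per instrument and UTC day, with a further cap per file. It is pure Python and does not touch the catalog. The function name `plan_chunks`, the `(instrument_id, ts_ns, payload)` record shape, and the use of a row count as a stand-in for a file-size limit are all assumptions made for illustration, not existing API.

```python
from collections import defaultdict
from datetime import datetime, timezone

NANOS_PER_SECOND = 1_000_000_000


def plan_chunks(records, max_rows_per_file=100_000):
    """Plan which records go into which file (hypothetical helper).

    records: iterable of (instrument_id, ts_ns, payload) tuples, where
        ts_ns is a UNIX timestamp in nanoseconds (assumed record shape).
    max_rows_per_file: row cap used here as a rough proxy for a size
        limit such as the 10MB example; the right value is an open question.

    Returns a dict mapping (instrument_id, "YYYY-MM-DD", part_index)
    to the list of records planned for that file.
    """
    # First split by instrument and by UTC calendar day (24 hours per file).
    by_key = defaultdict(list)
    for instrument_id, ts_ns, payload in records:
        day = datetime.fromtimestamp(
            ts_ns // NANOS_PER_SECOND, tz=timezone.utc
        ).strftime("%Y-%m-%d")
        by_key[(instrument_id, day)].append((instrument_id, ts_ns, payload))

    # Then cap each group, so dense tick data spills into several parts.
    chunks = {}
    for (instrument_id, day), rows in by_key.items():
        rows.sort(key=lambda r: r[1])  # keep each file time-ordered
        for part, start in enumerate(range(0, len(rows), max_rows_per_file)):
            chunks[(instrument_id, day, part)] = rows[start:start + max_rows_per_file]
    return chunks
```

Each key in the returned dict would correspond to one parquet file. A real implementation would probably cap on estimated bytes rather than rows, and would pick the time window per data type (ticks vs. 1s bars), which is exactly what the documentation should spell out.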

Additionally, to make it easier for users to follow the best practices, we could add a parameter to write_data (e.g. optimize_parquet_files) that does the heavy lifting for them and automatically splits the data across multiple files in the best way possible. I'm not sure whether this parameter should default to True, but giving users the option to enable the optimization would be a good start.
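A rough sketch of what the opt-in behaviour could look like is below. `write_data_sketch` is a hypothetical stand-in and not the real write_data signature; the `write_file` callback, the file names, and the `(instrument_id, row)` record shape are likewise assumptions. It only splits per instrument, which is the minimum suggested above.

```python
from collections import defaultdict


def write_data_sketch(records, write_file, optimize_parquet_files=False):
    """Hypothetical stand-in for a catalog write method.

    records: iterable of (instrument_id, row) pairs (assumed shape).
    write_file: callable(file_name, rows) that persists one file.
    optimize_parquet_files: proposed opt-in flag; off by default here.

    Returns the list of file names written.
    """
    records = list(records)
    if not records:
        return []

    if not optimize_parquet_files:
        # Behaviour described in this issue: everything lands in one file.
        name = "all-instruments.parquet"
        write_file(name, [row for _, row in records])
        return [name]

    # Opt-in path: one file per instrument.
    by_instrument = defaultdict(list)
    for instrument_id, row in records:
        by_instrument[instrument_id].append(row)

    written = []
    for instrument_id in sorted(by_instrument):
        name = f"{instrument_id}.parquet"
        write_file(name, by_instrument[instrument_id])
        written.append(name)
    return written
```

Keeping the flag off by default preserves the current behaviour for existing users, so the default could be revisited once the recommended splitting rules are documented.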

@dodofarm dodofarm added the enhancement New feature or request label Dec 4, 2024