Feature Request

As discussed on Discord, it might be beneficial to document the optimal way of splitting various types of data across `.parquet` files in a catalog.
Right now `write_data` just creates one big file for all the instruments passed to the method, which, per the above, is not optimal; at a minimum, the different instruments should be split across multiple parquet files (see the sketch below).
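For illustration, a minimal sketch of what per-instrument splitting could look like on the caller's side today. This assumes `write_data` accepts a list of data objects and that each object exposes an `instrument_id` attribute (as nautilus_trader tick and bar types do); the grouping helper itself is hypothetical, not part of the library:

```python
from collections import defaultdict


def write_data_per_instrument(catalog, data):
    """Group data objects by instrument and write each group separately.

    Assumes `catalog` is a ParquetDataCatalog and each item in `data`
    exposes an `instrument_id` attribute (e.g. trade ticks, quote ticks).
    """
    groups = defaultdict(list)
    for item in data:
        groups[item.instrument_id].append(item)
    for instrument_id, items in groups.items():
        # One write_data call per instrument, so each instrument ends up
        # in its own parquet file(s) rather than one combined file.
        catalog.write_data(items)
```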
We should also document the optimal way of splitting data for a single instrument across multiple parquet files based on various criteria. E.g. if tick-level data for one instrument is written to the catalog, it should be split so that multiple parquet files are created for that one instrument, with each file staying under 10 MB (just a placeholder value for the sake of example). Another example: if storing 1s bars, each parquet file should hold at most 24 hours of data, and so on.
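As a sketch of the time-based criterion, here is one way data for a single instrument could be split into 24-hour chunks before writing. `ts_event` (nanoseconds since the Unix epoch) is a real attribute on nautilus_trader data objects, but the chunking function itself is only an illustration, and the 24-hour window is the placeholder limit proposed above:

```python
NANOS_PER_DAY = 24 * 60 * 60 * 1_000_000_000


def chunk_by_day(data):
    """Yield consecutive chunks of data objects, one per 24-hour window.

    Assumes `data` is sorted ascending by `ts_event` (nanoseconds since
    the Unix epoch), as catalog data normally is.
    """
    chunk = []
    window_start = None
    for item in data:
        if window_start is None:
            window_start = item.ts_event
        if item.ts_event - window_start >= NANOS_PER_DAY:
            yield chunk
            chunk = []
            window_start = item.ts_event
        chunk.append(item)
    if chunk:
        yield chunk


# One file per 24-hour window of 1s bars:
# for chunk in chunk_by_day(bars):
#     catalog.write_data(chunk)
```

A size-based criterion (e.g. the 10 MB example above) could be implemented the same way, cutting a new chunk whenever the serialized size of the current one crosses the threshold.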
Additionally, to make it easier for users to follow these best practices, we could add a function parameter for `write_data` (e.g. `optimize_parquet_files`) that does the heavy lifting for them and automatically splits the data across multiple files in the best way possible. I'm not sure whether defaulting this parameter to `True` is the best option, but at least giving users the option to enable the optimization would be a good start.
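A rough sketch of how the proposed parameter could be wired in, combining the per-instrument and per-chunk splitting above. Everything here is hypothetical: `optimize_parquet_files` does not exist yet, and `group_by_instrument` / `_write_single_file` stand in for whatever grouping and writing internals the catalog actually uses:

```python
def write_data(self, data, optimize_parquet_files=False):
    """Hypothetical signature; `optimize_parquet_files` is the proposed flag.

    When enabled, split `data` per instrument and then into bounded
    chunks (by time window or size) instead of writing one big file.
    """
    if not optimize_parquet_files:
        self._write_single_file(data)  # current behavior: one big file
        return
    for instrument_id, items in group_by_instrument(data).items():
        for chunk in chunk_by_day(items):  # or a size-based limit
            self._write_single_file(chunk)
```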