a freshly collected contracts dataset out of cryo is 15.53 gb. if you consolidate 17,920 files into 17 files, it would become 7.88 gb, providing about 2x savings on storage (15.53 / 7.88 ≈ 1.97) and 3x improvement in query performance.
i propose to add an option for cryo to incrementally consolidate the collected datasets, merging parts as soon as they form a larger chunk that won't require any further rewriting.
an example of how it could work with --align option:
0-17,000,000 block range is consolidated into files of 1,000,000 blocks each
17,000,000-17,900,000 range is consolidated into files of 100,000 blocks each
17,900,000-17,920,000 range is consolidated into files of 10,000 blocks each
17,920,000-17,924,000 range is kept as collected with 1,000 blocks in each file
if we run again after block 17,930,000, blocks 17,920,000-17,930,000 would be consolidated into a bigger file
the same would happen after block 18,000,000 with the chunks for blocks 17,900,000-18,000,000
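the layout above can be sketched as a greedy partition. this is only an illustration of the idea, not cryo code: the tier sizes, the half-open `(start, end)` ranges, and the names `target_layout` and `plan_merges` are all assumptions made for the example, and it assumes existing files never overlap.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # half-open block range [start, end)

# hypothetical tier sizes, largest first; each is a multiple of the next
TIERS = [1_000_000, 100_000, 10_000, 1_000]


def target_layout(head: int, tiers: List[int] = TIERS) -> List[Range]:
    """Cover [0, head) with the largest aligned chunks that are already complete."""
    layout = []
    start = 0
    for size in tiers:
        # start is always a multiple of size here, so every chunk stays aligned
        while start + size <= head:
            layout.append((start, start + size))
            start += size
    return layout


def plan_merges(
    existing: List[Range], head: int, tiers: List[int] = TIERS
) -> List[Tuple[Range, List[Range]]]:
    """List (target, sources) pairs for target files that do not exist yet."""
    have = set(existing)
    plan = []
    for target in target_layout(head, tiers):
        if target in have:
            continue  # already consolidated, never rewritten again
        sources = [r for r in existing if target[0] <= r[0] and r[1] <= target[1]]
        covered = sum(end - start for start, end in sources)
        # only merge when the existing files fully cover the target range
        if covered == target[1] - target[0]:
            plan.append((target, sources))
    return plan


if __name__ == "__main__":
    layout = target_layout(17_924_000)
    # six more 1,000-block files arrive, then the job runs again
    fresh = [(b, b + 1_000) for b in range(17_924_000, 17_930_000, 1_000)]
    for target, sources in plan_merges(layout + fresh, 17_930_000):
        print(target, "<-", len(sources), "files")
```

with a head of 17,924,000 this gives the four tiers from the example, and re-running at 17,930,000 plans a single merge of the ten 1,000-block files into 17,920,000-17,930,000 while leaving every other file untouched. note that in this sketch, crossing 18,000,000 goes straight to one 1,000,000-block file rather than through an intermediate 100,000-block file.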
adopting this approach would allow a set-and-forget cron job with cryo <dataset> --align --consolidate for a researcher to always come back to a fresh and performant dataset.
banteg changed the title from "an option to consolidate a dataset" to "add option to consolidate a dataset" on Aug 16, 2023