add option to consolidate a dataset #29

banteg · 2023-08-16T01:24:00Z

a freshly collected contracts dataset out of cryo is 15.53 gb. if you consolidate 17,920 files into 17 files, it would become 7.88 gb, providing 2x savings on storage and 3x improvement in query performance.

i propose to add an option for cryo to incrementally consolidate the collected datasets, merging parts as soon as they form a larger chunk that won't require any further rewriting.

an example of how it could work with the `--align` option:

- the 0-17,000,000 block range is consolidated into files of 1,000,000 blocks each
- the 17,000,000-17,900,000 range is consolidated into files of 100,000 blocks each
- the 17,900,000-17,920,000 range is consolidated into files of 10,000 blocks each
- the 17,920,000-17,924,000 range is kept as collected, with 1,000 blocks in each file
- if we run again after block 17,930,000, blocks 17,920,000-17,930,000 would be consolidated into a bigger file
- the same would happen at block 18,000,000 with the chunks for blocks 17,900,000-18,000,000
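
a minimal python sketch of the layout described above, to make the rule concrete. it is not cryo's implementation: `plan_chunks`, `summarize`, and `CHUNK_SIZES` are hypothetical names, and the sketch assumes half-open block ranges and chunk sizes that each divide the next larger one. it only plans the target file ranges and does not read or write any files.

```python
from collections import Counter

# Hypothetical chunk sizes, largest first; each must divide the previous one.
CHUNK_SIZES = (1_000_000, 100_000, 10_000, 1_000)


def plan_chunks(end_block, sizes=CHUNK_SIZES):
    """Return half-open (start, end) block ranges for the consolidated files.

    Blocks past the last full chunk of the smallest size are left out.
    """
    chunks = []
    start = 0
    for size in sizes:
        # Emit as many full chunks of this size as fit before end_block.
        # `start` stays aligned because every size divides the larger ones.
        while start + size <= end_block:
            chunks.append((start, start + size))
            start += size
    return chunks


def summarize(chunks):
    """Count how many files of each size a plan contains."""
    return Counter(end - start for start, end in chunks)


if __name__ == "__main__":
    for size, count in sorted(summarize(plan_chunks(17_924_000)).items(), reverse=True):
        print(f"{count} files of {size:,} blocks")
```

because the plan depends only on the latest collected block, re-running it at a later block gives the new target layout, and any file whose range is already in the plan never needs to be rewritten.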

adopting this approach would allow a set-and-forget cron job with `cryo <dataset> --align --consolidate` for a researcher to always come back to a fresh and performant dataset.

banteg · 2023-08-24T12:44:14Z

implemented the logic i described here: https://github.com/banteg/cryogen

banteg changed the title ~~an option to consolidate a dataset~~ add option to consolidate a dataset Aug 16, 2023
