Using compression on source code repositories clones? #1

Open
rolinh opened this issue Jun 3, 2015 · 0 comments

rolinh commented Jun 3, 2015

We note that 500,000 git source code repositories take up ~10TB of storage space as tar archives. We currently have metadata for more than 3.5 million repositories in the database and, once we are done importing more data from the GHTorrent project, we will probably have around 10 million repositories (and maybe even more).

Assuming these 500k repositories give a good approximation of the average repository size (about 20MB each), it is probably relatively accurate to state that 10 million repositories would require ~200TB of storage space.
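For reference, the extrapolation is a plain linear scaling of the sample. A minimal sketch, using only the figures quoted above; the function and constant names are made up for illustration, and 1 TB is taken as 10^6 MB:

```python
# Linear extrapolation of storage needs from the sample quoted above.
SAMPLE_REPOS = 500_000       # repositories measured
SAMPLE_SIZE_TB = 10          # approximate size of the sample as tar archives
TARGET_REPOS = 10_000_000    # expected number of repositories


def estimate_storage_tb(target_repos,
                        sample_repos=SAMPLE_REPOS,
                        sample_size_tb=SAMPLE_SIZE_TB):
    """Scale the sample size linearly to the target repository count."""
    return target_repos * sample_size_tb / sample_repos


# Average size of one repository, assuming 1 TB = 10^6 MB.
avg_mb_per_repo = SAMPLE_SIZE_TB * 1_000_000 / SAMPLE_REPOS
total_tb = estimate_storage_tb(TARGET_REPOS)

print(f"average per repository: ~{avg_mb_per_repo:.0f} MB")
print(f"estimated total: ~{total_tb:.0f} TB")
```

The estimate only holds if the repositories still to be imported have roughly the same size distribution as the sample.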

Hence, using compression on source code repositories seems to be a good idea with regard to storage space. However, I am worried that this might introduce too much overhead when processing data: crawld will have to uncompress + untar and then tar + compress again for the update operation, for instance, and other tools will be affected as well (repotool and srctool, hence mainly the language parsers).
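To make the overhead concrete, here is a rough sketch of what one update cycle on a compressed clone could look like. It is not crawld's actual code: the function name and the `update` callback are hypothetical, and gzip (via Python's `tarfile`) is only a stand-in for whichever algorithm would be chosen.

```python
import tarfile
import tempfile
from pathlib import Path


def update_compressed_clone(archive, update):
    """Unpack a .tar.gz clone, apply update(workdir), and repack it in place.

    `update` is a hypothetical callback that receives the directory holding
    the unpacked archive contents (for instance to run a fetch in the clone).
    """
    archive = Path(archive)
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)

        # Step 1: uncompress + untar the whole repository.
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(workdir)

        # Step 2: the actual update work on the unpacked clone.
        update(workdir)

        # Step 3: tar + compress everything again, even if little changed.
        with tarfile.open(archive, "w:gz") as tar:
            for entry in sorted(workdir.iterdir()):
                tar.add(entry, arcname=entry.name)
```

Steps 1 and 3 are paid in full on every update regardless of how small the change is, which is the overhead in question.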

If we ever use compression, we should aim for an algorithm that is fast at both compressing and decompressing, at the cost of compression ratio if necessary. Note that only crawld would compress data; all the other tools would only decompress, hence decompression speed is probably more important. Snappy or Zstandard are probably good candidates.
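Before settling on an algorithm, it would be worth measuring ratio, compression time and decompression time on real repository data. A minimal benchmarking sketch follows. Snappy and Zstandard are not in the Python standard library, so zlib, bz2 and lzma are used here purely as stand-ins; the harness, the codec table and the synthetic sample are all assumptions for illustration, and real tar archives should replace the sample.

```python
import bz2
import lzma
import time
import zlib


def benchmark(compress, decompress, data):
    """Return (ratio, compress_seconds, decompress_seconds) for one codec."""
    start = time.perf_counter()
    packed = compress(data)
    compress_s = time.perf_counter() - start

    start = time.perf_counter()
    restored = decompress(packed)
    decompress_s = time.perf_counter() - start

    if restored != data:
        raise ValueError("round trip did not restore the original bytes")
    return len(data) / len(packed), compress_s, decompress_s


# Stand-in codecs from the standard library (not Snappy or Zstandard).
CODECS = {
    "zlib-1": (lambda d: zlib.compress(d, 1), zlib.decompress),
    "zlib-9": (lambda d: zlib.compress(d, 9), zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def run(data):
    """Benchmark every codec in CODECS on the same input bytes."""
    return {name: benchmark(c, d, data) for name, (c, d) in CODECS.items()}


if __name__ == "__main__":
    # Synthetic, highly repetitive sample; real repositories will compress less.
    sample = b'func main() { fmt.Println("hello") }\n' * 20_000
    for name, (ratio, c_s, d_s) in run(sample).items():
        print(f"{name:8s} ratio={ratio:8.1f} "
              f"compress={c_s:.4f}s decompress={d_s:.4f}s")
```

Since every tool other than crawld only decompresses, the decompression column is the one to weigh most heavily when comparing candidates.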
