We note that 500,000 git source code repositories take up roughly 10 TB of storage space as tar archives. We currently have metadata for more than 3.5 million repositories in the database, and once we are done importing more data from the GHTorrent project, we will probably have around 10 million repositories (maybe even more).
Since 500k repositories probably give a good approximation of the average repository size (roughly 20 MB per repository as a tar archive), it is reasonable to estimate that 10 million repositories would require ~200 TB of storage space.
Hence, compressing source code repositories seems like a good idea with regard to storage space. However, I am worried that this might introduce too much overhead when processing data: crawld would have to uncompress + untar and then tar + compress again for the update operation, for instance, and other tools would be affected as well (repotool and srctool, hence mainly the language parsers).
If we do use compression, we should aim for a compression algorithm that is fast at both compressing and decompressing, at the cost of compression ratio if necessary. Note that only crawld would compress data; all the other tools would only decompress, so decompression speed is probably more important. Snappy or Zstandard are probably good candidates. A rough sketch of what this would look like follows below.
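As a minimal sketch (not part of crawld; the helper names and the choice of the `github.com/klauspost/compress/zstd` package are assumptions), here is how wrapping the existing tar streams in Zstandard at a fast compression level could look in Go. crawld would use something like `compressTar` when writing an archive, while repotool/srctool would only need the `decompressTar` path:

```go
package main

import (
	"io"
	"os"

	"github.com/klauspost/compress/zstd"
)

// compressTar wraps an existing tar stream with Zstandard compression.
// Hypothetical helper for illustration only.
func compressTar(dst io.Writer, src io.Reader) error {
	// SpeedFastest trades compression ratio for throughput, matching the
	// "fast to compress, fast to decompress" requirement discussed above.
	enc, err := zstd.NewWriter(dst, zstd.WithEncoderLevel(zstd.SpeedFastest))
	if err != nil {
		return err
	}
	if _, err := io.Copy(enc, src); err != nil {
		enc.Close()
		return err
	}
	return enc.Close()
}

// decompressTar reverses compressTar; this is the only step the
// read-only tools (repotool, srctool) would need before untarring.
func decompressTar(dst io.Writer, src io.Reader) error {
	dec, err := zstd.NewReader(src)
	if err != nil {
		return err
	}
	defer dec.Close()
	_, err = io.Copy(dst, dec)
	return err
}

func main() {
	// Example: repo.tar -> repo.tar.zst
	in, err := os.Open("repo.tar")
	if err != nil {
		panic(err)
	}
	defer in.Close()

	out, err := os.Create("repo.tar.zst")
	if err != nil {
		panic(err)
	}
	defer out.Close()

	if err := compressTar(out, in); err != nil {
		panic(err)
	}
}
```

Swapping Zstandard for Snappy would mostly mean replacing the encoder/decoder with `snappy.NewBufferedWriter` / `snappy.NewReader` from `github.com/golang/snappy`; the streaming structure stays the same either way.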