We note that 500,000 git source code repositories take up roughly 10 TB of storage space as tar archives. We currently have metadata for more than 3.5 million repositories in the database, and once we are done importing more data from the GHTorrent project, we will probably have around 10 million repositories (maybe even more).
Since 500k repositories probably give a good approximation of the average repository size (roughly 20 MB per repository as a tar archive), it is reasonable to estimate that 10 million repositories would require ~200 TB of storage space.
Hence, compressing source code repositories seems like a good idea with regard to storage space. However, I am worried that this might introduce too much overhead when processing data: crawld would have to uncompress + untar and then tar + compress again for the update operation, for instance, and other tools would be affected as well (repotool and srctool, hence mainly the language parsers).
If we do use compression, we should aim for a compression algorithm that is fast at both compressing and decompressing, at the cost of compression ratio if necessary. Note that only crawld would compress data; all the other tools would only decompress, so decompression speed is probably more important. Snappy or Zstandard are probably good candidates. A rough sketch of what this would look like follows below.
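As a minimal sketch (not part of crawld; the helper names and the choice of the `github.com/klauspost/compress/zstd` package are assumptions), here is how wrapping the existing tar streams in Zstandard at a fast compression level could look in Go. crawld would use something like `compressTar` when writing an archive, while repotool/srctool would only need the `decompressTar` path:

```go
package main

import (
	"io"
	"os"

	"github.com/klauspost/compress/zstd"
)

// compressTar wraps an existing tar stream with Zstandard compression.
// Hypothetical helper for illustration only.
func compressTar(dst io.Writer, src io.Reader) error {
	// SpeedFastest trades compression ratio for throughput, matching the
	// "fast to compress, fast to decompress" requirement discussed above.
	enc, err := zstd.NewWriter(dst, zstd.WithEncoderLevel(zstd.SpeedFastest))
	if err != nil {
		return err
	}
	if _, err := io.Copy(enc, src); err != nil {
		enc.Close()
		return err
	}
	return enc.Close()
}

// decompressTar reverses compressTar; this is the only step the
// read-only tools (repotool, srctool) would need before untarring.
func decompressTar(dst io.Writer, src io.Reader) error {
	dec, err := zstd.NewReader(src)
	if err != nil {
		return err
	}
	defer dec.Close()
	_, err = io.Copy(dst, dec)
	return err
}

func main() {
	// Example: repo.tar -> repo.tar.zst
	in, err := os.Open("repo.tar")
	if err != nil {
		panic(err)
	}
	defer in.Close()

	out, err := os.Create("repo.tar.zst")
	if err != nil {
		panic(err)
	}
	defer out.Close()

	if err := compressTar(out, in); err != nil {
		panic(err)
	}
}
```

Swapping Zstandard for Snappy would mostly mean replacing the encoder/decoder with `snappy.NewBufferedWriter` / `snappy.NewReader` from `github.com/golang/snappy`; the streaming structure stays the same either way.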