Define the archive format #9

killercup · 2018-12-08T16:03:18Z

Let's define the format of our archives.

Current state

A binary file that is actually just concatenated gzip blobs.

Features:

- Extract gzip files
- Append is trivial

Prior art

WARC. An implementation seems to live here.
- I have never used this, but someone pointed it out on Twitter.
- It's a long spec
- Not sure append is possible
.tar.gz files
- It's well-known
- It's from the 70s with all the 'features' that come with it
- Append?

What I learned: GZIP members

While reading the WARC spec I found this interesting section:

As specified in 2.2 of the GZIP specification (see [RFC 1952]), a valid GZIP file consists of any number of GZIP “members”, each independently compressed.

Where possible, this property should be exploited to compress each record of a WARC file independently. This results in a valid GZIP file whose per-record subranges also stand alone as valid GZIP files.

External indexes of WARC file content may then be used to record each record’s starting position in the GZIP file, allowing for random access of individual records without requiring decompression of all preceding records.

I did not know this about gzip! If I'm reading this correctly, it means that we can, in theory, use files compatible with tar (or WARC), with the additional requirement that each file is a new GZIP member (so that the slices our index file points to are still valid gzip files we can serve).
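To make the GZIP member property concrete, here is a minimal Python sketch. It is an illustration only, not the crate's actual implementation: the `build_archive` and `member_slice` helpers and the `(offset, length)` index layout are assumptions made for this example.

```python
import gzip


def build_archive(files):
    """Concatenate one gzip member per file.

    Returns the archive bytes and a hypothetical index that maps each
    name to the (offset, length) of its member inside the archive.
    """
    blob = bytearray()
    index = {}
    for name, content in files.items():
        member = gzip.compress(content)  # one independent gzip member
        index[name] = (len(blob), len(member))
        blob += member
    return bytes(blob), index


def member_slice(blob, index, name):
    """Return the still-compressed bytes for one file."""
    offset, length = index[name]
    return blob[offset:offset + length]


files = {
    "index.html": b"<h1>docs</h1>",
    "main.css": b"body { margin: 0 }",
}
blob, index = build_archive(files)

# Each slice stands alone as a valid gzip file ...
css = gzip.decompress(member_slice(blob, index, "main.css"))

# ... and the whole archive is itself a valid multi-member gzip file.
everything = gzip.decompress(blob)
```

A server could send `member_slice(...)` as-is with `Content-Encoding: gzip`, without touching the other members.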

Options

continue to use a custom archive format, but specify it, and maybe add some stuff to ensure forward compatibility
use tar, and find a way to ensure gzip members are used
- research how to do append, or skip appending as an update strategy altogether

cc @QuietMisdreavus

QuietMisdreavus · 2018-12-08T16:44:08Z

It's worth noting that rustdoc doesn't just want to append to an archive, but also to update files that already exist in the archive...

For context: When Cargo runs a cargo doc command, it invokes rustdoc multiple times on the same output directory, once for each dependency. This allows it to update a handful of shared files - the search index, the new source files index, the shared CSS/JS/font resources - so that the whole dependency tree can act like a single unit. The important piece here is that we need to be able to read in the existing search index (for example), add in the records for the crate being documented, and save it back into the archive.

If I understand the current format correctly (note: I have not done any actual reading on it), this could be as trivial as removing it from the current archive, modifying it in-memory, then saving it on the end and updating the index appropriately. But if static-filez goes to a format where the files are going to be more interleaved, that will be more difficult. (It sounds like that's not going to happen, but it's worth noting.)

killercup · 2018-12-08T17:07:15Z

A quick way to "support" this is to just append the overwritten files and have the index point at the last version only.
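A rough Python sketch of that append-and-repoint idea, applied to the search-index case described above. The `AppendOnlyArchive` class, its method names, and the file names are hypothetical and chosen only for illustration; the real index lives in a separate file and may look quite different.

```python
import gzip


class AppendOnlyArchive:
    """Toy archive: gzip members are only ever appended.

    The index maps a name to the (offset, length) of its latest version.
    """

    def __init__(self):
        self.blob = bytearray()
        self.index = {}

    def put(self, name, content):
        member = gzip.compress(content)
        # Overwriting a file just repoints the index at the new member.
        self.index[name] = (len(self.blob), len(member))
        self.blob += member

    def get_compressed(self, name):
        offset, length = self.index[name]
        return bytes(self.blob[offset:offset + length])

    def get(self, name):
        return gzip.decompress(self.get_compressed(name))


archive = AppendOnlyArchive()
archive.put("search-index.js", b"crate_a\n")
archive.put("crate_a/index.html", b"<h1>crate_a</h1>")

# A later rustdoc run reads the shared file, extends it, and appends it.
updated = archive.get("search-index.js") + b"crate_b\n"
archive.put("search-index.js", updated)

# Bytes still referenced by the index, versus the total archive size.
live_bytes = sum(length for _, length in archive.index.values())
```

The trade-off is that superseded versions stay in the archive as dead bytes (`len(archive.blob) > live_bytes`) until something rewrites the file.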


lnicola · 2019-02-13T15:25:21Z

Why not ZIP? .tgz is a pretty poor format for random access, and would probably require an external index.

killercup · 2019-02-13T15:58:28Z

Does a zip archive allow us to get files out of it as individual gzip streams so we can send them without extracting and re-compressing?

lnicola · 2019-02-13T16:08:36Z

The format itself should allow you to get a deflated stream directly out of the archive. You can test this with zip foo.zip foo.txt and zlib-flate -compress < foo.txt > foo.d, then by looking at both with a hex editor. foo.zip will have an extra header and footer, but the compressed data is identical. I don't know if the zip crate allows that, but it sounds like useful functionality that could reasonably be added to it.
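As a sketch of what that could look like, the Python below pulls the raw deflate bytes of one entry out of a ZIP archive and wraps them in a gzip header and trailer, so they can be served as gzip without recompressing. This works because ZIP and gzip use the same CRC-32 and both store the uncompressed size. The helper names `raw_deflate_from_zip` and `wrap_as_gzip` are made up for this example, and it assumes an unencrypted entry stored with the deflate method; it says nothing about what the zip crate exposes.

```python
import gzip
import io
import struct
import zipfile


def raw_deflate_from_zip(zip_bytes, name):
    """Return (raw deflate bytes, CRC-32, uncompressed size) for one entry."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        info = zf.getinfo(name)
    if info.compress_type != zipfile.ZIP_DEFLATED:
        raise ValueError("entry is not deflate-compressed")
    # The local file header has 30 fixed bytes; the name length and
    # extra-field length sit at offsets 26 and 28 within it.
    start = info.header_offset
    name_len, extra_len = struct.unpack_from("<HH", zip_bytes, start + 26)
    data_start = start + 30 + name_len + extra_len
    raw = zip_bytes[data_start:data_start + info.compress_size]
    return raw, info.CRC, info.file_size


def wrap_as_gzip(raw, crc, size):
    """Wrap raw deflate data in a minimal gzip member (RFC 1952)."""
    # magic, CM=8 (deflate), no flags, MTIME=0, XFL=0, OS=255 (unknown)
    header = b"\x1f\x8b\x08\x00" + b"\x00\x00\x00\x00" + b"\x00\xff"
    trailer = struct.pack("<II", crc, size & 0xFFFFFFFF)
    return header + raw + trailer


content = b"hello archive\n" * 100
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("foo.txt", content)

raw, crc, size = raw_deflate_from_zip(buf.getvalue(), "foo.txt")
gz = wrap_as_gzip(raw, crc, size)
roundtrip = gzip.decompress(gz)
```

The wrapping only costs a fixed 18 bytes per response and no compression work.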

Wrt. browser support, both Firefox and Chrome send Accept-Encoding: gzip, deflate.

In any case, I don't expect a documentation browser to get thousands of requests per second.

lnicola · 2019-02-13T16:16:17Z

Yes, it can:

lnicola · 2019-02-13T17:54:24Z

Another thing to consider is that if you're just browsing the docs on your computer, you might as well send the files to the browser without compression. And if you want to host your crate's documentation somewhere, static file hosting is probably more accessible than a VPS or something that can run code.

I'm not sure what other use cases you're thinking of. Being able to serve compressed content might ultimately be a nice feature, but wouldn't really matter.

killercup · 2019-02-13T17:59:09Z

Interesting. My main concern with this crate is making a *very* efficient way to store and serve compressed data, and while the motivation is the use with rustdoc, ideally it doesn't end there. So, when we choose a new archive format, I wouldn't want it to have worse performance than the ad-hoc solution we have right now; it should only add compatibility, either with existing applications or with future versions/features of this/rustdoc.


lnicola · 2019-02-13T18:01:46Z

Fair enough, there's nothing bad in wanting it to be as fast as possible.

Nemo157 · 2020-04-17T11:05:56Z

It would be nice to support alternative compression formats; brotli and zstd would both be useful, as they compress HTML better than gzip. Maybe the index could record a global or per-file format, and maybe even support multiple formats to allow the server to negotiate which to serve.
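One way the negotiation could work, sketched in Python. Everything here is an assumption for illustration: the `SERVER_PREFERENCE` order, the function names, and the idea that the index lists which codings are stored for each file. It does only simplified `Accept-Encoding` parsing (codings and `q` values).

```python
def parse_accept_encoding(header):
    """Return {coding: q} from an Accept-Encoding header value."""
    prefs = {}
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        coding, _, params = part.partition(";")
        q = 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0  # treat a malformed weight as "not acceptable"
        prefs[coding.strip().lower()] = q
    return prefs


# Hypothetical server-side order, used to break ties between codings.
SERVER_PREFERENCE = ["zstd", "br", "gzip"]


def choose_encoding(accept_encoding, available):
    """Pick the stored coding the client weights highest, or None."""
    prefs = parse_accept_encoding(accept_encoding)
    best, best_q = None, 0.0
    for coding in SERVER_PREFERENCE:
        if coding not in available:
            continue
        q = prefs.get(coding, prefs.get("*", 0.0))
        if q > best_q:  # strict: earlier server preference wins ties
            best, best_q = coding, q
    return best


# Codings stored for one file, as a per-file index entry might record them.
stored = {"gzip", "br"}
old_browser = choose_encoding("gzip, deflate", stored)
new_browser = choose_encoding("gzip, deflate, br", stored)
```

When `choose_encoding` returns `None`, the server would have to decompress one stored version and send the file unencoded.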
