Define the archive format #9

Open
killercup opened this issue Dec 8, 2018 · 10 comments

@killercup
Owner

Let's define the format of our archives.

Current state

A binary file that is actually just concatenated gzip blobs.

Features:

  1. Extract gzip files
  2. Append is trivial
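As a rough illustration of the current state described above, here is a minimal Python sketch of an append-only file of concatenated gzip blobs with a separate index. The names `append_blob`, `read_blob`, and the `(offset, length)` index layout are hypothetical and chosen for illustration; they are not taken from static-filez.

```python
import gzip
import io


def append_blob(archive: io.BytesIO, index: dict, name: str, data: bytes) -> None:
    """Compress `data` as its own gzip blob and append it to the archive."""
    blob = gzip.compress(data)
    archive.seek(0, io.SEEK_END)          # appending never touches earlier blobs
    offset = archive.tell()
    archive.write(blob)
    index[name] = (offset, len(blob))     # hypothetical index layout


def read_blob(archive: io.BytesIO, index: dict, name: str) -> bytes:
    """Return the stored gzip bytes for `name` without decompressing them."""
    offset, length = index[name]
    archive.seek(offset)
    return archive.read(length)


archive = io.BytesIO()
index = {}
append_blob(archive, index, "index.html", b"<h1>Hello</h1>")
append_blob(archive, index, "main.css", b"body { margin: 0 }")

# The slice is already a complete gzip stream, so a server could send it
# as-is with Content-Encoding: gzip.
css_gz = read_blob(archive, index, "main.css")
```

Appending only needs the end of the file and one new index entry, which is why it is cheap in this layout.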

Prior art

  • WARC. An implementation seems to live here.
    • I have never used this, but someone pointed it out on Twitter.
    • It's a long spec
    • Not sure append is possible
  • .tar.gz files
    • It's well-known
    • It's from the 70s with all the 'features' that come with it
    • Append?

What I learned: GZIP members

While reading the WARC spec I found this interesting section:

As specified in 2.2 of the GZIP specification (see [RFC 1952]), a valid GZIP file consists of any number of GZIP “members”, each independently compressed.

Where possible, this property should be exploited to compress each record of a WARC file independently. This results in a valid GZIP file whose per-record subranges also stand alone as valid GZIP files.

External indexes of WARC file content may then be used to record each record’s starting position in the GZIP file, allowing for random access of individual records without requiring decompression of all preceding records.

I did not know this about gzip! If I'm reading this correctly, it means that we can, in theory, use files compatible with tar (or WARC), with the additional requirement that each file is a new GZIP member (so that our index file can keep pointing to slices that are valid gzip files we can serve).
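The member property quoted above can be checked with Python's standard `gzip` module. This is only a sketch of the idea: the record contents and the `offsets` list standing in for an external index are made up for illustration.

```python
import gzip

records = [b"first record\n", b"second record\n", b"third record\n"]

# Compress every record independently; each result is one gzip member.
members = [gzip.compress(record) for record in records]
combined = b"".join(members)

# The concatenation is itself a valid gzip file.
whole = gzip.decompress(combined)

# A stand-in for an external index: (offset, length) of each member.
offsets = []
position = 0
for member in members:
    offsets.append((position, len(member)))
    position += len(member)

# Random access: decompress only the second member.
start, length = offsets[1]
second = gzip.decompress(combined[start:start + length])
```

Decompressing `combined` yields all records in order, while the slice for a single member decompresses on its own without reading anything before it.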

Options

  • continue to use a custom archive format, but specify it, and maybe add some fields to ensure forward compatibility
  • use tar, and find a way to ensure gzip members are used
    • research how to do append, or skip appending as an update strategy altogether

cc @QuietMisdreavus

@QuietMisdreavus

It's worth noting that rustdoc doesn't just want to append to an archive, but also to update files that already exist in the archive...

For context: When Cargo runs a cargo doc command, it invokes rustdoc multiple times on the same output directory, once for each dependency. This allows it to update a handful of shared files - the search index, the new source files index, the shared CSS/JS/font resources - so that the whole dependency tree can act like a single unit. The important piece here is that we need to be able to read in the existing search index (for example), add in the records for the crate being documented, and save it back into the archive.

If I understand the current format correctly (note: I have not done any actual reading on it), this could be as trivial as removing the file from the current archive, modifying it in memory, then saving it at the end and updating the index appropriately. But if static-filez goes to a format where the files are more interleaved, that will be more difficult. (It sounds like that's not going to happen, but it's worth noting.)
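A minimal sketch of that read, modify, and re-append cycle, assuming an in-memory archive and an index that maps names to `(offset, length)`. All function names are hypothetical. Unlike the description above, this version does not remove the old bytes; it leaves them as dead space and only repoints the index, which is a simplification.

```python
import gzip


def append_entry(archive: bytearray, index: dict, name: str, data: bytes) -> None:
    """Append `data` as a new gzip blob and point the index at it."""
    blob = gzip.compress(data)
    index[name] = (len(archive), len(blob))
    archive.extend(blob)


def read_entry(archive: bytearray, index: dict, name: str) -> bytes:
    """Decompress the blob the index currently points to."""
    offset, length = index[name]
    return gzip.decompress(bytes(archive[offset:offset + length]))


def update_entry(archive: bytearray, index: dict, name: str, modify) -> None:
    """Read the current content, modify it in memory, append the result."""
    old = read_entry(archive, index, name) if name in index else b""
    # The previous blob stays in the archive as unreferenced dead space.
    append_entry(archive, index, name, modify(old))


archive = bytearray()
index = {}
append_entry(archive, index, "search-index.js", b"crate_a\n")
append_entry(archive, index, "style.css", b"body{}")

# A later rustdoc run adds its records to the shared search index.
update_entry(archive, index, "search-index.js", lambda old: old + b"crate_b\n")
```

Reclaiming the dead space would need a separate compaction pass that rewrites the archive and the index together.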

@killercup
Owner Author

killercup commented Dec 8, 2018 via email

@lnicola

lnicola commented Feb 13, 2019

Why not ZIP? .tgz is a pretty poor format for random access, and would probably require an external index.

@killercup
Owner Author

Does a zip archive allow us to get files out of it as individual gzip streams so we can send them without extracting and re-compressing?

@lnicola

lnicola commented Feb 13, 2019

The format itself should allow you to get a deflated stream directly out of the archive. You can test this with zip foo.zip foo.txt and zlib-flate -compress < foo.txt > foo.d, then by looking at both with a hex editor. foo.zip will have an extra header and footer, but the compressed data is identical. I don't know if the zip crate allows that, but it sounds like useful functionality that could reasonably be added to it.
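To make this concrete, here is a hedged Python sketch using only the standard library. It pulls the raw deflate bytes of one entry out of a ZIP file by skipping the local file header, then wraps them in a gzip header and trailer without recompressing. This works because ZIP stores the same CRC-32 and uncompressed size that the gzip trailer needs. The function names are hypothetical, and this says nothing about what the Rust zip crate exposes.

```python
import gzip
import io
import struct
import zipfile
import zlib


def raw_deflate_from_zip(zip_bytes: bytes, name: str):
    """Return (ZipInfo, raw deflate bytes) for one entry, without inflating."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        info = zf.getinfo(name)
    if info.compress_type != zipfile.ZIP_DEFLATED:
        raise ValueError("entry is not deflate-compressed")
    # Local file header: 30 fixed bytes, then the file name and extra field,
    # whose lengths are stored at byte offsets 26 and 28.
    fixed = zip_bytes[info.header_offset:info.header_offset + 30]
    name_len, extra_len = struct.unpack("<HH", fixed[26:30])
    start = info.header_offset + 30 + name_len + extra_len
    return info, zip_bytes[start:start + info.compress_size]


def gzip_member_from_zip(zip_bytes: bytes, name: str) -> bytes:
    """Wrap a ZIP entry's deflate stream in a minimal gzip header and trailer."""
    info, raw = raw_deflate_from_zip(zip_bytes, name)
    # Magic, method 8 (deflate), no flags, mtime 0, no extra flags, OS unknown.
    header = b"\x1f\x8b\x08\x00" + b"\x00" * 4 + b"\x00\xff"
    trailer = struct.pack("<II", info.CRC, info.file_size & 0xFFFFFFFF)
    return header + raw + trailer


payload = b"hello hello hello hello\n" * 20
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("foo.txt", payload)

_, raw = raw_deflate_from_zip(buffer.getvalue(), "foo.txt")
inflated = zlib.decompress(raw, wbits=-15)   # negative wbits means raw deflate
member = gzip_member_from_zip(buffer.getvalue(), "foo.txt")
```

One caveat: HTTP defines the `deflate` content coding as the zlib wrapper around deflate rather than the raw stream stored in a ZIP entry, so wrapping the bytes as gzip, as above, is likely the safer way to serve them.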

Wrt. browser support, both Firefox and Chrome send Accept-Encoding: gzip, deflate.

In any case, I don't expect a documentation browser to get thousands of requests per second.

@lnicola

lnicola commented Feb 13, 2019

Another thing to consider is that if you're just browsing the docs on your computer, you might as well send the files to the browser without compression. And if you want to host your crate's documentation somewhere, static file hosting is probably more accessible than a VPS or something that can run code.

I'm not sure what other use cases you're thinking of. Being able to serve compressed content might ultimately be a nice feature, but it wouldn't really matter for these two.

@killercup
Owner Author

killercup commented Feb 13, 2019 via email

@lnicola

lnicola commented Feb 13, 2019

Fair enough, there's nothing bad in wanting it to be as fast as possible.

@Nemo157

Nemo157 commented Apr 17, 2020

It would be nice to support alternative compression formats; brotli and zstd would both be useful, as they compress HTML better than gzip. Maybe the index could record a global or per-file format, and maybe even support multiple formats per file to allow the server to negotiate which one to serve.
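One way such negotiation could look, sketched in Python. The index layout (one `(offset, length)` slice per encoding), the `SERVER_PREFERENCE` order, and both function names are assumptions made for illustration. The `Accept-Encoding` parsing is simplified and ignores the `*` wildcard.

```python
# Hypothetical server-side preference, best compression first.
SERVER_PREFERENCE = ["br", "zstd", "gzip"]


def parse_accept_encoding(header: str) -> dict:
    """Map each listed encoding to its q-value (default 1.0)."""
    preferences = {}
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, params = part.partition(";")
        quality = 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0   # treat a malformed q-value as "not acceptable"
        preferences[name.strip().lower()] = quality
    return preferences


def choose_variant(entry: dict, accept_encoding: str):
    """Pick the first server-preferred encoding that the client accepts."""
    accepted = parse_accept_encoding(accept_encoding)
    for encoding in SERVER_PREFERENCE:
        if encoding in entry and accepted.get(encoding, 0.0) > 0:
            return encoding, entry[encoding]
    return None   # caller would fall back to decompressing and sending identity


# Hypothetical per-file index entry with two stored encodings.
index = {"index.html": {"gzip": (0, 120), "br": (120, 95)}}

older_client = choose_variant(index["index.html"], "gzip, deflate")
newer_client = choose_variant(index["index.html"], "gzip, deflate, br")
```

Storing several encodings per file trades archive size for the ability to serve each client its best supported format without recompressing.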
