Should WARC records on the distributed web default to a flat list of hashes, or should we crawl to directories #13

Open · b5 opened this issue Aug 16, 2017 · 4 comments

b5 (Member) commented Aug 16, 2017

I'm seeking feedback on a decision regarding setting a sensible default for writing WARC records to the distributed web. It has implications for de-duplication between archives, and might also have technical implications for IPFS itself, given the number of files in question. It boils down to whether we write lots & lots of small files to top-level hashes without any directories (using an index to coordinate them), or organize those same files into directories & publish the directory.

lots-of-files form:

ipfs
    ├── QmPUppm3rJ7szjH8cvafitWVkFwPhSK85w8ZSgSLy37Fzy # index.cdxj
    ├── QmPyMc9SFDUSvNhd9HS9fDGKumjmuNdfQ7NTkD1k8p1MHR # archive_header_1.warc
    ├── QmQ1oV7NenYqq4BkMd7dNm8CVT6WAVqvJqupAVGCo3798x # archive_record_1.warc
    ├── QmQN1RjGraspPt84Bs9bvEvRsksJXBdAjiN8mPuzxJ6RmL # archive_header_2.warc
    ├── QmR51w5YmM18pm4rAuuAtpnrXakv5X4MMwUnWPuA8VZsSZ # archive_record_2.warc
    └──...

directory form:

ipfs
    └── QmP6XohRBYc8y52dpyJsANYq1SZBzy618t388Wk2SPKNG9
        ├── index.cdxj
        ├── archive_header_1.warc
        ├── archive_record_1.warc
        ├── archive_header_2.warc
        ├── archive_record_2.warc
        └──...
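
For concreteness, here's a rough sketch of how each form might be produced with the IPFS CLI from Python. The filenames, the crawl directory name, and the use of subprocess are all illustrative; the parsing assumes ipfs add's default "added <hash> <name>" output.

    import subprocess

    record_files = ["index.cdxj", "archive_header_1.warc", "archive_record_1.warc"]

    def ipfs_add(path, recursive=False):
        """Run `ipfs add` and return (hash, name) pairs parsed from its output."""
        cmd = ["ipfs", "add"] + (["-r"] if recursive else []) + [path]
        out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
        return [line.split()[1:3] for line in out.splitlines() if line.startswith("added")]

    # lots-of-files form: each record is added on its own, yielding one
    # top-level hash per file; an external index (index.cdxj) ties them together
    flat_hashes = {name: h for path in record_files for h, name in ipfs_add(path)}

    # directory form: the whole crawl directory is added recursively and only
    # the directory's root hash is published
    dir_root = ipfs_add("crawl_directory", recursive=True)[-1][0]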

If I understand IPFS correctly, the lots-of-files form leads to higher chances that two archives will naturally deduplicate by having records of the same URL collide on the same hash. The possible downside (again, if I understand this correctly) is that this strategy places a lot of additional pressure on the distributed hash tables that IPFS needs to maintain to resolve who has what. I'd love to hear from anyone at Protocol Labs on whether this is true or not. Also, does a file nested in a directory structure have a hash of its own, and can the network resolve that hash even though it's embedded within another? I have a hunch on these questions, but I'd like to confirm.

It's also worth mentioning that, either way, we will end up building support for both; the question is which should be the default.

A WARC file is a collection of WARC records. As I've heard them described thus far, a collection of WARC files is often used to represent a crawl. To me it makes a lot of sense to say "this set of files encapsulates a discrete crawl that we performed using blah crawler with blah settings". As far as I can tell, this boundary-setting is for two reasons:

  1. Helping our puny brains get our heads around what this is an archive of as a discrete entity (encapsulated, discrete crawls).
  2. Resource constraints.

From the WARC 1.1 spec:

Per section 2.2 of the GZIP specification, a valid GZIP file consists of
any number of gzip "members", each independently compressed.
Where possible, this property should be exploited to compress each
record of a WARC file independently. This results in a valid GZIP file
whose per-record subranges also stand alone as valid GZIP files.
External indexes of WARC file content may then be used to record each
record's starting position in the GZIP file, allowing for random access
of individual records without requiring decompression of all preceding
records.
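
To make that concrete, here's a minimal Python sketch of the same idea (the record bodies and the index layout are made up for illustration): each record is compressed as its own gzip member, and a small offset index lets a reader decompress one record without touching the ones before it.

    import gzip

    # toy stand-ins; real WARC records carry full WARC headers and payloads
    records = [b"WARC/1.1\r\n...record one...", b"WARC/1.1\r\n...record two..."]

    # write each record as an independent gzip member, noting where it starts
    index = []  # (offset, compressed_length) per record
    with open("crawl.warc.gz", "wb") as f:
        for rec in records:
            member = gzip.compress(rec)
            index.append((f.tell(), len(member)))
            f.write(member)

    # random access: pull back only the second record using the index
    offset, length = index[1]
    with open("crawl.warc.gz", "rb") as f:
        f.seek(offset)
        assert gzip.decompress(f.read(length)) == records[1]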

I'd like to imagine crawling continuously, directly onto the distributed web. When moving from one internet to another internet, encapsulation doesn't really apply. By writing individual records themselves directly onto a content-addressed network (and after a big conversation to be had about compression), we arrive at a world where crawlers naturally de-duplicate themselves. So, barring this breaking the distributed web, I'd love to be writing using the lots-of-files option, and leave the directory-based option for "special projects" where a user would want to encapsulate a set of archives because they only make sense together.
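
To illustrate the de-duplication intuition, here's a toy sketch using a plain SHA-256 digest as a stand-in for a real IPFS hash (the URLs and record bodies are invented): two crawls that capture byte-identical responses end up pointing at the same key, so the record is stored once and only the per-crawl indexes differ.

    import hashlib

    def content_key(record_bytes):
        """Stand-in for content addressing: identical bytes -> identical key."""
        return hashlib.sha256(record_bytes).hexdigest()

    crawl_a = {"http://example.com/": b"HTTP/1.1 200 OK\r\n\r\nunchanged page"}
    crawl_b = {"http://example.com/": b"HTTP/1.1 200 OK\r\n\r\nunchanged page"}

    index_a = {url: content_key(body) for url, body in crawl_a.items()}
    index_b = {url: content_key(body) for url, body in crawl_b.items()}

    # the same record resolves to the same key from both crawls
    assert index_a["http://example.com/"] == index_b["http://example.com/"]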

@b5 b5 added this to the WARC Spec Parity milestone Aug 16, 2017
b5 (Member, Author) commented Aug 16, 2017

cc @machawk1, @flyingzumwalt

machawk1 (Member) commented Aug 16, 2017

does a file nested in a directory structure have a hash of its own

Yes; note the delta in the "b5" directory hash below when an additional file is added, while the per-file hashes stay the same.

$ mkdir b5
$ echo 'Test A' > b5/testA.txt
$ echo 'Test B' > b5/testB.txt
$ ipfs add b5/testA.txt 
added QmXyTiR9CWMBEZH8yHuK9Me61nZSX6WGYG9JRY8nYaNw5U testA.txt
$ ipfs add -r b5
added QmXyTiR9CWMBEZH8yHuK9Me61nZSX6WGYG9JRY8nYaNw5U b5/testA.txt
added QmcbamY3sADFPsutCopyiobmqSME4KBWhihQycQ8Bg52wE b5/testB.txt
added QmRR1cC8gGdZ5ZyHGDSsvzBfeFZWuNyF8how96WZ1mP147 b5
$ echo 'Test C' > b5/testC.txt
$ ipfs add -r b5
added QmXyTiR9CWMBEZH8yHuK9Me61nZSX6WGYG9JRY8nYaNw5U b5/testA.txt
added QmcbamY3sADFPsutCopyiobmqSME4KBWhihQycQ8Bg52wE b5/testB.txt
added QmNrN6pMPiavxn57YVeXS3mxuXd7YY3LoZXMneNM3ELW4J b5/testC.txt
added QmbEwzNK42thvhghHEwv8c6XZsUDwp9enJN5pGs9W211rU b5

machawk1 (Member) commented
As another caveat with ramifications for de-duplication, the entity-body in the HTTP response may vary over time; for example, ipfs add of differently chunked payloads will result in different hashes. In ipwb (see #125), we de-chunk (un-chunk?) the payload prior to pushing it into IPFS, so we don't need to worry about chunked responses when pulling it back from IPFS for replay. Manipulating the content is usually a preservation no-no.
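
A simplified sketch of the de-chunking idea (not ipwb's actual implementation; it handles only well-formed chunked bodies without trailers): stripping the Transfer-Encoding framing means byte-identical payloads hash identically no matter how a server happened to chunk them.

    def dechunk(body: bytes) -> bytes:
        """Decode an HTTP/1.1 chunked transfer-encoded body into the raw payload."""
        payload, pos = b"", 0
        while True:
            crlf = body.index(b"\r\n", pos)
            size = int(body[pos:crlf].split(b";")[0], 16)  # chunk size; ignore extensions
            if size == 0:
                break
            payload += body[crlf + 2:crlf + 2 + size]
            pos = crlf + 2 + size + 2  # skip chunk data and its trailing CRLF
        return payload

    chunked = b"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n"
    assert dechunk(chunked) == b"Wikipedia"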

KrzysztofMadejski (Member) commented
Just to raise a potential side issue regarding filesystem operations: having millions of files/directories in one directory quite often slows listings a lot. If listing is a common operation, it may be good to break hashes into one or two levels of subdirectories, such as the layout below (a small sketch of the sharding logic follows the tree):

ipfs
    ├── QmP
    │   ├── Uppm3rJ7szjH8cvafitWVkFwPhSK85w8ZSgSLy37Fzy # index.cdxj
    │   └── yMc9SFDUSvNhd9HS9fDGKumjmuNdfQ7NTkD1k8p1MHR # archive_header_1.warc
    ├── QmQ
    │   ├── 1oV7NenYqq4BkMd7dNm8CVT6WAVqvJqupAVGCo3798x # archive_record_1.warc
    │   └── N1RjGraspPt84Bs9bvEvRsksJXBdAjiN8mPuzxJ6RmL # archive_header_2.warc
    ├── QmR
    │   └── 51w5YmM18pm4rAuuAtpnrXakv5X4MMwUnWPuA8VZsSZ # archive_record_2.warc
    └── ...
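
A tiny sketch of the path-sharding logic (the prefix length and the use of plain filesystem paths are arbitrary choices for illustration):

    import os

    def sharded_path(root: str, cid: str, prefix_len: int = 3) -> str:
        """Map a hash like 'QmPUppm3...' to '<root>/QmP/Uppm3...' so that no
        single directory accumulates millions of entries."""
        return os.path.join(root, cid[:prefix_len], cid[prefix_len:])

    print(sharded_path("ipfs", "QmPUppm3rJ7szjH8cvafitWVkFwPhSK85w8ZSgSLy37Fzy"))
    # ipfs/QmP/Uppm3rJ7szjH8cvafitWVkFwPhSK85w8ZSgSLy37Fzy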
