I'm seeking feedback on a decision about setting a sensible default for writing WARC records to the distributed web. It has implications for de-duplication between archives, and might also have technical implications for IPFS itself, given the number of files in question. It boils down to whether we write lots & lots of small files as top-level hashes without any directories (using an index to coordinate them), or organize those same files into directories & publish the directory.
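For concreteness, here is a minimal sketch of the two layouts, assuming a local go-ipfs daemon with the ipfs CLI on the path; the "records" directory and filenames are purely illustrative:

```python
import subprocess
from pathlib import Path

def add_file(path):
    """ipfs add -q prints only the hash of the added file."""
    out = subprocess.run(["ipfs", "add", "-q", str(path)],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip().splitlines()[-1]

records = Path("records")  # hypothetical directory of individual WARC records

# lots-of-files form: every record becomes its own top-level hash,
# and a separate index maps record -> hash.
index = {p.name: add_file(p) for p in records.glob("*.warc.gz")}

# directory form: one recursive add; the last hash printed is the
# directory's root hash, which is what gets published/pinned.
out = subprocess.run(["ipfs", "add", "-q", "-r", str(records)],
                     capture_output=True, text=True, check=True)
dir_hash = out.stdout.strip().splitlines()[-1]
```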
If I understand IPFS correctly, the lots-of-files form leads to higher chances that two archives will naturally de-duplicate by having records of the same URL collide on the same hash. The possible downside (again, if I understand this correctly) is that this strategy places a lot of additional pressure on the distributed hash tables that IPFS needs to maintain to resolve who has what. I'd love to hear from anyone at Protocol Labs on whether this is true or not. Also, does a file nested in a directory structure have a hash of its own, and can the network resolve that hash even though it's embedded within another? I have a hunch on these questions, but I'd like to confirm.
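On the nested-hash question, one way to poke at the behaviour against a local node (the directory hash below is a placeholder from an earlier add):

```python
import subprocess

DIR_HASH = "<directory-hash>"  # placeholder: root hash from a directory add

# Each entry in a directory listing is shown with its own hash...
ls = subprocess.run(["ipfs", "ls", DIR_HASH],
                    capture_output=True, text=True, check=True)
first_child_hash = ls.stdout.split()[0]

# ...and that hash can be requested directly, with no reference to the
# parent directory. (Assumes the first entry is a file, not a nested dir.)
subprocess.run(["ipfs", "cat", first_child_hash], check=True)
```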
It's also worth mentioning that we will end up building support for both either way; the question is what the default should be.
A WARC file is a collection of WARC records. As I've heard them described thus far, a collection of WARC files is often used to represent a crawl. To me it makes a lot of sense to say "this set of files encapsulates a discrete crawl that we performed using blah crawler with blah settings". As far as I can tell, this boundary-setting exists for two reasons:
Letting our puny brains get our heads around what this is an archive of as a discrete entity (encapsulated, discrete crawls).
Resource constraints.
From the WARC 1.1 spec:
Per section 2.2 of the GZIP specification, a valid GZIP file consists of any number of gzip "members", each independently compressed.

Where possible, this property should be exploited to compress each record of a WARC file independently. This results in a valid GZIP file whose per-record subranges also stand alone as valid GZIP files.

External indexes of WARC file content may then be used to record each record's starting position in the GZIP file, allowing for random access of individual records without requiring decompression of all preceding records.
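A small sketch of the property the spec describes: each record compressed as its own gzip member, concatenated into one file, with an external index of byte offsets so a single record can be read back without decompressing anything that precedes it. The record contents and filename here are illustrative:

```python
import gzip
import zlib

records = [b"WARC/1.1\r\n...record one...", b"WARC/1.1\r\n...record two..."]

# Write: one gzip member per record, noting the offset where each member starts.
offsets = []
with open("crawl.warc.gz", "wb") as f:
    for rec in records:
        offsets.append(f.tell())
        f.write(gzip.compress(rec))

# Read: seek straight to a member and decompress just that one record.
def read_record(path, offset):
    with open(path, "rb") as f:
        f.seek(offset)
        # 16 + MAX_WBITS tells zlib to expect a gzip header; the
        # decompressor stops cleanly at the end of that single member.
        return zlib.decompressobj(wbits=16 + zlib.MAX_WBITS).decompress(f.read())

assert read_record("crawl.warc.gz", offsets[1]).startswith(b"WARC/1.1")
```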
I'd like to imagine crawling continuously, directly onto the distributed web. When moving from one internet to another internet, encapsulation doesn't really apply. By writing individual records directly onto a content-addressed network (after a big conversation still to be had about compression), we arrive at a world where crawlers naturally de-duplicate each other. So, barring this breaking the distributed web, I'd love to be writing with the lots-of-files option, and leave the directory-based option for "special projects" where a user wants to encapsulate a set of archives because they only make sense together.
As another caveat that has ramifications for de-duplication: the entity-body in the HTTP response may vary over time, i.e., ipfs add of differently chunked payloads will result in different hashes. In ipwb (see #125), we de-(un?)chunk the payload prior to pushing it into IPFS, so we don't need to worry about chunked responses when pulling back from IPFS for replay. Manipulating the content is usually a preservation no-no.
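For anyone unfamiliar with the issue, a minimal sketch of what de-chunking means here (this is the general idea, not ipwb's actual code): the chunked transfer coding wraps the payload in hex-length-prefixed chunks, so two servers can deliver byte-identical content that hashes differently unless the framing is stripped first.

```python
def unchunk(body: bytes) -> bytes:
    """Strip HTTP/1.1 chunked transfer framing, returning the raw payload.

    Two responses carrying the same payload but chunked differently will
    only hash to the same IPFS address after this normalization.
    """
    payload = b""
    pos = 0
    while True:
        # Each chunk starts with its size in hex (optionally followed by
        # extensions after a ';'), terminated by CRLF.
        line_end = body.index(b"\r\n", pos)
        size = int(body[pos:line_end].split(b";")[0], 16)
        if size == 0:
            break  # last-chunk; any trailers that follow are dropped
        start = line_end + 2
        payload += body[start:start + size]
        pos = start + size + 2  # skip the chunk data and its trailing CRLF
    return payload

assert unchunk(b"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n") == b"Wikipedia"
```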
Just to raise a potential side issue regarding filesystem operations: having millions of files/directories in a single directory quite often slows listings down a lot. If listing is a common operation, it may be good to break hashes into one or two levels of subdirectories, such as sharding on a few characters of the hash.
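A minimal sketch of that kind of sharding; the two-then-two character split after the common "Qm" prefix is arbitrary, not a recommendation:

```python
from pathlib import Path

def sharded_path(root: Path, record_hash: str) -> Path:
    # Skip the common "Qm" multihash prefix, then use two two-character
    # levels so no single directory accumulates millions of entries:
    # QmYwAPJz... -> root/Yw/AP/QmYwAPJz...
    return root / record_hash[2:4] / record_hash[4:6] / record_hash

# Usage with a placeholder hash:
dest = sharded_path(Path("records"), "QmYwExampleRecordHash")
dest.parent.mkdir(parents=True, exist_ok=True)
```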