IPWB-Compatible collection archiving task proof-of-concept #5

b5 · 2017-07-11T17:02:43Z

Connecting @machawk1 & @oduwsdl: oduwsdl/ipwb#211

We should define a task that works as follows. Start with a user-generated collection of URLs, and allow users to fire off a "task" that will:

1. Generate a WARC of that collection using https://github.com/datatogether/warc
2. Generate an IPWB-Compatible CDXJ file using https://github.com/datatogether/cdxj
3. Put all of that on IPFS
4. Demo the WARC in IPWB
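
To make the first step concrete, here is a minimal sketch of what a single WARC/1.0 response record looks like on the wire. It uses only the Python standard library and is not the datatogether/warc API; the function name `warc_response_record` and its parameters are assumptions made up for illustration.

```python
import uuid
from datetime import datetime, timezone


def warc_response_record(target_uri: str, http_response: bytes, when: datetime) -> bytes:
    """Wrap raw HTTP response bytes in one WARC/1.0 'response' record.

    Hypothetical helper for illustration only; `when` should be a
    timezone-aware datetime.
    """
    # WARC-Date is expressed in UTC.
    stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Target-URI: {target_uri}",
        f"WARC-Date: {stamp}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_response)}",
    ]
    # A blank line separates the WARC headers from the record block,
    # and two CRLFs terminate the record.
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    return head + http_response + b"\r\n\r\n"


if __name__ == "__main__":
    body = b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nhello"
    record = warc_response_record(
        "http://example.com/", body, datetime(2017, 7, 11, 17, 2, 43, tzinfo=timezone.utc)
    )
    print(record.decode("utf-8"))
```

A collection-level WARC would be a concatenation of such records, one or more per URL in the collection.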

machawk1 · 2017-07-11T17:25:23Z

When generating WARCs using user-agents other than browsers, it's possible that the capture will not be comprehensive enough for accurate replay. For example, if I run wget -p -k --warc-file=myarchive uriWithLotsOfJS.com, wget may not grab the representations of resources that are conventionally surfaced via JS. The same applies to dynamically built URIs, URIs nested inside embedded resources that themselves reference further URIs, etc.
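
As a rough illustration of the gap, the sketch below collects the URIs a non-JS agent can see by parsing markup alone. The sample page, the `StaticURICollector` class, and the `static_uris` helper are all made up for this example; real crawlers such as wget use their own link-extraction logic.

```python
from html.parser import HTMLParser


class StaticURICollector(HTMLParser):
    """Collect href/src attribute values, i.e. URIs visible without running JS."""

    URI_ATTRS = {"href", "src"}

    def __init__(self):
        super().__init__()
        self.uris = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in self.URI_ATTRS and value:
                self.uris.append(value)


# Hypothetical page: two resources are referenced in markup, one is only
# assembled at runtime by the script.
PAGE = """
<html><head><link rel="stylesheet" href="style.css"></head>
<body>
<img src="logo.png">
<script>
  var img = new Image();
  img.src = "/assets/" + "lazy" + ".png";
</script>
</body></html>
"""


def static_uris(html: str) -> list:
    collector = StaticURICollector()
    collector.feed(html)
    return collector.uris


if __name__ == "__main__":
    print(static_uris(PAGE))  # ['style.css', 'logo.png']
```

The image at /assets/lazy.png never appears in the result because its URI only exists once the script runs, so an agent working from markup alone would leave it out of the WARC.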

For the sake of a demo, it might be useful to first examine how many URIs could be missed when dereferencing the collection while creating the WARCs (the lib might need to handle this).

b5 · 2017-07-11T22:06:01Z

Interesting. Would you recommend applying user-agent spoofing at all? I'm thinking of this approach.

Either way, noted! Part of me thinks we should build or seek out some sort of "archiving obstacle course" to run tests against. If this doesn't already exist, it seems like it'd be worth having around for a number of different projects.

machawk1 · 2017-07-11T22:17:02Z

@b5 It's not necessarily the user-agent string but the capability of the agent. If the agent does not execute JS, some resource representations may not be surfaced and thus not archived by the tool.

A while back I put together the Archival Acid Test (more info in the short paper) to evaluate existing crawlers and archival tools, but that was a few years ago. Since then, I know the UK Web Archive started writing an evaluation mechanism, and I believe @N0taN3rd is in the process of rewriting and extending my previous tests.

N0taN3rd · 2017-07-27T03:15:18Z

@machawk1 @b5 Yes, I am currently compiling a Good Luck Youll Need It list with implementation.
But until that is finished, you can have some fun with iframe madness and a mini replay test for 2017-03-09: A State Of Replay or Location, Location, Location.

iframe madness is currently unarchivable (Internet Archive) for all non-high-fidelity archives.

IPWB is high-fidelity 👍
