This repository has been archived by the owner on Nov 7, 2019. It is now read-only.

IPWB-Compatible collection archiving task proof-of-concept #5

Open
b5 opened this issue Jul 11, 2017 · 4 comments

Comments

@b5
Member

b5 commented Jul 11, 2017

Connecting @machawk1 & @oduwsdl: oduwsdl/ipwb#211

We should define a task along these lines:

  1. Start with a user-generated collection of URLs. Allow users to fire off a "task" that will:
  2. Generate a WARC of that collection using https://github.com/datatogether/warc
  3. Generate an IPWB-compatible CDXJ file using https://github.com/datatogether/cdxj
  4. Put all of that on IPFS.
  5. Demo the WARC in IPWB.
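
As a rough illustration of the steps above, here is a minimal sketch of how such a task could chain its stages. Everything in it is an assumption made for illustration: `ArchiveTask`, `write_warc`, `index_cdxj`, and `push_to_ipfs` are hypothetical names, and the step bodies are placeholders rather than calls into datatogether/warc, datatogether/cdxj, or IPFS.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class ArchiveTask:
    """Hypothetical task: runs named steps in order over a shared context dict."""
    urls: List[str]
    steps: List[Tuple[str, Callable[[Dict], Dict]]] = field(default_factory=list)

    def add_step(self, name: str, fn: Callable[[Dict], Dict]) -> "ArchiveTask":
        self.steps.append((name, fn))
        return self

    def run(self) -> Dict:
        ctx: Dict = {"urls": list(self.urls), "log": []}
        for name, fn in self.steps:
            ctx = fn(ctx)            # each step reads and extends the context
            ctx["log"].append(name)  # record which steps completed
        return ctx


# Placeholder steps: a real task would call the WARC writer, the CDXJ
# indexer, and an IPFS client here. None of those APIs are modeled.
def write_warc(ctx: Dict) -> Dict:
    ctx["warc_path"] = "collection.warc"
    return ctx


def index_cdxj(ctx: Dict) -> Dict:
    ctx["cdxj_path"] = ctx["warc_path"].replace(".warc", ".cdxj")
    return ctx


def push_to_ipfs(ctx: Dict) -> Dict:
    ctx["ipfs_refs"] = ["<hash-placeholder>"]  # not a real IPFS hash
    return ctx


def build_task(urls: List[str]) -> ArchiveTask:
    return (ArchiveTask(urls)
            .add_step("warc", write_warc)
            .add_step("cdxj", index_cdxj)
            .add_step("ipfs", push_to_ipfs))


if __name__ == "__main__":
    result = build_task(["https://example.com/"]).run()
    print(result["log"])
```

Keeping each stage behind a plain callable would let the demo swap the placeholders for real implementations one at a time.
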
@machawk1
Member

When generating WARCs with user agents other than browsers, the capture may not be comprehensive enough for accurate replay. For example, if I run `wget -p -k --warc-file=myarchive uriWithLotsOfJS.com`, wget may not grab the representations of resources that are conventionally surfaced via JS. The same applies to dynamically built URIs, URIs nested inside other embedded resources, etc.

For the sake of a demo, it might be useful to first examine the potential for missed URIs when dereferencing them while creating the WARCs (the lib might need to handle this).
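
To make the concern concrete, here is a small sketch of what an agent that does not execute JS can see. It assumes a made-up page (`PAGE`) and a hypothetical `static_uris` helper built on Python's standard `html.parser`; it is not how wget or any archiving library works internally, only an illustration of the gap.

```python
from html.parser import HTMLParser

# Made-up page: one image is referenced in markup, the other is only
# assembled by the script at runtime.
PAGE = """
<html><body>
  <img src="/static/logo.png">
  <script>
    var img = new Image();
    img.src = "/dynamic/" + "banner" + ".png";
    document.body.appendChild(img);
  </script>
</body></html>
"""


class StaticLinkExtractor(HTMLParser):
    """Collects src/href attribute values found in the markup itself."""

    def __init__(self):
        super().__init__()
        self.uris = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("src", "href") and value:
                self.uris.add(value)


def static_uris(html):
    parser = StaticLinkExtractor()
    parser.feed(html)
    return parser.uris


if __name__ == "__main__":
    found = static_uris(PAGE)
    print(sorted(found))
    # The runtime-built URI never appears as an attribute, so it is missed.
    print("/dynamic/banner.png" in found)
```

A browser-based capture would request `/dynamic/banner.png` once the script runs; a markup-only pass has no way to know that URI exists.
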

@b5
Member Author

b5 commented Jul 11, 2017

Interesting. Would you recommend applying user-agent spoofing at all? I'm thinking of this approach.

Either way, noted! Part of me thinks we should build or seek out some sort of "archiving obstacle course" to run tests against. If this doesn't already exist, it seems like it'd be worth having around for a number of different projects.
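
One possible shape for scoring a run of such an obstacle course, sketched under assumptions: `score_capture` is a hypothetical helper, and the URI lists are invented. It only compares the URIs a test page is known to need against the URIs a capture actually recorded.

```python
def score_capture(expected, captured):
    """Report which expected URIs a capture missed and the fraction it covered."""
    expected, captured = set(expected), set(captured)
    missed = expected - captured
    if expected:
        coverage = (len(expected) - len(missed)) / len(expected)
    else:
        coverage = 1.0  # nothing was required, so nothing was missed
    return {"missed": sorted(missed), "coverage": coverage}


if __name__ == "__main__":
    # Invented example: the capture lacks the one JS-built resource.
    expected = ["/", "/static/logo.png", "/style.css", "/dynamic/banner.png"]
    captured = ["/", "/static/logo.png", "/style.css"]
    print(score_capture(expected, captured))
```
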

@machawk1
Member

@b5 It's not necessarily the user-agent string but the capability of the agent. If the agent does not execute JS, some resource representations may not be surfaced and thus not archived by the tool.

A while back I put together the Archival Acid Test (more info in the short paper) to evaluate existing crawlers/archival tools, but that was a few years ago. Since then, I know the UK Web Archive has started writing an evaluation mechanism, and I believe @N0taN3rd is in the process of rewriting and extending my previous tests.

@N0taN3rd

N0taN3rd commented Jul 27, 2017

@machawk1 @b5 Yes, I am currently compiling a "Good Luck You'll Need It" list with implementations.
But until that is finished, you can have some fun with iframe madness and a mini replay test for 2017-03-09: A State Of Replay or Location, Location, Location.

iframe madness is currently unarchivable (Internet Archive) for all non-high-fidelity archives.

IPWB is high-fidelity 👍
