Repository for scripts and tools for SUCHO.
failed_browsertrix_to_wayback.py
Gets all failed links from a Browsertrix crawl (hopefully you have none, but likely you'll have some!), checks whether those links are already in the Wayback Machine, and for any that are not, either generates a csv for the gsheets wayback service or sends them directly to the Wayback Machine.
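If you're curious how the "already archived?" check can work, here is a minimal sketch using the public Wayback Machine availability API; the script itself may implement this step differently, so treat the function name and details as illustrative.

```python
# Illustrative sketch (not the script itself): check whether a URL already
# has a snapshot in the Wayback Machine via the public availability API.
import requests

def is_in_wayback(url):
    """Return True if the Wayback Machine reports an existing snapshot for `url`."""
    resp = requests.get(
        "http://archive.org/wayback/available",
        params={"url": url},
        timeout=30,
    )
    resp.raise_for_status()
    closest = resp.json().get("archived_snapshots", {}).get("closest", {})
    return bool(closest.get("available"))
```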
Feel free to clone or fork this repository, or just copy the script directly into your working directory.
To run the script, you'll need to install all required packages from the requirements.txt file (pip install -r requirements.txt).
Then, to run the script, you can input the following command into your terminal:
python3 failed_browsertrix_to_wayback.py --yaml_path=<path to yaml file> --output_path=<path to output file> --get_csv --send_wayback
The yaml_path should point to where your yaml file is located and should look something like ./crawls/collections/NAME_OF_CRAWL/crawls/FILENAME.yaml
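For reference, below is a heavily hedged sketch of pulling failed URLs out of a crawl yaml file. The exact keys depend on your Browsertrix version, so the "failed" and "url" names here are assumptions; adjust them to match your own file.

```python
# Illustrative only: the key names ("failed", "url") are assumptions about the
# crawl YAML layout and may differ across Browsertrix versions.
import yaml

def load_failed_urls(yaml_path):
    with open(yaml_path) as f:
        crawl = yaml.safe_load(f)
    failed = crawl.get("failed", [])  # adjust this key to match your file
    # Entries may be plain URL strings or small dicts with a "url" field.
    return [entry["url"] if isinstance(entry, dict) else entry for entry in failed]
```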
The output_path should point to where you want the output file to be saved; if left empty, it will save to the same directory you are running the script from. Finally, get_csv and send_wayback are optional flags: get_csv will generate a csv file for the wayback gsheets service, and send_wayback will send the links directly to the Wayback Machine.
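As a rough illustration of what the csv output can look like, the sketch below writes one failed URL per row. The actual column layout expected by the gsheets wayback service may differ, so check the script's real output before relying on this.

```python
# Illustrative sketch: write failed URLs to a csv, one URL per row.
# The real script may use a different column layout for the gsheets service.
import csv

def write_failed_csv(urls, output_path="failed_links.csv"):
    with open(output_path, "w", newline="") as f:
        writer = csv.writer(f)
        for url in urls:
            writer.writerow([url])
```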
If you decide to use the send_wayback flag, you'll need an Internet Archive account and, specifically, their S3 configuration. Instructions for accessing your keys are available at https://archive.org/services/docs/api/internetarchive/api.html#ia-s3-configuration. I'm assuming you are storing them as the environment variables INTERNET_ARCHIVE_ACCESS_KEY and INTERNET_ARCHIVE_SECRET_KEY (feel free to edit the script to work with your setup, though).
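For orientation, here is a minimal sketch of what the send_wayback step can look like, using those environment variables with the Wayback Machine's Save Page Now endpoint (https://web.archive.org/save). The actual script (and the scraping code it borrows from) may make the call differently.

```python
# Illustrative sketch: submit a URL to the Wayback Machine's Save Page Now
# endpoint, authenticating with the S3-style keys from the environment.
import os
import requests

def send_to_wayback(url):
    access_key = os.environ["INTERNET_ARCHIVE_ACCESS_KEY"]
    secret_key = os.environ["INTERNET_ARCHIVE_SECRET_KEY"]
    headers = {
        "Accept": "application/json",
        "Authorization": f"LOW {access_key}:{secret_key}",
    }
    return requests.post(
        "https://web.archive.org/save",
        headers=headers,
        data={"url": url},
        timeout=60,
    )
```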
Thanks to Eric Kansa for sharing his scraping code (https://github.com/opencontext/sucho-data-rescue-scrape), parts of which I reused for the calls to the Wayback Machine.