ClosedConnectionError & rate limiting #82

Open
jordannickerson opened this issue Aug 26, 2021 · 2 comments · May be fixed by #152

@jordannickerson

I apologize for the slight abuse of the term "Issues", as I don't think the problem I'm encountering is a true issue with your project.

While using wayback, I've run into issues with the connection being closed by the remote host. I've been performing a lot of search requests/pulling mementos, and suspect I'm hitting a rate limit. However, I have put a large delay between queries (5ish seconds).

Is there a best practice on how much we should throttle usage, and are there other things that we should do beyond just looping over all our searches with a time.sleep call to avoid slamming the server?
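
One common complement to a fixed `time.sleep` between queries is to retry a failed call with increasing delays. The sketch below is generic and not part of wayback: `call_with_backoff`, its parameters, and the choice of the built-in `ConnectionError` as the error to retry on are all assumptions for illustration, and the exception actually raised may differ.

```python
import time


def call_with_backoff(func, retries=4, base_delay=1.0,
                      retry_on=(ConnectionError,), sleep=time.sleep):
    """Call func(); on a listed error, wait and retry with doubling delays.

    Hypothetical helper: the names and defaults here are illustrative only.
    """
    for attempt in range(retries + 1):
        try:
            return func()
        except retry_on:
            # Out of attempts: let the last error propagate to the caller.
            if attempt == retries:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
```

It could wrap any single request, for example `call_with_backoff(lambda: do_one_query())`, where `do_one_query` stands in for whatever function performs one search or memento fetch.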

@Mr0grog
Member

Mr0grog commented Aug 31, 2021

No worries! TBH, I’ve lost track of the current rate limits the Wayback Machine imposes, but I think earlier this year it was at 10 requests/second for both CDX search (i.e. WaybackClient.search()) and mementos (WaybackClient.get_memento()).
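
To stay under a limit like that from several threads at once, one option is a small shared throttle that spaces calls out evenly. This is only a sketch: the `Throttle` class is hypothetical and not part of wayback, and the 10 requests/second figure above may be out of date, so the rate is left as a parameter.

```python
import threading
import time


class Throttle:
    """Space calls so that at most `rate` of them start per second.

    Hypothetical helper for illustration; safe to share between threads.
    """

    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        self._interval = 1.0 / rate
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self):
        """Block until the next call is allowed; return the seconds slept."""
        with self._lock:
            now = self._clock()
            delay = max(0.0, self._next_allowed - now)
            # Reserve the following slot before releasing the lock, so
            # concurrent callers each get a distinct start time.
            self._next_allowed = max(now, self._next_allowed) + self._interval
        if delay > 0:
            self._sleep(delay)
        return delay
```

A caller would create one instance, e.g. `throttle = Throttle(rate=5)`, and call `throttle.wait()` immediately before each `WaybackClient.search()` or `WaybackClient.get_memento()` call.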

If you are using multiple threads, you can do some messy stuff to share connections across threads, which has helped us reduce connection errors with Wayback in these code samples:

https://github.com/edgi-govdata-archiving/web-monitoring-processing/blob/12e3c15177807a118e9bb344ed1daedb47a14a30/web_monitoring/cli/cli.py#L211-L252

https://github.com/edgi-govdata-archiving/web-monitoring-processing/blob/12e3c15177807a118e9bb344ed1daedb47a14a30/web_monitoring/cli/cli.py#L473-L485

https://github.com/edgi-govdata-archiving/web-monitoring-processing/blob/12e3c15177807a118e9bb344ed1daedb47a14a30/web_monitoring/cli/cli.py#L320-L323

That’s way over-complicated and I hope to get that functionality built into this package as part of #58.
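
For a rough idea of what reusing connections across threads can look like, here is a generic sketch of a small pool that worker threads borrow clients from. It is not the code from the links above and may differ from it substantially: `ClientPool`, `borrow`, and the `factory` argument are hypothetical names, and the sketch assumes each client object keeps its own connections open between uses.

```python
import queue
from contextlib import contextmanager


class ClientPool:
    """Hold a fixed number of reusable clients for worker threads.

    Hypothetical helper: `factory` is any zero-argument callable that
    builds one client object.
    """

    def __init__(self, factory, size):
        self._clients = queue.Queue()
        for _ in range(size):
            self._clients.put(factory())

    @contextmanager
    def borrow(self):
        # Blocks until a client is free, so a client (and its open
        # connections) is never used by two threads at the same time.
        client = self._clients.get()
        try:
            yield client
        finally:
            # Return the client so another thread can reuse it.
            self._clients.put(client)
```

Each worker would then wrap its requests in `with pool.borrow() as client:`. Keeping the pool small also caps the number of simultaneous requests, which works alongside any per-second throttling.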

You might also find some useful inspiration in other parts of the above script, which we use to pull in ~20 GB of data every night from Wayback. It’s really messy and a bit hard to follow, though. (It has gone through a lot of iterations, but we’ve had limited time to really clean it up over the last few years. It is also what this package was originally extracted from.)

(Sorry about the slow feedback here, @jordannickerson. I’ve been semi-offline for the last couple weeks.)

@Mr0grog Mr0grog added this to the v0.5.0 milestone Nov 10, 2022
@Mr0grog
Member

Mr0grog commented Dec 13, 2023

Quick update: I’m considering this a duplicate of #58, which I am pretty committed to actually solving this month.

@Mr0grog Mr0grog moved this to Prioritized in Wayback Roadmap Dec 13, 2023
@Mr0grog Mr0grog moved this from Prioritized to In Progress in Wayback Roadmap Dec 13, 2023
@Mr0grog Mr0grog linked a pull request Dec 14, 2023 that will close this issue