Parameter to filter out redirects from one result methods #173
Comments
I think that's already a feature, via the 'filters' parameter of the WaybackMachineCDXServerAPI class. One example given in the tests file is:
cdx = WaybackMachineCDXServerAPI(
    url="google.com",
    user_agent=user_agent,
    filters=["statuscode:200"],
) |
Yes, but that's using Python and performing multiple steps: preparing, getting the results, and looping through them. The one-line near, oldest, and newest methods are handy for avoiding all that, especially from the CLI. If I'm not mistaken, the filter CLI argument is ignored when one of those methods is used. |
For URLs with a 200 status code only:
Non-200 status code:
|
And how do we detect redirects (status 302 Found) and other statuses? By visiting the archive to actually check, or by just reading the CDX data for the archive? |
By relying on the CDX status code, yes: 200 or not 200. But you are right, the example you gave could do the trick as well. I'd be happy to use that if limiting to one result worked, but it looks like the limit parameter is ignored completely. |
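Reading the status from the CDX record, rather than fetching the archived page, could look roughly like the sketch below. It assumes the default space-separated field order described in the CDX server docs (urlkey, timestamp, original, mimetype, statuscode, digest, length). The names parse_cdx_line and is_ok are hypothetical helpers, not part of waybackpy, and the sample line is made up for illustration.

```python
from typing import NamedTuple


class CdxRecord(NamedTuple):
    # Assumed default field order of the plain-text CDX output.
    urlkey: str
    timestamp: str
    original: str
    mimetype: str
    statuscode: str
    digest: str
    length: str


def parse_cdx_line(line: str) -> CdxRecord:
    """Split one space-separated CDX line into named fields (hypothetical helper)."""
    parts = line.split()
    if len(parts) != len(CdxRecord._fields):
        raise ValueError(f"expected {len(CdxRecord._fields)} fields, got {len(parts)}")
    return CdxRecord(*parts)


def is_ok(record: CdxRecord) -> bool:
    """True only for a plain 200; redirects, errors, and placeholder values are rejected."""
    return record.statuscode == "200"


# Illustrative line only; not copied from a real CDX response.
SAMPLE = "org,archive)/ 20200101000000 http://archive.org/ text/html 200 EXAMPLEDIGEST 1234"

record = parse_cdx_line(SAMPLE)
print(record.timestamp, record.statuscode, is_ok(record))
```

The status is compared as a string because the field is not guaranteed to be numeric in every record.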
The limit is not ignored; it is a CDX API parameter that limits the number of archive records returned per API call when using the non-paginated CDX API. See https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server#query-result-limits. |
Maybe I'm misunderstanding its purpose, but unlike with your example where I get a lot more than the set limit of 10 results, when I call the API directly as in the API docs I only get what I set the limit to: http://web.archive.org/cdx/search/cdx?url=archive.org&limit=2 |
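For comparison, a direct CDX query like the one above can be assembled with the standard library alone. This is a minimal sketch: build_cdx_url is a hypothetical helper, and the only query parameters used (url, limit, filter) are the ones documented for the CDX server. How the server combines limit and filter is not verified here.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def build_cdx_url(url, limit=None, only_200=False):
    """Return a CDX query URL (hypothetical helper; performs no network request)."""
    params = [("url", url)]
    if only_200:
        # CDX filter syntax is field:regex; a leading "!" would negate the match.
        params.append(("filter", "statuscode:200"))
    if limit is not None:
        params.append(("limit", str(limit)))
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


print(build_cdx_url("archive.org", limit=2))
print(build_cdx_url("archive.org", limit=1, only_200=True))
```

Note that urlencode percent-encodes the colon in the filter value.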
Is your feature request related to a problem? Please describe.
The near, oldest, and newest methods return whatever snapshot is available, regardless of its type. This includes redirects, which aren't that useful in a lot of cases.
Describe the solution you'd like
The near, oldest, and newest methods would be a lot more efficient to use if they allowed you to filter out redirects directly with an additional parameter.
The same goes for errors, for that matter: one option to exclude all 30x, 40x, and 50x response codes, leaving only 200s.
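To make the request concrete, here is a rough sketch of what a status-aware "near" lookup could do, written against plain tuples rather than waybackpy objects. The function nearest_with_status and its status parameter are hypothetical and only illustrate the proposed behaviour; the rows are invented sample data.

```python
from datetime import datetime
from typing import Iterable, Optional, Tuple

Row = Tuple[str, str, str]  # (timestamp, statuscode, original URL)


def _parse(ts: str) -> datetime:
    # Wayback timestamps use the 14-digit YYYYMMDDhhmmss form.
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def nearest_with_status(rows: Iterable[Row], target: str, status: str = "200") -> Optional[Row]:
    """Return the row closest in time to target among rows with the wanted status."""
    wanted = _parse(target)
    candidates = [row for row in rows if row[1] == status]
    if not candidates:
        return None
    return min(candidates, key=lambda row: abs(_parse(row[0]) - wanted))


# Invented sample data: the closest snapshot overall is a redirect.
ROWS = [
    ("20200101000000", "302", "http://example.com/"),
    ("20200301000000", "200", "http://example.com/"),
    ("20201201000000", "200", "http://example.com/"),
]

print(nearest_with_status(ROWS, "20200110000000"))
```

With the sample rows, the 302 snapshot from January is skipped and the March snapshot is returned instead; oldest and newest could apply the same status check before picking the first or last row.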