Parameter to filter out redirects from one result methods #173

Open
Forage opened this issue Jul 12, 2022 · 7 comments
Labels
enhancement New feature or request

Comments

@Forage

Forage commented Jul 12, 2022

Is your feature request related to a problem? Please describe.
The near, oldest, and newest methods return a snapshot regardless of its type. That includes redirects, which aren't useful in a lot of cases.

Describe the solution you'd like
The near, oldest, and newest methods would be a lot more efficient to use if they allowed you to filter out redirects directly with an additional parameter.
The same goes for errors, for that matter: a single option could exclude all links with 30x, 40x, and 50x response codes, leaving only those with a 200 code.

Forage added the enhancement (New feature or request) label Jul 12, 2022
@ArztKlein
Contributor

I think that's already a feature: the WaybackMachineCDXServerAPI class accepts a 'filters' parameter.

One example given in the tests file is:

cdx = WaybackMachineCDXServerAPI(
    url="google.com",
    user_agent=user_agent,
    filters=["statuscode:200"],  # keep only snapshots recorded with status 200
)

@Forage
Author

Forage commented Jul 14, 2022

Yes, but that means using Python and performing multiple steps: preparing the query, getting the results, and looping through them. The one-line "near, oldest, newest" methods are handy for avoiding all that, especially through the CLI.

If I'm not mistaken, the filter CLI argument is ignored when using one of those methods.
So either those methods could take the filter into account, which would be the most flexible option, or a quick additional "okonly" argument could be introduced.
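
To make the request concrete, here is a minimal sketch of what an "okonly" lookup could do once CDX rows are available. It is not waybackpy code: the `near_ok` helper, its `ok_only` parameter, and the `(timestamp, statuscode)` row shape are hypothetical names chosen for illustration. The sample rows reuse timestamps and status codes that appear in the CLI output later in this thread.

```python
from datetime import datetime


def parse_ts(ts: str) -> datetime:
    """Parse a 14-digit Wayback timestamp such as 20101010000314."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def near_ok(rows, target_ts, ok_only=True):
    """Return the row closest in time to target_ts, or None if none qualifies.

    rows is an iterable of (timestamp, statuscode) string pairs; this shape
    is an assumption for the sketch, not waybackpy's data model.
    """
    target = parse_ts(target_ts)
    # With ok_only set, drop everything whose CDX status code is not 200.
    candidates = [r for r in rows if not ok_only or r[1] == "200"]
    if not candidates:
        return None
    return min(candidates, key=lambda r: abs(parse_ts(r[0]) - target))


rows = [
    ("20101010000314", "200"),
    ("20101010003320", "301"),
    ("20101010011249", "200"),
]

# The 301 capture is closest in time, but it is skipped when ok_only is set.
print(near_ok(rows, "20101010003000"))
print(near_ok(rows, "20101010003000", ok_only=False))
```

The same selection with `min` replaced by the first or last row of the filtered list would cover "oldest" and "newest".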

@akamhy
Owner

akamhy commented Jul 15, 2022

For archives with a 200 status code only:

akamhy@device:~$ waybackpy --url google.com --user-agent "foobar" --cdx --cdx-filter "statuscode:200" --limit "10" --start-timestamp "20101010" --cdx-print "archiveurl" --cdx-print "statuscode"
200 https://web.archive.org/web/20101010000314/http://www.google.com/
200 https://web.archive.org/web/20101010011249/http://www.google.com/
200 https://web.archive.org/web/20101010042108/http://www.google.com/
200 https://web.archive.org/web/20101010043106/http://www.google.com/
200 https://web.archive.org/web/20101010044436/http://www.google.com/
200 https://web.archive.org/web/20101010053035/http://www.google.com/
200 https://web.archive.org/web/20101010054150/http://www.google.com/
200 https://web.archive.org/web/20101010061344/http://www.google.com/
200 https://web.archive.org/web/20101010063445/http://www.google.com/
200 https://web.archive.org/web/20101010082449/http://www.google.com/
200 https://web.archive.org/web/20101010091719/http://www.google.com/
200 https://web.archive.org/web/20101010091734/http://www.google.com/
200 https://web.archive.org/web/20101010091920/http://www.google.com/
200 https://web.archive.org/web/20101010092939/http://www.google.com/

For archives with a non-200 status code:

akamhy@device:~$ waybackpy --url google.com --user-agent "foobar" --cdx --cdx-filter \!statuscode:200 --limit "10" --start-timestamp "20101010" --cdx-print "archiveurl" --cdx-print "statuscode"
301 https://web.archive.org/web/20101010003320/http://google.com/
301 https://web.archive.org/web/20101010042732/http://google.com/
301 https://web.archive.org/web/20101010101435/http://google.com/
301 https://web.archive.org/web/20101010110520/http://google.com/
301 https://web.archive.org/web/20101010111101/http://google.com/
301 https://web.archive.org/web/20101010162008/http://google.com/
- https://web.archive.org/web/20101011010719/http://google.com/
302 https://web.archive.org/web/20101011031541/http://www.google.com/
301 https://web.archive.org/web/20101011094854/http://google.com/
302 https://web.archive.org/web/20101011103045/http://www.google.com/
302 https://web.archive.org/web/20101011103404/http://www.google.com/
302 https://web.archive.org/web/20101011125706/http://www.google.com/
302 https://web.archive.org/web/20101011130420/http://www.google.com/
302 https://web.archive.org/web/20101011130758/http://www.google.com/
302 https://web.archive.org/web/20101011145009/http://www.google.com/
302 https://web.archive.org/web/20101011150448/http://www.google.com/
301 https://web.archive.org/web/20101012023319/http://google.com/
301 https://web.archive.org/web/20101012043932/http://google.com/
301 https://web.archive.org/web/20101012045200/http://google.com/
301 https://web.archive.org/web/20101012072233/http://google.com/
302 https://web.archive.org/web/20101012080016/http://www.google.com/
302 https://web.archive.org/web/20101012082545/http://www.google.com/
302 https://web.archive.org/web/20101012113351/http://www.google.com/
302 https://web.archive.org/web/20101012114314/http://www.google.com/
302 https://web.archive.org/web/20101012114658/http://www.google.com/
302 https://web.archive.org/web/20101012114803/http://www.google.com/
302 https://web.archive.org/web/20101012115016/http://www.google.com/
302 https://web.archive.org/web/20101012115409/http://www.google.com/
301 https://web.archive.org/web/20101012142403/http://google.com/
302 https://web.archive.org/web/20101012153200/http://www.google.com/
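
The two commands above differ only in the filter expression. As a rough sketch of how the same two queries could be written against the CDX endpoint directly, the snippet below builds the request URLs with the standard library. The `build_cdx_url` helper is a hypothetical name, and the sketch assumes the CDX server's `url`, `filter`, `limit`, and `fl` query parameters; it only constructs URLs and does not send any request.

```python
from urllib.parse import urlencode

# Endpoint as used elsewhere in this thread.
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def build_cdx_url(url, filters=(), limit=None, fields=None):
    """Build a CDX query URL (hypothetical helper, no network access)."""
    params = [("url", url)]
    # One filter parameter per expression; a leading "!" negates the match.
    params += [("filter", f) for f in filters]
    if limit is not None:
        params.append(("limit", str(limit)))
    if fields:
        params.append(("fl", ",".join(fields)))
    return CDX_ENDPOINT + "?" + urlencode(params)


ok_url = build_cdx_url("google.com", filters=["statuscode:200"], limit=10)
not_ok_url = build_cdx_url("google.com", filters=["!statuscode:200"], limit=10)

print(ok_url)
print(not_ok_url)
```

Note that `urlencode` percent-encodes the `:` and `!` characters; the server is expected to decode them back into the same filter expression.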

@akamhy
Owner

akamhy commented Jul 15, 2022

> The near, oldest, and newest methods would be a lot more efficient to use if they allowed you to filter out redirects directly with an additional parameter. The same goes for errors, for that matter: a single option could exclude all links with 30x, 40x, and 50x response codes, leaving only those with a 200 code.

And how do we detect redirects (status 302 Found) and other statuses? By visiting the archive to actually check, or by just reading the CDX data for the archive?

@Forage
Author

Forage commented Jul 15, 2022

> > The near, oldest, and newest methods would be a lot more efficient to use if they allowed you to filter out redirects directly with an additional parameter. The same goes for errors, for that matter: a single option could exclude all links with 30x, 40x, and 50x response codes, leaving only those with a 200 code.
>
> And how do we detect redirects (status 302 Found) and other statuses? By visiting the archive to actually check, or by just reading the CDX data for the archive?

By relying on the CDX status code, yes: 200 or not 200.

But yes, you are right, your example could do the trick as well. I'd be happy to use it if limiting the results to one worked, but it looks like the limit parameter is ignored completely.
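
Relying on the CDX status code alone, as suggested above, could look roughly like the sketch below. The `classify_status` function and its category names are hypothetical; the only input is the status string as printed in the CLI output earlier in this thread, including the `-` value that appears in one row, which is treated here as unknown.

```python
def classify_status(statuscode: str) -> str:
    """Map a CDX status code string to a coarse category (illustrative only)."""
    if not statuscode.isdigit():
        # Covers values such as "-" that carry no numeric status.
        return "unknown"
    code = int(statuscode)
    if code == 200:
        return "ok"
    if 300 <= code < 400:
        return "redirect"
    if 400 <= code < 500:
        return "client_error"
    if 500 <= code < 600:
        return "server_error"
    return "other"


# Status values taken from the non-200 listing in this thread, plus a 200.
for status in ["301", "-", "302", "200"]:
    print(status, classify_status(status))
```

An "okonly" switch would then keep only rows classified as `ok`, with no need to fetch the archived page itself.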

@akamhy
Owner

akamhy commented Jul 15, 2022

The limit is not ignored; it is a CDX API parameter that limits the number of archive records returned per API call when using the non-paginated CDX API. See https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server#query-result-limits for details.

@Forage
Author

Forage commented Jul 15, 2022

Maybe I'm misunderstanding its purpose, but your example returns a lot more than the set limit of 10 results, whereas when I call the API directly as in the API docs I only get as many results as the limit I set: http://web.archive.org/cdx/search/cdx?url=archive.org&limit=2
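
If the limit is applied per API call, as described earlier in the thread, then a cap on the total number of results would have to be enforced on the client side. The sketch below shows one way to do that with `itertools.islice`. It does not use waybackpy: `fake_pages`, `snapshots`, and `take` are hypothetical names, and the grouping of timestamps into two "pages" is stand-in data for illustration.

```python
from itertools import islice


def fake_pages():
    """Stand-in for successive CDX responses (hypothetical grouping)."""
    yield ["20101010000314", "20101010011249"]
    yield ["20101010042108", "20101010043106"]


def snapshots(pages):
    """Flatten per-call result lists into one lazy stream of snapshots."""
    for page in pages:
        yield from page


def take(iterable, n):
    """Return at most n items, stopping as soon as the cap is reached."""
    return list(islice(iterable, n))


# Cap the total at three snapshots, regardless of how many each call returned.
print(take(snapshots(fake_pages()), 3))
# A cap of one gives the "limit to one" behaviour asked about above.
print(take(snapshots(fake_pages()), 1))
```

Because the stream is lazy, later pages are not consumed once the cap has been reached.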

3 participants