Description
While investigating* a high number of origin trials attributed to the google.com origin, I found that we have several thousand pages that immediately redirect to google.com.
For HTTP Archive purposes we don't have any use for additional copies of the google.com page, so these would ideally be discarded. More generally though, if a page redirects, under which conditions should we drop it from the crawl?
There are definitely valid redirects that we should keep, e.g. example.com to www.example.com, or http to https, both of which are same-domain, cross-origin redirects. But I can't think of many good reasons to keep cross-domain redirects, e.g. example.com to google.com.
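To make the distinction concrete, here is a minimal Python sketch of how a redirect could be classified. It is only an illustration, not how the crawl works today: `classify_redirect`, `should_keep`, and `naive_registrable_domain` are hypothetical names, and the registrable-domain logic is a naive "last two labels" assumption. A real implementation would need the Public Suffix List (as `NET.REG_DOMAIN` uses in BigQuery) to handle suffixes like co.uk.

```python
from urllib.parse import urljoin, urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url):
    """Return the (scheme, host, port) tuple that identifies an origin."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port or DEFAULT_PORTS.get(scheme)
    return (scheme, host, port)


def naive_registrable_domain(host):
    # ASSUMPTION: treat the last two labels as the registrable domain.
    # This is wrong for multi-label public suffixes such as co.uk.
    labels = host.rstrip(".").split(".")
    return ".".join(labels[-2:])


def classify_redirect(page_url, location):
    """Classify a redirect from page_url to the Location header value."""
    # Location may be relative, so resolve it against the page URL first.
    target = urljoin(page_url, location)
    src, dst = origin(page_url), origin(target)
    if src == dst:
        return "same-origin"
    if naive_registrable_domain(src[1]) == naive_registrable_domain(dst[1]):
        return "same-domain, cross-origin"
    return "cross-domain"


def should_keep(page_url, location):
    # Hypothetical policy: drop only cross-domain redirects.
    return classify_redirect(page_url, location) != "cross-domain"
```

Under this sketch, example.com to www.example.com and http to https would both be kept, while example.com to google.com would be dropped.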
Here's a query that produces some sample pages and their google.com redirect locations:
SELECT
  url,
  header.value AS redirect_location
FROM
  `httparchive.all.requests`,
  UNNEST(response_headers) AS header
WHERE
  date = '2023-06-01' AND
  client = 'mobile' AND
  index = 1 AND
  LOWER(header.name) = 'location' AND
  NET.REG_DOMAIN(header.value) = 'google.com'
* For more context on the original issue I was investigating, I'm considering an origin trial token to be invalid if the origin assigned to the token (google.com) doesn't match the host page (page). However, several thousand of these pages appear to have the Origin-Trial headers of google.com but their original page value, causing the tokens to be marked invalid.
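
For illustration, the validity check described in this footnote could look roughly like the Python sketch below. It assumes the commonly documented token layout (base64 of a 1-byte version, a 64-byte signature, a 4-byte big-endian payload length, then a JSON payload with an `origin` field) and does not verify the signature or handle subdomain or third-party tokens. `token_payload`, `normalized_origin`, and `token_matches_page` are hypothetical helper names, not part of any HTTP Archive pipeline.

```python
import base64
import json
import struct
from urllib.parse import urlsplit

# ASSUMED layout: 1 version byte + 64 signature bytes + 4 length bytes.
PAYLOAD_LENGTH_OFFSET = 65
PAYLOAD_OFFSET = 69


def token_payload(token_b64):
    """Decode the JSON payload of a token (signature is NOT verified)."""
    raw = base64.b64decode(token_b64)
    (length,) = struct.unpack(">I", raw[PAYLOAD_LENGTH_OFFSET:PAYLOAD_OFFSET])
    payload = raw[PAYLOAD_OFFSET:PAYLOAD_OFFSET + length]
    return json.loads(payload.decode("utf-8"))


def normalized_origin(url):
    """Normalize a URL to scheme://host:port, filling in default ports."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    port = parts.port or {"http": 80, "https": 443}.get(scheme)
    return f"{scheme}://{(parts.hostname or '').lower()}:{port}"


def token_matches_page(token_b64, page_url):
    """True if the token's origin equals the origin of the host page."""
    token_origin = token_payload(token_b64)["origin"]
    return normalized_origin(token_origin) == normalized_origin(page_url)
```

With a check like this, a token issued to google.com and served after a redirect would be compared against the originally requested page and so be reported as invalid, which matches the behavior described above.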