Consider dropping pages with cross-domain redirects #23

Closed
@rviscomi

While investigating* a high number of origin trials attributed to the google.com origin, I found that we have several thousand pages that immediately redirect to google.com.

For HTTP Archive purposes we don't have any use for additional copies of the google.com page, so these would ideally be discarded. More generally though, if a page redirects, under which conditions should we drop it from the crawl?

There are definitely valid redirects that we should keep, e.g. example.com to www.example.com, or http to https, both of which are examples of same-domain, cross-origin redirects. But I can't think of many good reasons to keep cross-domain redirects, e.g. example.com to google.com.
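To make the distinction above concrete, here is a minimal Python sketch that sorts a redirect into same-origin, same-domain (cross-origin), or cross-domain. It is not part of the crawl pipeline. The function names and the tiny `MULTI_LABEL_SUFFIXES` set are assumptions made for illustration; a real check would use the full Public Suffix List, which is roughly what `NET.REG_DOMAIN` does in BigQuery.

```python
from urllib.parse import urljoin, urlsplit

# Hypothetical, deliberately tiny stand-in for the Public Suffix List.
MULTI_LABEL_SUFFIXES = {"co.uk", "com.au", "co.jp"}


def registrable_domain(url: str) -> str:
    """Approximate the registrable domain (eTLD+1) of a URL's host."""
    host = (urlsplit(url).hostname or "").rstrip(".")
    labels = host.split(".")
    if len(labels) >= 3 and ".".join(labels[-2:]) in MULTI_LABEL_SUFFIXES:
        return ".".join(labels[-3:])
    return ".".join(labels[-2:])


def origin(url: str) -> tuple:
    """Return (scheme, host, port), filling in the default port."""
    parts = urlsplit(url)
    default_port = {"http": 80, "https": 443}.get(parts.scheme)
    return (parts.scheme, parts.hostname or "", parts.port or default_port)


def classify_redirect(page_url: str, location: str) -> str:
    """Classify a redirect from page_url to the Location header value."""
    # Location may be relative, so resolve it against the page URL first.
    target = urljoin(page_url, location)
    if origin(page_url) == origin(target):
        return "same-origin"
    if registrable_domain(page_url) == registrable_domain(target):
        return "same-domain"
    return "cross-domain"


def should_drop(page_url: str, location: str) -> bool:
    """Apply the policy floated in this issue: drop only cross-domain redirects."""
    return classify_redirect(page_url, location) == "cross-domain"
```

Under these assumptions, http://example.com/ to https://www.example.com/ is classified as same-domain and kept, while https://dtrack.link/ to https://google.com is classified as cross-domain and dropped.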

Here's a query that produces some sample pages and their google.com redirect locations:

SELECT
  url,
  header.value AS redirect_location
FROM
  `httparchive.all.requests`,
  UNNEST(response_headers) AS header
WHERE
  date = '2023-06-01' AND
  client = 'mobile' AND
  index = 1 AND
  LOWER(header.name) = 'location' AND
  NET.REG_DOMAIN(header.value) = 'google.com'

| url | redirect_location |
| --- | --- |
| https://crisistuesdayartillery.com/ | https://google.com |
| https://oautes-tg.su/ | https://google.com |
| https://borderoffenseantenna.com/ | https://google.com |
| https://adversespurt.com/ | https://google.com |
| https://weavelurkwiden.com/ | https://google.com |
| https://favoritenought.com/ | https://google.com |
| http://perryvolleyball.com/ | https://google.com |
| http://tvla.xyz/ | https://www.google.com/ |
| http://depositnostrilverge.com/ | https://google.com |
| http://bypasseaseboot.com/ | https://google.com |
| https://mapsplatform.withgoogle.com/ | https://www.google.com/maps/about/#!/ |
| https://ww.resgatabonus.club/ | https://www.google.com/search?q=noticia |
| https://dtrack.link/ | https://google.com |
| https://coherencedefinitionupstanding.com/ | https://google.com |

* For more context on the original issue I was investigating: I consider an origin trial token invalid if the origin assigned to the token (google.com) doesn't match the host page (page). However, several thousand of these pages appear to carry google.com's Origin-Trial headers while keeping their original page value, which causes the tokens to be marked invalid.
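The mismatch described in this footnote can be sketched in Python as a plain origin comparison. This is only an illustration: the function names are hypothetical, the token origin is assumed to be available as a string such as https://google.com:443, and any other checks that real token validation may involve are left out.

```python
from urllib.parse import urlsplit


def origin_of(url: str) -> tuple:
    """Return (scheme, host, port), filling in the default port."""
    parts = urlsplit(url)
    default_port = {"http": 80, "https": 443}.get(parts.scheme)
    return (parts.scheme, parts.hostname or "", parts.port or default_port)


def token_matches_page(token_origin: str, page_url: str) -> bool:
    """True when the origin assigned to the token equals the page's origin."""
    return origin_of(token_origin) == origin_of(page_url)


# The recorded page value is still the pre-redirect URL, so a google.com
# token fails the comparison even though google.com actually served it.
print(token_matches_page("https://google.com:443", "https://dtrack.link/"))

# Comparing against the post-redirect URL instead makes the origins line up.
print(token_matches_page("https://google.com:443", "https://google.com"))
```

The first call corresponds to the false "invalid token" result that redirected pages produce. The second shows that the same token matches once the final URL is used.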
