Consider dropping pages with cross-domain redirects

While investigating* a high number of origin trials attributed to the google.com origin, I found that we have several thousand pages that immediately redirect to google.com.

For HTTP Archive purposes we don't have any use for additional copies of the google.com page, so these would ideally be discarded. More generally though, if a page redirects, under which conditions should we drop it from the crawl?

There are definitely valid redirects that we should keep, e.g. example.com to www.example.com, or http to https, both of which are examples of same-domain, cross-origin redirects. But I can't think of many good reasons to keep cross-domain redirects, e.g. example.com to google.com.

Here's a query that produces some sample pages and their google.com redirect locations:

```sql
SELECT
  url,
  header.value AS redirect_location
FROM
  `httparchive.all.requests`,
  UNNEST(response_headers) AS header
WHERE
  date = '2023-06-01' AND
  client = 'mobile' AND
  index = 1 AND
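  -- Header names are case-insensitive, so normalize before matching.
  LOWER(header.name) = 'location' AND
  -- Keep only redirects whose target's registrable domain is google.com.
  NET.REG_DOMAIN(header.value) = 'google.com'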
```

url | redirect_location
-- | --
https://crisistuesdayartillery.com/ | https://google.com
https://oautes-tg.su/ | https://google.com
https://borderoffenseantenna.com/ | https://google.com
https://adversespurt.com/ | https://google.com
https://weavelurkwiden.com/ | https://google.com
https://favoritenought.com/ | https://google.com
http://perryvolleyball.com/ | https://google.com
http://tvla.xyz/ | https://www.google.com/
http://depositnostrilverge.com/ | https://google.com
http://bypasseaseboot.com/ | https://google.com
https://mapsplatform.withgoogle.com/ | https://www.google.com/maps/about/#!/
https://ww.resgatabonus.club/ | https://www.google.com/search?q=noticia
https://dtrack.link/ | https://google.com
https://coherencedefinitionupstanding.com/ | https://google.com
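
To look beyond google.com, the same-domain versus cross-domain distinction described above could be approximated by comparing the registrable domain of the page URL with that of the `Location` value. This is only a rough sketch: it reuses the table and filters from the sample query, the `source_domain` and `target_domain` aliases are made up for illustration, and it assumes that relative `Location` values yield `NULL` from `NET.REG_DOMAIN` (worth verifying before relying on it).

```sql
SELECT
  url,
  header.value AS redirect_location,
  NET.REG_DOMAIN(url) AS source_domain,           -- hypothetical alias
  NET.REG_DOMAIN(header.value) AS target_domain   -- hypothetical alias
FROM
  `httparchive.all.requests`,
  UNNEST(response_headers) AS header
WHERE
  -- Same filters as the sample query above.
  date = '2023-06-01' AND
  client = 'mobile' AND
  index = 1 AND
  LOWER(header.name) = 'location' AND
  -- Cross-domain: the redirect target's registrable domain differs from the page's.
  -- example.com -> www.example.com and http -> https share a registrable domain,
  -- so they are not flagged. If either side is NULL (assumed for relative
  -- locations), the comparison is NULL and the row is excluded.
  NET.REG_DOMAIN(header.value) != NET.REG_DOMAIN(url)
```

Rows returned by a query like this would be candidates for dropping. Redirects between different registrable domains owned by the same organization, such as the withgoogle.com to google.com row in the table above, would also be flagged, so the results likely need a manual review before settling on a rule.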

_* For more context on the original issue I was investigating: I consider an origin trial token invalid if the origin assigned to the token (google.com) doesn't match the host page (`page`). However, several thousand of these pages carry google.com's `Origin-Trial` headers while keeping their original `page` value, which causes the tokens to be marked invalid._

