Consider dropping pages with cross-domain redirects #23

Closed
rviscomi opened this issue Jul 4, 2023 · 4 comments

rviscomi commented Jul 4, 2023

While investigating* a high number of origin trials attributed to the google.com origin, I found that we have several thousand pages that immediately redirect to google.com.

For HTTP Archive purposes we don't have any use for additional copies of the google.com page, so these would ideally be discarded. More generally though, if a page redirects, under which conditions should we drop it from the crawl?

There are definitely valid redirects that we should keep, e.g. example.com to www.example.com, or http to https, both of which are same-domain, cross-origin redirects. But I can't think of many good reasons to keep cross-domain redirects, e.g. example.com to google.com.

Here's a query that produces some sample pages and their google.com redirect locations:

SELECT
  url,
  header.value AS redirect_location
FROM
  `httparchive.all.requests`,
  UNNEST(response_headers) AS header
WHERE
  date = '2023-06-01' AND
  client = 'mobile' AND
  index = 1 AND
  LOWER(header.name) = 'location' AND
  NET.REG_DOMAIN(header.value) = 'google.com'

url                                         redirect_location
https://crisistuesdayartillery.com/         https://google.com
https://oautes-tg.su/                       https://google.com
https://borderoffenseantenna.com/           https://google.com
https://adversespurt.com/                   https://google.com
https://weavelurkwiden.com/                 https://google.com
https://favoritenought.com/                 https://google.com
http://perryvolleyball.com/                 https://google.com
http://tvla.xyz/                            https://www.google.com/
http://depositnostrilverge.com/             https://google.com
http://bypasseaseboot.com/                  https://google.com
https://mapsplatform.withgoogle.com/        https://www.google.com/maps/about/#!/
https://ww.resgatabonus.club/               https://www.google.com/search?q=noticia
https://dtrack.link/                        https://google.com
https://coherencedefinitionupstanding.com/  https://google.com

* For more context on the original issue I was investigating: I'm considering an origin trial token to be invalid if the origin assigned to the token (google.com) doesn't match the host page (the page value). However, several thousand of these pages appear to carry the Origin-Trial headers of google.com while keeping their original page value, which causes the tokens to be marked invalid.

pmeenan commented Jul 4, 2023 via email

rviscomi commented Jul 4, 2023

The origins in CrUX only assure us that a sufficient number of real users visited pages under those origins, not necessarily their home pages. Some might not have home pages at all, or they might do some weird geo/auth gating.

/ --> /index.html is ok (same-origin)
foo.com --> bar.com is not ok (cross-domain) and hopefully bar.com is in CrUX so we test it anyway
Not so sure about things like example.com --> www.example.com or http to https.
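
The three cases above can be sketched as a small classifier. This is only an illustration, not anything the crawl runs: `origin`, `naive_reg_domain`, and `classify_redirect` are hypothetical names, and the registrable-domain helper is a deliberate simplification (last two host labels) standing in for a Public Suffix List lookup such as the one behind BigQuery's NET.REG_DOMAIN.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url):
    """Return (scheme, host, port), filling in the scheme's default port."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return (parts.scheme, parts.hostname, port)


def naive_reg_domain(url):
    """ASSUMPTION: the last two host labels are the registrable domain.

    This is wrong for suffixes such as co.uk; a real implementation
    needs the Public Suffix List.
    """
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def classify_redirect(start_url, final_url):
    """Label a redirect using the three categories from the comment above."""
    if origin(start_url) == origin(final_url):
        return "same-origin"
    if naive_reg_domain(start_url) == naive_reg_domain(final_url):
        return "same-domain, cross-origin"
    return "cross-domain"


for start, final in [
    ("https://example.com/", "https://example.com/index.html"),
    ("https://example.com/", "https://www.example.com/"),
    ("http://example.com/", "https://example.com/"),
    ("https://foo.com/", "https://bar.com/"),
]:
    print(start, "-->", final, ":", classify_redirect(start, final))
```

Under these assumptions, the two cases marked "not so sure" (adding www, and http to https) both land in the middle bucket, so a drop rule keyed on the registrable domain would keep them while a rule keyed on the origin would not.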

rviscomi commented Jul 5, 2023

Looks like this would affect about 240k pages (242,085 in the 2023-06-01 mobile crawl):

SELECT
  (NET.REG_DOMAIN(page) = NET.REG_DOMAIN(url)) IS TRUE AS same_domain,
  COUNT(0) AS pages
FROM
  `httparchive.all.requests`
WHERE
  date = '2023-06-01' AND
  client = 'mobile' AND
  is_main_document
GROUP BY
  same_domain

same_domain  pages
TRUE         30537873
FALSE        242085
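
To put the FALSE count in proportion, here is a quick calculation using only the two numbers in the result. One caveat that follows from the query itself: because the comparison is wrapped in IS TRUE, any row where NET.REG_DOMAIN returns NULL for either URL is also counted as FALSE, so the figure is an upper bound on true cross-domain redirects.

```python
# Counts copied from the query result in this comment.
same_domain_pages = 30_537_873
cross_domain_pages = 242_085

total_pages = same_domain_pages + cross_domain_pages
share = cross_domain_pages / total_pages

print(f"{cross_domain_pages:,} of {total_pages:,} pages ({share:.2%})")
# prints: 242,085 of 30,779,958 pages (0.79%)
```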

rviscomi assigned rviscomi and unassigned pmeenan on Aug 15, 2023
max-ostapenko transferred this issue from HTTPArchive/data-pipeline on Oct 18, 2024

pmeenan commented Oct 18, 2024

This should already be fixed. If the final document origin doesn't match the navigated origin, the test will fail with an 888 error code.
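
A minimal sketch of the kind of check described here, read literally. It is not the crawler's actual code: `crawl_result_code` and `origin` are hypothetical names, and returning None for "no mismatch" is a placeholder, since the thread only names the 888 code.

```python
from urllib.parse import urlsplit

ORIGIN_MISMATCH_CODE = 888  # the error code quoted in this comment
DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url):
    """Return (scheme, host, port), filling in the scheme's default port."""
    parts = urlsplit(url)
    return (parts.scheme, parts.hostname, parts.port or DEFAULT_PORTS.get(parts.scheme))


def crawl_result_code(navigated_url, final_url):
    """Return 888 when the final origin differs from the navigated one.

    ASSUMPTION: None stands in for "no origin mismatch"; the real
    success value is not given in the thread.
    """
    if origin(navigated_url) != origin(final_url):
        return ORIGIN_MISMATCH_CODE
    return None


print(crawl_result_code("https://dtrack.link/", "https://google.com"))
print(crawl_result_code("https://example.com/", "https://example.com/index.html"))
```

Taken literally, a strict origin comparison would also flag example.com --> www.example.com and http --> https, the cases discussed earlier in the thread. Whether the crawler normalizes those before comparing is not stated here.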

pmeenan closed this as completed on Oct 18, 2024