Consider dropping pages with cross-domain redirects #23
Comments
I was under the impression that the CrUX list is built off of the final page
URL.
If that's the case, then any redirects should hopefully be transitional and
be flushed out in the next run. That also means that the origin of the test
URL and the origin of the final URL should match, and if they don't, the page
probably shouldn't be included.
On Tue, Jul 4, 2023 at 7:31 PM Rick Viscomi wrote:
Assigned HTTPArchive/wptagent#23 to @pmeenan.
|
The origins in CrUX only assure us that a sufficient number of real users visited pages under those origins, not necessarily their home pages. Some might not have home pages at all, or they might do some weird geo/auth gating.
|
Looks like this would affect about 250k pages:

SELECT
(NET.REG_DOMAIN(page) = NET.REG_DOMAIN(url)) IS TRUE AS same_domain,
COUNT(0) AS pages
FROM
`httparchive.all.requests`
WHERE
date = '2023-06-01' AND
client = 'mobile' AND
is_main_document
GROUP BY
same_domain
|
This should already be fixed. If the final document origin doesn't match the navigated origin it will fail with a |
While investigating* a high number of origin trials attributed to the google.com origin, I found that we have several thousand pages that immediately redirect to google.com.
For HTTP Archive purposes we don't have any use for additional copies of the google.com page, so these would ideally be discarded. More generally though, if a page redirects, under which conditions should we drop it from the crawl?
There are definitely valid redirects that we should keep, e.g. example.com to www.example.com, or http to https, both of which are examples of same-domain, cross-origin redirects. But I can't think of many good reasons to keep cross-domain redirects, e.g. example.com to google.com.
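To make the distinction concrete, here is a minimal sketch of how a redirect could be sorted into same-origin, same-domain (cross-origin), or cross-domain. It is not the crawler's actual logic. The function names and the tiny `MULTI_LABEL_SUFFIXES` set are assumptions for illustration only; a real implementation would consult the full Public Suffix List (which is what `NET.REG_DOMAIN` does in BigQuery) rather than this stand-in.

```python
from urllib.parse import urlsplit

# Assumption: a deliberately tiny stand-in for the Public Suffix List.
# A real check should use the full list, not this hypothetical set.
MULTI_LABEL_SUFFIXES = {"co.uk", "com.au", "co.jp"}

DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url):
    """Return the (scheme, host, port) triple that defines a URL's origin."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return (parts.scheme, parts.hostname, port)


def registrable_domain(url):
    """Approximate the registrable domain (e.g. example.com) of a URL."""
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    # Keep three labels when the last two form a known multi-label suffix.
    if len(labels) >= 3 and ".".join(labels[-2:]) in MULTI_LABEL_SUFFIXES:
        return ".".join(labels[-3:])
    return ".".join(labels[-2:])


def classify_redirect(page_url, final_url):
    """Classify a redirect from the tested URL to the final document URL."""
    if origin(page_url) == origin(final_url):
        return "same-origin"
    if registrable_domain(page_url) == registrable_domain(final_url):
        return "same-domain"
    return "cross-domain"


def should_keep(page_url, final_url):
    """Proposed rule from this issue: drop only cross-domain redirects."""
    return classify_redirect(page_url, final_url) != "cross-domain"
```

Under this rule, `should_keep("http://example.com/", "https://www.example.com/")` is true, while `should_keep("https://example.com/", "https://www.google.com/")` is false.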
Here's a query that produces some sample pages and their google.com redirect locations:
* For more context on the original issue I was investigating, I'm considering an origin trial token to be invalid if the origin assigned to the token (google.com) doesn't match the host page (page). However, several thousand of these pages appear to have the Origin-Trial headers of google.com but their original page value, causing the tokens to be marked invalid.
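As a rough illustration of why these tokens end up marked invalid, the sketch below compares the host of a token's origin with the host of the page. It does not decode real origin trial tokens. The token origin is passed in as a plain string, and `token_is_valid`, `requested_page`, and `final_document` are hypothetical names chosen for this example.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercased host of a URL, or None if it has none."""
    return urlsplit(url).hostname


def token_is_valid(token_origin, page_url):
    """Simplified rule from the footnote: token host must equal page host."""
    return host_of(token_origin) == host_of(page_url)


# Hypothetical crawl record: the tested page redirected to google.com,
# so the Origin-Trial header was served by google.com.
requested_page = "https://example.com/"
final_document = "https://www.google.com/"
token_origin = "https://www.google.com"

# Checked against the originally requested page, the token looks invalid.
invalid_against_page = not token_is_valid(token_origin, requested_page)

# Checked against the final document URL, the same token matches.
valid_against_final = token_is_valid(token_origin, final_document)
```

With these inputs the token fails the check against the originally requested page and passes it against the final document, which is the mismatch described above.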