Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider clustering all.requests table by page or rank #263

Closed
tunetheweb opened this issue Apr 9, 2024 · 4 comments · Fixed by HTTPArchive/dataform#5
Closed

Consider clustering all.requests table by page or rank #263

tunetheweb opened this issue Apr 9, 2024 · 4 comments · Fixed by HTTPArchive/dataform#5

Comments

@tunetheweb
Copy link
Member

tunetheweb commented Apr 9, 2024

One thing I find really handy for the all.pages table is setting rank = 1000 as a quick way to get results and save costs but still see real data (often the more interesting data too, to be honest!).

We can't do that with the all.requests table. We also can't quickly look up the data for a simple site so can't do this via the all.pages table either. It would be handy to be able to do either of these by clustering the all.requests table by page or rank .

Now there are a max of 4 clustering columns and we're already using 4 for all.requests:

  • client
  • is_root_page
  • is_main_document
  • type

These are all useful so we'd need to drop one if we wanted to add a new column.

I think is_main_document is useful, but can mostly be repeated by type='html' AND is_main_document (not entirely but 99.8% of cases and the most useful ones!) so I'd prefer to replace that with either page or rank. I'm thinking page as can use that to get rank, but open to ideas.

@tunetheweb
Copy link
Member Author

Or maybe we should have wptid in requests table top allow joins on that instead of page?

@max-ostapenko
Copy link
Contributor

max-ostapenko commented Aug 13, 2024

Considering there is currently only 1 column to replace, I'd better go with a more granular page. Helpful for various page category analysis and debugging.
If needed, we could reproduce rank clustering by including only particular rank URLs from a CrUX list.

I think there is a unique wptid per each page value, no? In this case page is more readable alternative.

@max-ostapenko
Copy link
Contributor

max-ostapenko commented Sep 3, 2024

Here are queries for two sampled tables clustered by rank and page each:
Screenshot 2024-09-03 at 18 24 58
There is no optimization happening with page-clustered table.

Here I added a summary column and used a temporary table for a more obvious and fairer bytes comparison.
Screenshot 2024-09-03 at 18 30 57
Same result.

I believe the cluster column query limitation doesn't allow to use page flexibly.
And without this, rank is more useful for analysis.

@tunetheweb did you have another case in mind for page clustering?

@tunetheweb
Copy link
Member Author

Ah that's disappointing.

There are pther benefits to clustering on page for select pages (e.g. get me all the requests for https://www.example.com is currently quite expensive as requires a full table load), but given HTTP Archive is mostly about bulk analysis, and that this doesn't also allow rank anyway based on your experiments, I agree rank is more useful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants