Consider clustering `all.requests` table by `page` or `rank` #263

tunetheweb · 2024-04-09T15:32:29Z

One thing I find really handy for the all.pages table is setting rank = 1000 as a quick way to get results and save costs but still see real data (often the more interesting data too, to be honest!).
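For anyone unfamiliar with the trick, here is a minimal sketch of that kind of rank-limited query, written as a small Python helper that only builds the SQL string. The `httparchive` project name, the `date` filter, the example crawl date and the `client` value are assumptions for illustration and are not taken from this thread.

```python
def pages_sample_query(crawl_date: str, client: str = "mobile", rank: int = 1000) -> str:
    """Build a SQL string that samples all.pages by rank.

    Values are interpolated directly for readability; this is a sketch,
    not a safe way to handle untrusted input.
    """
    return (
        "SELECT page, rank\n"
        "FROM `httparchive.all.pages`\n"   # assumed project name
        f"WHERE date = '{crawl_date}'\n"   # assumed partition column
        f"  AND client = '{client}'\n"
        f"  AND rank = {rank}"
    )


print(pages_sample_query("2024-04-01"))
```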

We can't do that with the all.requests table. We also can't quickly look up the data for a single site, so we can't work around this via the all.pages table either. It would be handy to be able to do either of these by clustering the all.requests table by page or rank.

Now there are a max of 4 clustering columns and we're already using 4 for all.requests:

client
is_root_page
is_main_document
type

These are all useful so we'd need to drop one if we wanted to add a new column.

I think is_main_document is useful, but it can mostly be reproduced with type = 'html' AND is_main_document (not entirely, but in 99.8% of cases, and the most useful ones!), so I'd prefer to replace it with either page or rank. I'm leaning towards page, as we can use that to get rank, but I'm open to ideas.
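A rough sketch of the swap being proposed, in Python, assuming the four-column limit and the current column list quoted above. Keeping the new column in the same position as the dropped one is purely an assumption for illustration; clustering order matters, and the thread does not say where a new column would go.

```python
MAX_CLUSTER_COLUMNS = 4  # limit cited in this thread

# Current clustering columns for all.requests, as listed in this issue.
CURRENT_CLUSTERING = ["client", "is_root_page", "is_main_document", "type"]


def swap_cluster_column(columns, drop, add):
    """Return a new list with `drop` replaced in place by `add`."""
    if drop not in columns:
        raise ValueError(f"{drop!r} is not a clustering column")
    return [add if col == drop else col for col in columns]


def cluster_by_clause(columns):
    """Render a CLUSTER BY clause, enforcing the column limit."""
    if len(columns) > MAX_CLUSTER_COLUMNS:
        raise ValueError(f"at most {MAX_CLUSTER_COLUMNS} clustering columns allowed")
    return "CLUSTER BY " + ", ".join(columns)


# Hypothetical position: `page` takes the slot of `is_main_document`.
proposed = swap_cluster_column(CURRENT_CLUSTERING, "is_main_document", "page")
print(cluster_by_clause(proposed))
```

Simply appending a fifth column to the list makes `cluster_by_clause` raise, which is the constraint that forces the trade-off here.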


tunetheweb · 2024-04-23T17:22:43Z

Or maybe we should have wptid in the requests table to allow joins on that instead of page?

max-ostapenko · 2024-08-13T11:31:51Z

Considering there is currently only 1 column to replace, I'd rather go with the more granular page. It's helpful for various page category analyses and for debugging.
If needed, we could reproduce rank clustering by including only particular rank URLs from a CrUX list.
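One way to read that suggestion, sketched as a Python helper that builds the SQL string: restrict requests to the pages that carry a given rank. Here all.pages stands in for "a CrUX list", and the `httparchive` project name, the `date` filter and the `url` column are assumptions rather than anything confirmed in this thread. The sketch only shows the shape of the filter and makes no claim about bytes billed.

```python
def requests_for_rank_query(crawl_date: str, client: str = "mobile", rank: int = 1000) -> str:
    """Build a SQL string selecting requests whose page has the given rank."""
    # Assumed stand-in for "a CrUX list": pages of that rank from all.pages.
    pages_subquery = (
        "SELECT page FROM `httparchive.all.pages` "
        f"WHERE date = '{crawl_date}' AND client = '{client}' AND rank = {rank}"
    )
    return (
        "SELECT page, url, type\n"
        "FROM `httparchive.all.requests`\n"
        f"WHERE date = '{crawl_date}'\n"
        f"  AND client = '{client}'\n"
        f"  AND page IN ({pages_subquery})"
    )


print(requests_for_rank_query("2024-08-01"))
```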

I think there is a unique wptid for each page value, no? In that case page is the more readable alternative.

max-ostapenko · 2024-09-03T16:43:34Z

Here are queries for two sampled tables, one clustered by rank and one clustered by page:

There is no optimization happening with the page-clustered table.

Here I added a summary column and used a temporary table for a more obvious and fairer bytes comparison.

Same result.

I believe the limitations on how cluster columns can be queried don't allow page to be used flexibly.
And without this, rank is more useful for analysis.

@tunetheweb did you have another case in mind for page clustering?

tunetheweb · 2024-09-04T10:04:25Z

Ah that's disappointing.

There are other benefits to clustering on page for select pages (e.g. "get me all the requests for https://www.example.com" is currently quite expensive, as it requires a full table scan), but given HTTP Archive is mostly about bulk analysis, and that clustering on page doesn't give us rank filtering anyway based on your experiments, I agree rank is more useful.
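To illustrate the single-page case, here is a toy Python model of block pruning. It is not BigQuery's actual storage behaviour: the block size, the synthetic page names and the min/max check are all assumptions made up for the illustration.

```python
def blocks_scanned(rows, key, block_size):
    """Count blocks whose [min, max] range could contain `key`."""
    blocks = [rows[i:i + block_size] for i in range(0, len(rows), block_size)]
    return sum(1 for block in blocks if min(block) <= key <= max(block))


# Synthetic data: 100 hypothetical pages with 10 requests each.
pages = [f"https://site{i:03d}.example/" for i in range(100)]

# Unclustered: each page's requests are interleaved across the whole table.
unclustered = [pages[i % 100] for i in range(1000)]

# Clustered by page: rows for the same page sit next to each other.
clustered = sorted(unclustered)

target = pages[42]
print(blocks_scanned(unclustered, target, block_size=100))
print(blocks_scanned(clustered, target, block_size=100))
```

In this toy setup every unclustered block may contain the target page, while the sorted layout narrows the lookup to a single block.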

max-ostapenko mentioned this issue Sep 3, 2024

Stable all.requests HTTPArchive/dataform#5

Merged

4 tasks

max-ostapenko mentioned this issue Sep 3, 2024

Add rank field to all.requests #189

Closed

max-ostapenko closed this as completed in HTTPArchive/dataform#5 Oct 7, 2024
