Add rank field to all.requests #189

Closed
rviscomi opened this issue Jun 16, 2023 · 13 comments · Fixed by HTTPArchive/dataform#5
Labels
enhancement New feature or request

Comments

@rviscomi
Member

I often find myself joining pages and requests tables just to sort by rank. This field is page-specific but it'd be helpful to have this directly available in the requests table.

@rviscomi rviscomi added the enhancement New feature or request label Jun 16, 2023
@kevinfarrugia

It would also be helpful to run queries on a subset of all requests (to reduce costs) and filter on rank = 1000. Currently I am joining with httparchive.all.pages and filtering on rank from that table, but costs are much higher.
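For illustration, the two comments above can be sketched as a before/after pair of queries. This is only a sketch: the date value is a placeholder, and the join keys (date, client, page) and the rank column on `httparchive.all.pages` are assumptions about the schema rather than something stated in this thread.

```sql
-- Current workaround: join all.requests to all.pages only to read rank.
-- '2024-09-01' is a placeholder crawl date, not taken from this issue.
SELECT
  r.url,
  p.rank
FROM `httparchive.all.requests` AS r
JOIN `httparchive.all.pages` AS p
  ON r.date = p.date
  AND r.client = p.client
  AND r.page = p.page
WHERE
  r.date = '2024-09-01'
  AND p.date = '2024-09-01'
  AND p.rank = 1000;

-- What this issue asks for: the same filter without the join,
-- assuming all.requests gains a rank column.
SELECT
  url,
  rank
FROM `httparchive.all.requests`
WHERE
  date = '2024-09-01'
  AND rank = 1000;
```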

@max-ostapenko
Contributor

max-ostapenko commented Sep 2, 2024

@pmeenan I was about to JOIN the rank field in SQL when copying from crawl_staging.requests, but maybe you know of a reason why we should do it in wptagent instead?

@max-ostapenko max-ostapenko self-assigned this Sep 2, 2024
@pmeenan
Member

pmeenan commented Sep 2, 2024

@max-ostapenko once the field exists in the table, the client-side schemas are here and then the value itself would get added here

@max-ostapenko
Contributor

max-ostapenko commented Sep 2, 2024

Seems straightforward there, and it's probably better to have it throughout the whole pipeline.
I'll add it to wptagent then.

@pmeenan
Member

pmeenan commented Sep 2, 2024

It doesn't look like the table itself has the column yet. I don't think it will be added automatically and will probably fail if the column isn't added before the next crawl (and I'm not sure if the schema is sensitive to the field ordering)

@max-ostapenko
Contributor

Do we need any of the existing data in the "crawl_staging" dataset?
If not, I'll create a blank requests table with the correct schema.

@pmeenan
Member

pmeenan commented Sep 2, 2024

No. Assuming the data has already been processed and is in the all dataset, you can drop the table and then just recreate it with CREATE TABLE LIKE all.requests. The staging table gets truncated at the start of each crawl.

@max-ostapenko
Contributor

BTW did we want to cluster by the rank field instead of is_main_document? @rviscomi @tunetheweb

If yes, I'll run the following query to get staging ready for the next crawl:

DROP TABLE `httparchive.crawl_staging.requests`;

CREATE TABLE `httparchive.crawl_staging.requests`
(
  date DATE NOT NULL OPTIONS(description="YYYY-MM-DD format of the HTTP Archive monthly crawl"),
  client STRING NOT NULL OPTIONS(description="Test environment: desktop or mobile"),
  page STRING NOT NULL OPTIONS(description="The URL of the page being tested"),
  is_root_page BOOL OPTIONS(description="Whether the page is the root of the origin."),
  root_page STRING NOT NULL OPTIONS(description="The URL of the root page being tested"),
  rank INT64 OPTIONS(description="Site popularity rank, from CrUX"),
  url STRING NOT NULL OPTIONS(description="The URL of the request"),
  is_main_document BOOL NOT NULL OPTIONS(description="Whether this request corresponds with the main HTML document of the page, which is the first HTML request after redirects"),
  type STRING OPTIONS(description="Simplified description of the type of resource (script, html, css, text, other, etc)"),
  index INT64 OPTIONS(description="The sequential 0-based index of the request"),
  payload STRING OPTIONS(description="JSON-encoded WebPageTest result data for this request"),
  summary STRING OPTIONS(description="JSON-encoded summarization of request data"),
  request_headers ARRAY<STRUCT<name STRING OPTIONS(description="Request header name"), value STRING OPTIONS(description="Request header value")>> OPTIONS(description="Request headers"),
  response_headers ARRAY<STRUCT<name STRING OPTIONS(description="Response header name"), value STRING OPTIONS(description="Response header value")>> OPTIONS(description="Response headers"),
  response_body STRING OPTIONS(description="Text-based response body")
)
PARTITION BY date
CLUSTER BY client, is_root_page, type, rank
OPTIONS(
  require_partition_filter=true
);

P.S. I'm trying to get the all dataset to a 'stable' version.

@pmeenan
Member

pmeenan commented Sep 2, 2024

You shouldn't create the staging tables directly. Make whatever modification is needed to all.requests and then:

DROP TABLE `httparchive.crawl_staging.requests`;
CREATE TABLE `httparchive.crawl_staging.requests` LIKE `httparchive.all.requests`;

That will clone the structure of the all requests table and make sure they match.
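As a rough sketch of that sequence, one way to "make whatever modification is needed" is an in-place `ALTER TABLE ... ADD COLUMN` on the production table, followed by the two statements above. This is an assumption for illustration, not necessarily what the maintainers ran. Note that `ADD COLUMN` appends the new column at the end of the schema and leaves the clustering unchanged, so it would not by itself produce the column position or the `CLUSTER BY` clause proposed earlier in the thread.

```sql
-- Assumption: rank is added to the production table in place.
-- The description text is copied from the staging schema proposed above.
ALTER TABLE `httparchive.all.requests`
ADD COLUMN IF NOT EXISTS rank INT64 OPTIONS(description="Site popularity rank, from CrUX");

-- Then recreate staging so that its structure matches production.
DROP TABLE `httparchive.crawl_staging.requests`;
CREATE TABLE `httparchive.crawl_staging.requests` LIKE `httparchive.all.requests`;
```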

@max-ostapenko
Contributor

There is at least one more change pending in all.requests - #263.

And adjusting the staging schema is easier than reprocessing all historical data in the PROD table.

So my idea was to get to the desired state in staging first and map it to the production schema within the automated pipeline, then start reprocessing the production table once we have a stable schema in staging.

IMHO it's going to be a faster and cheaper way to apply all pending schema changes to all.
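A minimal sketch of what mapping staging to the production schema could look like while the two schemas differ. It assumes that `all.requests` does not yet have the rank column and that the remaining column names and types match the staging schema listed above; the date is a placeholder. The actual copy queries are not shown in this thread.

```sql
-- Copy one crawl from staging into production with an explicit column list,
-- so a column that exists only in staging (rank here) is simply left out.
INSERT INTO `httparchive.all.requests` (
  date, client, page, is_root_page, root_page, url, is_main_document,
  type, index, payload, summary, request_headers, response_headers, response_body
)
SELECT
  date, client, page, is_root_page, root_page, url, is_main_document,
  type, index, payload, summary, request_headers, response_headers, response_body
FROM `httparchive.crawl_staging.requests`
WHERE date = '2024-09-01';  -- placeholder; staging requires a partition filter
```

Naming the columns explicitly on both sides, rather than using `SELECT *`, keeps the copy independent of column order and of staging-only columns.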

See steps outline here.

@pmeenan
Member

pmeenan commented Sep 5, 2024

Just following up on this quickly since the crawl will be kicking off next Tuesday (and the change has already landed in the agent). We either need to update the table before then or revert the agent change

@max-ostapenko
Contributor

max-ostapenko commented Sep 6, 2024

The staging table and copy queries have been updated.
We can maintain different schemas while the reprocessing of all.requests is in progress.

It should be all good as long as there is no automated action that prepares the staging tables (that may also be impacted by the schema change).

@pmeenan
Member

pmeenan commented Sep 6, 2024

Great, thanks. I was mostly worried about the staging table getting the column. I'll keep an eye on the crawl when it starts to make sure data is being written.
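One possible way to check that the new column is being written once the crawl starts is a per-client count of staged requests with a non-null rank. This is a sketch, not the check that was actually used; the date is a placeholder for the crawl's partition.

```sql
-- Count staged requests per client and how many of them carry a rank.
SELECT
  client,
  COUNT(*) AS requests,
  COUNTIF(rank IS NOT NULL) AS requests_with_rank
FROM `httparchive.crawl_staging.requests`
WHERE date = '2024-09-01'  -- placeholder crawl date
GROUP BY client
ORDER BY client;
```

Because the query references only the date, client, and rank columns, it avoids reading the large payload and response_body columns.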
