Add `rank` field to all.requests #189
Comments
It would also be helpful to run queries on a subset of all requests (to reduce costs) and filter on `rank`.
@pmeenan I was about to JOIN the pages and requests tables for this.
@max-ostapenko once the field exists in the table, the client-side schemas are defined here, and then the value itself would get added here.
Seems straightforward there, and probably better to have it throughout the whole pipeline.
It doesn't look like the table itself has the column yet. I don't think it will be added automatically, and it will probably fail if the column isn't added before the next crawl (and I'm not sure whether the schema is sensitive to field ordering).
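For illustration, one way the column could be added ahead of the crawl is a sketch like the following (assuming BigQuery's `ALTER TABLE … ADD COLUMN` DDL; note that BigQuery appends new columns at the end of the schema, so if anything downstream is sensitive to field ordering this alone may not be sufficient):

```sql
-- Hypothetical sketch: add the rank column to the production table.
-- BigQuery appends new columns at the end of the schema.
ALTER TABLE `httparchive.all.requests`
ADD COLUMN IF NOT EXISTS rank INT64
OPTIONS(description="Site popularity rank, from CrUX");
```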
Do we need any existing data in the `crawl_staging` dataset?
No. Assuming the data has already been processed and is in the `all` dataset, you can drop the table and then just create it `LIKE all.requests`. It gets truncated at the start of each crawl.
BTW, did we want to cluster by the `rank` field? If yes, I'll run the following query to get staging ready for the next crawl:

```sql
DROP TABLE `httparchive.crawl_staging.requests`;

CREATE TABLE `httparchive.crawl_staging.requests`
(
  date DATE NOT NULL OPTIONS(description="YYYY-MM-DD format of the HTTP Archive monthly crawl"),
  client STRING NOT NULL OPTIONS(description="Test environment: desktop or mobile"),
  page STRING NOT NULL OPTIONS(description="The URL of the page being tested"),
  is_root_page BOOL OPTIONS(description="Whether the page is the root of the origin."),
  root_page STRING NOT NULL OPTIONS(description="The URL of the root page being tested"),
  rank INT64 OPTIONS(description="Site popularity rank, from CrUX"),
  url STRING NOT NULL OPTIONS(description="The URL of the request"),
  is_main_document BOOL NOT NULL OPTIONS(description="Whether this request corresponds with the main HTML document of the page, which is the first HTML request after redirects"),
  type STRING OPTIONS(description="Simplified description of the type of resource (script, html, css, text, other, etc)"),
  index INT64 OPTIONS(description="The sequential 0-based index of the request"),
  payload STRING OPTIONS(description="JSON-encoded WebPageTest result data for this request"),
  summary STRING OPTIONS(description="JSON-encoded summarization of request data"),
  request_headers ARRAY<STRUCT<name STRING OPTIONS(description="Request header name"), value STRING OPTIONS(description="Request header value")>> OPTIONS(description="Request headers"),
  response_headers ARRAY<STRUCT<name STRING OPTIONS(description="Response header name"), value STRING OPTIONS(description="Response header value")>> OPTIONS(description="Response headers"),
  response_body STRING OPTIONS(description="Text-based response body")
)
PARTITION BY date
CLUSTER BY client, is_root_page, type, rank
OPTIONS(
  require_partition_filter=true
);
```

P.S. I'm trying to get the […]
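For illustration, a query against a table clustered like this must include the required `date` partition filter, and filtering on `rank` lets BigQuery prune clusters to reduce scanned data (the crawl date and rank threshold below are hypothetical):

```sql
-- Hypothetical example: restrict to a single crawl (satisfying
-- require_partition_filter) and to top-ranked sites, so clustering
-- on rank reduces the amount of data scanned.
SELECT page, url, type
FROM `httparchive.all.requests`
WHERE date = '2024-06-01'  -- hypothetical crawl date
  AND client = 'mobile'
  AND rank <= 10000;       -- hypothetical rank threshold
```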
You shouldn't create the staging tables directly. Make whatever modification is needed to `all.requests` and then:

```sql
DROP TABLE `httparchive.crawl_staging.requests`;
CREATE TABLE `httparchive.crawl_staging.requests` LIKE `httparchive.all.requests`;
```

That will clone the structure of the `all.requests` table and make sure they match.
There is at least one more change pending in […]. And adjusting the staging schema is easier than reprocessing all historical data in the PROD table. So my idea was to get to the desired state in staging first, map it to the production schema within the automated pipeline, and start reprocessing the production table once we have achieved a stable schema in staging. IMHO it's going to be a faster and cheaper way to apply all pending schema changes. See the steps outlined here.
Just following up on this quickly, since the crawl will be kicking off next Tuesday (and the change has already landed in the agent). We either need to update the table before then or revert the agent change.
The staging table and copy queries have been updated. It should be all good as long as there is no automated action that prepares the staging tables (which might also be impacted by the schema change).
Great, thanks. I was mostly worried about the staging table getting the column. I'll keep an eye on the crawl when it starts to make sure data is being written.
I often find myself joining pages and requests tables just to sort by rank. This field is page-specific but it'd be helpful to have this directly available in the requests table.
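The current workaround can be sketched as a join like the following (a hypothetical query, assuming the `rank` column on `httparchive.all.pages` and an example crawl date):

```sql
-- Hypothetical sketch of the current workaround: join requests to pages
-- just to pull in the page-level rank for sorting/filtering.
SELECT r.url, r.type, p.rank
FROM `httparchive.all.requests` AS r
JOIN `httparchive.all.pages` AS p
  ON r.date = p.date
  AND r.client = p.client
  AND r.page = p.page
WHERE r.date = '2024-06-01'  -- hypothetical crawl date
  AND p.date = '2024-06-01'
ORDER BY p.rank;
```

Having `rank` directly on `all.requests` would remove this join entirely.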