Split the custom metrics JSON into structured fields #262
Comments
I see two ways we can optimize processing cost/performance of custom metrics:
I'm still looking into the most popular ones, but here are the biggest:
Note: we can add a record child column with …

Some insights into the metric sizes:

```sql
CREATE TEMP FUNCTION calculate_object_size(cm STRING)
RETURNS ARRAY<STRUCT<metric STRING, chars INT64>>
LANGUAGE js AS """
  let result = [];
  cm = JSON.parse(cm);
  for (const property in cm) {
    let length = JSON.stringify(cm[property]).length;
    result.push({ metric: property, chars: length });
  }
  return result;
""";

SELECT
  size.metric,
  APPROX_QUANTILES(size.chars, 2) AS chars_quantiles,
  SUM(size.chars) AS chars_sum,
  COUNT(DISTINCT page) AS cnt_pages
FROM `all.pages` TABLESAMPLE SYSTEM (3 PERCENT),
  UNNEST(calculate_object_size(custom_metrics)) AS size
WHERE date = '2024-07-01'
GROUP BY 1
ORDER BY chars_sum DESC
```

Questions:
|
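For reference, the per-metric sizing logic inside the `calculate_object_size` UDF above can be exercised outside BigQuery. A minimal Node.js sketch (the sample payload below is made up for illustration, not real HTTP Archive data):

```javascript
// Same logic as the BigQuery JS UDF: for each top-level custom metric,
// measure how many characters its JSON serialization occupies.
function calculateObjectSize(cm) {
  const result = [];
  const parsed = JSON.parse(cm);
  for (const property in parsed) {
    result.push({ metric: property, chars: JSON.stringify(parsed[property]).length });
  }
  return result;
}

// Illustrative payload only.
const sample = JSON.stringify({ performance: { lcp_elem_stats: { nodeName: 'IMG' } }, ejs: [] });
console.log(calculateObjectSize(sample));
// one { metric, chars } row per top-level key
```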
@tunetheweb Found an issue about |
Hmmm, it appears some of this fails on …

Doing this in JavaScript seems to work better from a quick test (I guess it has a better JSON parser than BigQuery) but is an awful lot slower :-(

```sql
CREATE TEMP FUNCTION remove_parsed_css(cm STRING)
RETURNS STRING
LANGUAGE js AS """
  try {
    cm = JSON.parse(cm);
    delete cm.parsed_css;
    return JSON.stringify(cm);
  } catch {
    return '';
  }
""";

UPDATE
  `httparchive.all.pages`
SET
  custom_metrics = remove_parsed_css(custom_metrics)
WHERE
  date = '2024-08-01' AND
  custom_metrics LIKE '%parsed_css%' AND
  remove_parsed_css(custom_metrics) != '' AND
  client = 'desktop' AND
  is_root_page
```
|
OK parsed_css all removed now (minus 10 pages or so that have invalid JSON). Your stats are super interesting!

> @tunetheweb is css-variables related to the parsed_css?

I think so. Maybe it makes sense to move that into the parsed_css table too and have a type column to differentiate them? The fact this is > 50% of the custom_metric would really help slim the custom_metrics down to a more reasonable size. Then maybe we could keep the rest in there? Or could move a few of the chunkier ones (responsive images, almanac...etc) to their own column and keep the rest as is. |
Also worth revisiting the metrics themselves to be certain they need all
that data.
|
True dat! I don't know the CSS ones well enough but had a look at the responsive images one, and it doesn't look that unreasonable. But it does store quite a bit of metadata about every image on the page, and for pages with many images that adds up!:

```json
{
  "hasSrc": true,
  "hasAlt": true,
  "isInPicture": false,
  "hasCustomDataAttributes": false,
  "hasWidth": false,
  "hasHeight": false,
  "url": "http://www.turistickelisty.sportovnilisty.cz/img/_antialias_a8a44905-cc85-435b-bdb8-8a333c2a7498_acee3968cd95feb2c683ac90ecad8b1d.png",
  "totalCandidates": 1,
  "altAttribute": "",
  "clientWidth": 950,
  "clientHeight": 200,
  "naturalWidth": 950,
  "naturalHeight": 200,
  "hasSrcset": false,
  "hasSizes": false,
  "currentSrcDensity": 1,
  "approximateResourceWidth": 950,
  "approximateResourceHeight": 200,
  "byteSize": 274637,
  "bitsPerPixel": 11.563663157894737,
  "mimeType": null,
  "computedSizingStyles": {
    "width": "auto",
    "height": "auto",
    "maxWidth": "100%",
    "maxHeight": "none",
    "minWidth": "auto",
    "minHeight": "auto"
  },
  "intrinsicOrExtrinsicSizing": {
    "width": "both",
    "height": "intrinsic"
  },
  "reservedLayoutDimensions": false
}
```
|
Heck, even shortening the keys (and maybe keeping a comment in the custom metric code) could have a big impact, like "nW" for "naturalWidth", "aW" for "approximateResourceWidth", etc. Though to be backward compatible it would make the queries a bit more complicated.
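A hedged sketch of that key-shortening idea; the mapping below is illustrative, not an agreed scheme (only "nW" and "aW" come from the comment above):

```javascript
// Illustrative mapping from verbose keys to short ones; the real mapping
// would live as a comment alongside the custom metric code.
const KEY_MAP = {
  naturalWidth: 'nW',
  naturalHeight: 'nH',
  approximateResourceWidth: 'aW',
  approximateResourceHeight: 'aH',
  clientWidth: 'cW',
  clientHeight: 'cH',
};

// Recursively rename keys; unknown keys pass through unchanged,
// so metrics outside the map are unaffected.
function shortenKeys(value) {
  if (Array.isArray(value)) return value.map(shortenKeys);
  if (value !== null && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value).map(([k, v]) => [KEY_MAP[k] ?? k, shortenKeys(v)])
    );
  }
  return value;
}

const img = { naturalWidth: 950, naturalHeight: 200, hasSrcset: false };
console.log(JSON.stringify(shortenKeys(img)));
// {"nW":950,"nH":200,"hasSrcset":false}
```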
|
I'd probably also want to make sure the URLs for things don't include full data URIs; maybe truncate anything `data:` to just that (or include a length).
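A minimal sketch of that truncation idea; the function name and exact placeholder format are my own, for illustration:

```javascript
// Replace inline data: URIs with a short placeholder that keeps the MIME
// header and records the original length, so size analysis stays possible.
function truncateDataUri(url) {
  if (typeof url !== 'string' || !url.startsWith('data:')) return url;
  // Keep "data:<mime>;base64," (fall back to the first 40 chars if no comma).
  const header = url.slice(0, url.indexOf(',') + 1 || 40);
  return `${header}[truncated, length=${url.length}]`;
}

console.log(truncateDataUri('data:image/png;base64,iVBORw0KGgoAAAANSUhEUg...'));
// data:image/png;base64,[truncated, length=47]
console.log(truncateDataUri('https://example.com/a.png')); // unchanged
```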
|
Yup! JSON is way too chatty. Now that I know how to update JSON, we could look at shortening these down. Let's wait until the Web Almanac calms down and do that.
Great shout. I checked and they ARE included. Fix in HTTPArchive/custom-metrics#143 |
Yes, we should be able to identify and warn when storing such raw data to custom_metrics.

@tunetheweb Seems such data URLs are also a reason behind the huge size of … Example:

```sql
SELECT
  JSON_QUERY(custom_metrics, "$.css-variables.summary.--playkit-icon-quality-HD-active-url.set.0.value")
FROM `httparchive.sample_data.pages_10k`
WHERE
  client = 'mobile' AND
  page = "https://www.morgenpost.de/politik/article407241629/game-changer-fuer-die-ukraine-bahnt-sich-an-putin-droht.html"
```
|
After looking into frequency of … These are the most popular as of now:

Log analytics query:

```sql
SELECT
  metrics,
  --ARRAY_AGG(DISTINCT project IGNORE NULLS) AS projects,
  ARRAY_AGG(DISTINCT principal_email IGNORE NULLS) AS accounts,
  COUNT(DISTINCT insert_id) AS cnt
FROM (
  SELECT
    JSON_VALUE(COALESCE(proto_payload.audit_log.service_data.jobCompletedEvent.job.jobName.projectId, proto_payload.audit_log.service_data.jobInsertResponse.resource.jobName.projectId, proto_payload.audit_log.service_data.jobGetQueryResultsResponse.job.jobName.projectId)) AS project,
    proto_payload.audit_log.authentication_info.principal_email AS principal_email,
    insert_id,
    proto_payload
  FROM `httparchive.global._Default._Default`
  WHERE
    proto_payload.audit_log.service_name = "bigquery.googleapis.com"
),
UNNEST(REGEXP_EXTRACT_ALL(JSON_VALUE(proto_payload.audit_log.service_data.jobInsertRequest.resource.jobConfiguration.query.query), r"\(custom_metrics\,\ '\$\.([a-z0-9_]+)")) AS metrics
WHERE
  REGEXP_CONTAINS(JSON_VALUE(proto_payload.audit_log.service_data.jobInsertRequest.resource.jobConfiguration.query.query), "custom_metrics")
GROUP BY 1
ORDER BY cnt DESC
```

And we know external analysts who use custom metrics, but don't know which metrics exactly:

```sql
SELECT
  project,
  principal_email,
  COUNT(DISTINCT insert_id) AS cnt
FROM (
  SELECT
    COALESCE(JSON_VALUE(COALESCE(proto_payload.audit_log.service_data.jobCompletedEvent.job.jobName.projectId, proto_payload.audit_log.service_data.jobInsertResponse.resource.jobName.projectId, proto_payload.audit_log.service_data.jobGetQueryResultsResponse.job.jobName.projectId)), REGEXP_EXTRACT(JSON_VALUE(proto_payload.audit_log.metadata.tableDataRead.jobName), '^projects/([^/]+)/jobs/')) AS project,
    proto_payload.audit_log.authentication_info.principal_email AS principal_email,
    insert_id,
    proto_payload
  FROM `httparchive.global._Default._Default`
  WHERE
    proto_payload.audit_log.service_name = "bigquery.googleapis.com"
)
WHERE
  "custom_metrics" IN UNNEST(JSON_VALUE_ARRAY(proto_payload.audit_log.metadata.tableDataRead.fields)) AND
  project != 'httparchive'
GROUP BY 1,2
ORDER BY cnt DESC
```

Updated insights spreadsheet

@rviscomi why did you consider javascript, media, and performance as 'core' and ready to be split? |
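The metric-name extraction in the log analytics query above boils down to a regex over the query text. A quick JavaScript equivalent of that `REGEXP_EXTRACT_ALL` pattern (the sample query string is made up):

```javascript
// Mirror of the pattern r"\(custom_metrics\,\ '\$\.([a-z0-9_]+)": it matches
// calls like FN(custom_metrics, '$.media...') and captures the metric name.
const METRIC_RE = /\(custom_metrics, '\$\.([a-z0-9_]+)/g;

function extractMetrics(query) {
  return [...query.matchAll(METRIC_RE)].map(m => m[1]);
}

const sampleQuery = `
  SELECT JSON_QUERY(custom_metrics, '$.media.num_video_nodes'),
         JSON_QUERY(custom_metrics, '$.javascript.ajax_requests')
  FROM pages`;
console.log(extractMetrics(sampleQuery)); // [ 'media', 'javascript' ]
```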
As long as we're talking about the responsive images metrics specifically - IIRC there is some overlap between Images.js, media.js, responsive-images.js, and img-loading-atrr.js. I have thought about trying to consolidate but was worried about disrupting analysis over time. |
@eeeps if you are working with images metrics data and have enhancement suggestions - it may be a good time to look at it now. |
The `custom_metrics` field of the `pages` table is a JSON blob containing all 50+ custom metrics. Querying ANY custom metric is as expensive as querying ALL custom metrics. As of March 2024, querying over all custom metrics (desktop and mobile, root and secondary pages) processes 7.91 TB and takes about 4 minutes.

The reasoning for having all custom metrics in a big blob, as opposed to a well-defined BigQuery struct, was to avoid having to change the schema whenever custom metrics were added/removed. This provides simplicity and consistency for queries that process data over many months.

An alternative approach that both reduces query costs and minimizes schema changes would be to extract a few core custom metrics and make them available in a struct of smaller blobs. The core custom metrics would include ones like javascript.js, media.js, and performance.js. As a rule of thumb, custom metrics corresponding to individual chapters in the Web Almanac could be eligible for this core subset. All remaining custom metrics would be made available in a JSON blob named `other`.

So instead of a single `custom_metrics` field of type `STRING`, there would be a `custom_metrics` field of type `STRUCT` containing named fields corresponding to the core custom metrics.
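The proposed split can be sketched in plain JavaScript. This is a minimal illustration, not the actual pipeline code; the core-metric list simply reuses the three candidates named above:

```javascript
// Split one custom_metrics blob into named core fields plus an `other` blob,
// mirroring the proposed STRUCT schema. CORE_METRICS is illustrative.
const CORE_METRICS = ['javascript', 'media', 'performance'];

function splitCustomMetrics(cm) {
  const parsed = JSON.parse(cm);
  const row = { other: {} };
  for (const [metric, value] of Object.entries(parsed)) {
    if (CORE_METRICS.includes(metric)) {
      row[metric] = JSON.stringify(value);  // one small blob per core metric
    } else {
      row.other[metric] = value;            // everything else stays together
    }
  }
  row.other = JSON.stringify(row.other);
  return row;
}

const row = splitCustomMetrics('{"performance":{"lcp":1200},"ecommerce":{}}');
console.log(row.performance); // {"lcp":1200}
console.log(row.other);       // {"ecommerce":{}}
```

Queries touching only one core metric would then scan just that column, which is where the cost saving comes from.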
containing named fields corresponding to the core custom metrics.As a proof of concept, here's a query that creates a scratchspace table with the
performance
custom metric extracted into its own field, with everything else in another
field:Running an example query over the existing 2024-03-01 dataset processes 4.35 TB in 4 min 33 sec.
Here's the relevant part of that example query showing how it would look using the new schema in the scratchspace table:
This query returns the same result, but only processes 123.78 GB in 53 sec. Or, 3% of the amount of data processed in 19% of the time with no loss of quality.