
Split the custom metrics JSON into structured fields #262

Closed
rviscomi opened this issue Apr 9, 2024 · 15 comments · Fixed by HTTPArchive/dataform#8
Comments

rviscomi (Member) commented Apr 9, 2024

The custom_metrics field of the pages table is a JSON blob containing all 50+ custom metrics. Querying ANY custom metric is as expensive as querying ALL custom metrics. As of March 2024, querying over all custom metrics (desktop and mobile, root and secondary pages) processes 7.91 TB and takes about 4 minutes.

The reasoning for having all custom metrics in a big blob as opposed to a well-defined BigQuery struct was to avoid having to change the schema whenever custom metrics were added/removed. This provides simplicity and consistency for queries that process data over many months.

An alternative approach that both reduces query costs and minimizes schema changes would be to extract a few core custom metrics and make them available in a struct of smaller blobs. The core custom metrics would include ones like javascript.js, media.js, and performance.js. As a rule of thumb, custom metrics corresponding to individual chapters in the Web Almanac could be eligible for this core subset. All remaining custom metrics would be made available in a JSON blob named other.

So instead of a single custom_metrics field of type STRING, there would be a custom_metrics field of type STRUCT containing named fields corresponding to the core custom metrics.


As a proof of concept, here's a query that creates a scratchspace table with the performance custom metric extracted into its own field, with everything else in an other field:

CREATE TEMP FUNCTION GET_CUSTOM_METRICS(custom_metrics STRING)
RETURNS STRUCT<performance STRING, other STRING> LANGUAGE js AS '''
  const topLevelMetrics = new Set([
    'performance'
  ]);
  try {
    custom_metrics = JSON.parse(custom_metrics);
  } catch {
    return {};
  }

  if (!custom_metrics) {
    return {};
  }

  const performance = JSON.stringify(custom_metrics.performance);
  delete custom_metrics.performance;

  const other = JSON.stringify(custom_metrics);

  return {
    performance,
    other
  }
''';

CREATE OR REPLACE TABLE `httparchive.scratchspace.custom_metrics_struct`
PARTITION BY date
CLUSTER BY client, is_root_page, rank
AS SELECT
  * EXCEPT (custom_metrics),
  GET_CUSTOM_METRICS(custom_metrics) AS custom_metrics
FROM
  `httparchive.all.pages`
WHERE
  date = '2024-03-01'
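The UDF in the statement above is plain JavaScript, so its split logic can be tried outside BigQuery. Here's a standalone sketch run on a toy blob (the sample metric names and values are invented for illustration):

```javascript
// Standalone version of the GET_CUSTOM_METRICS UDF body above (toy input).
function getCustomMetrics(customMetrics) {
  try {
    customMetrics = JSON.parse(customMetrics);
  } catch {
    return {};
  }
  if (!customMetrics) {
    return {};
  }
  // Pull the core metric out into its own field...
  const performance = JSON.stringify(customMetrics.performance);
  delete customMetrics.performance;
  // ...and keep everything else in one smaller "other" blob.
  return { performance, other: JSON.stringify(customMetrics) };
}

const split = getCustomMetrics(
  '{"performance":{"lcp_elem_stats":{"nodeName":"IMG"}},"media":{"num_video_nodes":1}}'
);
```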

Running an example query over the existing 2024-03-01 dataset processes 4.35 TB in 4 min 33 sec.

Here's the relevant part of that example query showing how it would look using the new schema in the scratchspace table:

WITH lcp_stats AS (
  SELECT
    client,
    isLazyLoaded(JSON_EXTRACT(custom_metrics.performance, '$.lcp_elem_stats.attributes')) AS native_lazy,
    hasLazyHeuristics(JSON_EXTRACT(custom_metrics.performance, '$.lcp_elem_stats.attributes')) AS custom_lazy
  FROM
    `httparchive.scratchspace.custom_metrics_struct`
  WHERE
    date = '2024-03-01' AND
    is_root_page AND
    JSON_EXTRACT_SCALAR(custom_metrics.performance, '$.lcp_elem_stats.nodeName') = 'IMG'
)

This query returns the same result, but only processes 123.78 GB in 53 sec. Or, 3% of the amount of data processed in 19% of the time with no loss of quality.
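The quoted percentages can be sanity-checked with a quick calculation (numbers taken directly from the figures above):

```javascript
// Sanity check of the reported savings, using the figures quoted above.
const beforeGB = 4.35 * 1024;        // 4.35 TB processed by the original query
const afterGB = 123.78;              // GB processed against the struct table
const beforeSec = 4 * 60 + 33;       // 4 min 33 sec
const afterSec = 53;

const dataRatio = afterGB / beforeGB;   // roughly 0.03, i.e. ~3% of the data
const timeRatio = afterSec / beforeSec; // roughly 0.19, i.e. ~19% of the time
```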

rviscomi (Member, Author) commented Apr 9, 2024

I'd consider this, #189 (add rank to the requests table), and #149 (optimizing summary fields) to be the last schema changes before considering the new all dataset relatively stable.

max-ostapenko (Contributor) commented Sep 12, 2024

I see two ways we can optimize processing cost/performance of custom metrics:

  1. split the popular custom metrics,
  2. split the biggest ones.

I'm still looking into the most popular ones, but here are the biggest:

  1. css-variables 38.06%
  2. parsed_css 18.54%
  3. responsive_images 12.95%

Note: we can add a record child column with ALTER, but can't remove it (at least currently).
So we should split only the future-proof metrics as record fields. It's much easier with non-record columns.

Some insights into the metrics sizes

CREATE TEMP FUNCTION calculate_object_size(cm STRING)
RETURNS ARRAY<STRUCT<metric STRING, chars INT64>>
LANGUAGE js AS """
  let result = [];
  cm = JSON.parse(cm);
  for (const property in cm) {
    let length = JSON.stringify(cm[property]).length;
    result.push({ metric: property, chars: length });
  }
  return result;
""";

SELECT
  size.metric,
  APPROX_QUANTILES(size.chars, 2) AS chars_quantiles,
  SUM(size.chars) AS chars_sum,
  COUNT(DISTINCT page) AS cnt_pages
FROM `all.pages` TABLESAMPLE SYSTEM (3 PERCENT),
  UNNEST(calculate_object_size(custom_metrics)) AS size
WHERE date = '2024-07-01'
GROUP BY 1
ORDER BY chars_sum DESC
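The UDF in that query is plain JavaScript and can be tried standalone on a toy blob (the metric names below are invented for illustration):

```javascript
// Standalone version of the calculate_object_size UDF body above.
function calculateObjectSize(cm) {
  const result = [];
  cm = JSON.parse(cm);
  for (const property in cm) {
    // Serialized length in characters is a cheap proxy for storage size.
    result.push({ metric: property, chars: JSON.stringify(cm[property]).length });
  }
  return result;
}

const sizes = calculateObjectSize('{"performance":{"lcp":123},"responsive_images":[1,2,3]}');
```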

Questions:

  • @pmeenan isn't parsed_css split into a separate table? The content seems very similar, though the amount of data is not.
  • @tunetheweb is css-variables related to the parsed_css?

max-ostapenko (Contributor) commented
@tunetheweb I found an issue about the parsed_css cleanup, but was it not completed?

tunetheweb (Member) commented Sep 12, 2024

Hmmm, it appears some of this fails on SAFE.PARSE_JSON(custom_metrics) IS NOT NULL, so it was excluded from the cleanup :-(

Doing this in JavaScript seems to work better from a quick test (I guess it has a better JSON parser than BigQuery) but is an awful lot slower :-(

CREATE TEMP FUNCTION remove_parsed_css(cm STRING)
RETURNS STRING
LANGUAGE js AS """
  try {
    cm = JSON.parse(cm);
    delete cm.parsed_css;
    return JSON.stringify(cm);
  } catch {
    return ''
  }
""";

UPDATE
  `httparchive.all.pages`
SET
  custom_metrics = remove_parsed_css(custom_metrics)
WHERE
  date = '2024-08-01' AND
  custom_metrics LIKE '%parsed_css%' AND
  remove_parsed_css(custom_metrics) != '' AND
  client = 'desktop' AND
  is_root_page
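The remove_parsed_css UDF can likewise be checked as plain JavaScript before running the UPDATE. A quick standalone test on a toy blob:

```javascript
// Standalone version of the remove_parsed_css UDF body above.
function removeParsedCss(cm) {
  try {
    cm = JSON.parse(cm);
    delete cm.parsed_css;
    return JSON.stringify(cm);
  } catch {
    // Signal invalid JSON with an empty string, as the UDF does.
    return '';
  }
}

const cleaned = removeParsedCss('{"performance":{},"parsed_css":{"rules":[]}}');
```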

pmeenan (Member) commented Sep 12, 2024

> @pmeenan isn't parsed_css split into a separate table? The content seems very similar, though the amount of data is not.

parsed_css was removed from the custom metrics at the end of last month and should not be there in September's crawl.

tunetheweb (Member) commented Sep 12, 2024

OK parsed_css all removed now (minus 10 pages or so that have invalid JSON).

Your stats are super interesting!

> @tunetheweb is css-variables related to the parsed_css?

I think so. Maybe it makes sense to move that into the parsed_css table too and have a type column to differentiate them? Since this is > 50% of custom_metrics, removing it would really help slim the blob down to a more reasonable size. Then maybe we could keep the rest in there? Or we could move a few of the chunkier ones (responsive images, almanac, etc.) to their own columns and keep the rest as is.

pmeenan (Member) commented Sep 12, 2024 via email

tunetheweb (Member) commented
True dat! I don't know the CSS ones well enough, but I had a look at the responsive images one, and it doesn't look that unreasonable. But it does store quite a bit of metadata about every image on the page, and for pages with many images that adds up:

            {
                "hasSrc": true,
                "hasAlt": true,
                "isInPicture": false,
                "hasCustomDataAttributes": false,
                "hasWidth": false,
                "hasHeight": false,
                "url": "http://www.turistickelisty.sportovnilisty.cz/img/_antialias_a8a44905-cc85-435b-bdb8-8a333c2a7498_acee3968cd95feb2c683ac90ecad8b1d.png",
                "totalCandidates": 1,
                "altAttribute": "",
                "clientWidth": 950,
                "clientHeight": 200,
                "naturalWidth": 950,
                "naturalHeight": 200,
                "hasSrcset": false,
                "hasSizes": false,
                "currentSrcDensity": 1,
                "approximateResourceWidth": 950,
                "approximateResourceHeight": 200,
                "byteSize": 274637,
                "bitsPerPixel": 11.563663157894737,
                "mimeType": null,
                "computedSizingStyles": {
                    "width": "auto",
                    "height": "auto",
                    "maxWidth": "100%",
                    "maxHeight": "none",
                    "minWidth": "auto",
                    "minHeight": "auto"
                },
                "intrinsicOrExtrinsicSizing": {
                    "width": "both",
                    "height": "intrinsic"
                },
                "reservedLayoutDimensions": false
            },

pmeenan (Member) commented Sep 12, 2024 via email

pmeenan (Member) commented Sep 12, 2024 via email

tunetheweb (Member) commented
> Heck, even shortening the keys (and maybe keeping a comment in the custom metric code) could have a big impact, like "nW" for "naturalWidth", "aW" for "approximateResourceWidth", etc. Though to be backward compatible it would make the queries a bit more complicated.

Yup! JSON is way too chatty. Now that I know how to update JSON, we could look at shortening these down. Let's wait until Web Almanac calms down and do that.
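A minimal sketch of the key-shortening idea; the mapping below is hypothetical, not an agreed scheme:

```javascript
// Hypothetical key map; the real mapping would live next to the custom metric code.
const KEY_MAP = {
  naturalWidth: 'nW',
  naturalHeight: 'nH',
  approximateResourceWidth: 'aW',
  approximateResourceHeight: 'aH'
};

// Replace any mapped key with its short form; leave unmapped keys as-is.
function shortenKeys(obj) {
  return Object.fromEntries(
    Object.entries(obj).map(([key, value]) => [KEY_MAP[key] ?? key, value])
  );
}

const shortened = shortenKeys({ naturalWidth: 950, naturalHeight: 200, hasSrcset: false });
```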

> I'd probably also want to make sure the URLs for things don't include full data URIs. Maybe truncate anything data: to just that (or include a length).

Great shout. I checked and they ARE included. Fix in HTTPArchive/custom-metrics#143
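A sketch of the truncation idea (I haven't verified how the actual fix in that PR is implemented; this form, keeping only the scheme and the original length, is one possibility):

```javascript
// Hypothetical truncation: replace a data: URL with a stub recording its length.
function truncateDataUri(url) {
  if (typeof url === 'string' && url.startsWith('data:')) {
    return `data:[truncated, ${url.length} chars]`;
  }
  return url;
}

const truncated = truncateDataUri('data:image/png;base64,iVBORw0KGgo=');
```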

max-ostapenko (Contributor) commented
> Also worth revisiting the metrics themselves to be certain they need all that data.

Yes, we should be able to identify such raw data and warn when it is being stored in custom_metrics.

@tunetheweb It seems such data URLs are also a reason behind the huge size of the css-variables metric.

Example:

SELECT
  JSON_QUERY(custom_metrics, "$.css-variables.summary.--playkit-icon-quality-HD-active-url.set.0.value")
FROM `httparchive.sample_data.pages_10k`
WHERE
  client = 'mobile' AND
  page = "https://www.morgenpost.de/politik/article407241629/game-changer-fuer-die-ukraine-bahnt-sich-an-putin-droht.html"

max-ostapenko (Contributor) commented
After looking into the frequency of custom_metrics usage, it doesn't seem as though there is much usage outside of the Web Almanac.

These are the most popular as of now:

  1. ecommerce
  2. almanac (Deprecate the almanac custom metric custom-metrics#88)
  3. element_count
  4. markup
  5. wpt_bodies
  6. performance
  7. cookies

Log analytics query:

SELECT	
  metrics,
  --ARRAY_AGG(DISTINCT project IGNORE NULLS) AS projects,
  ARRAY_AGG(DISTINCT principal_email IGNORE NULLS) AS accounts,
  COUNT(DISTINCT insert_id) AS cnt
FROM (
  SELECT
    JSON_VALUE(COALESCE(proto_payload.audit_log.service_data.jobCompletedEvent.job.jobName.projectId, proto_payload.audit_log.service_data.jobInsertResponse.resource.jobName.projectId, proto_payload.audit_log.service_data.jobGetQueryResultsResponse.job.jobName.projectId)) AS project,
    proto_payload.audit_log.authentication_info.principal_email AS principal_email,
    insert_id,
    proto_payload
  FROM `httparchive.global._Default._Default`
  WHERE
    proto_payload.audit_log.service_name="bigquery.googleapis.com"
),
 UNNEST(REGEXP_EXTRACT_ALL(JSON_VALUE(proto_payload.audit_log.service_data.jobInsertRequest.resource.jobConfiguration.query.query), r"\(custom_metrics\,\ '\$\.([a-z0-9_]+)")) AS metrics
WHERE
   REGEXP_CONTAINS(JSON_VALUE(proto_payload.audit_log.service_data.jobInsertRequest.resource.jobConfiguration.query.query), "custom_metrics")
GROUP BY 1
ORDER BY cnt DESC

And we know of external analysts who use custom metrics, but we don't know exactly which metrics:

SELECT	
  project,
  principal_email,
  COUNT(DISTINCT insert_id) AS cnt
FROM (
  SELECT
    COALESCE(JSON_VALUE(COALESCE(proto_payload.audit_log.service_data.jobCompletedEvent.job.jobName.projectId, proto_payload.audit_log.service_data.jobInsertResponse.resource.jobName.projectId, proto_payload.audit_log.service_data.jobGetQueryResultsResponse.job.jobName.projectId)), REGEXP_EXTRACT(JSON_VALUE(proto_payload.audit_log.metadata.tableDataRead.jobName), '^projects/([^/]+)/jobs/')) AS project,
    proto_payload.audit_log.authentication_info.principal_email AS principal_email,
    insert_id,
    proto_payload
  FROM `httparchive.global._Default._Default`
  WHERE
    proto_payload.audit_log.service_name="bigquery.googleapis.com"
)
WHERE
  "custom_metrics" IN UNNEST(JSON_VALUE_ARRAY(proto_payload.audit_log.metadata.tableDataRead.fields)) AND
  project != 'httparchive'
GROUP BY 1,2
ORDER BY cnt DESC

Updated insights spreadsheet
We can get down to 30% of the column size if we review the metrics themselves (~900 GB for is_root_page AND client = 'mobile').

@rviscomi why did you consider javascript, media, and performance as 'core' and ready to be split?

eeeps commented Oct 1, 2024

As long as we're talking about the responsive images metrics specifically: IIRC there is some overlap between Images.js, media.js, responsive-images.js, and img-loading-attr.js. I have thought about trying to consolidate them but was worried about disrupting analysis over time.

max-ostapenko (Contributor) commented
@eeeps if you are working with image metrics data and have enhancement suggestions, now may be a good time to look into them.
We will be processing historical data to update the table schema, so metrics could be consolidated retrospectively.
