
Split the custom metrics JSON into structured fields #262

Closed
rviscomi opened this issue Apr 9, 2024 · 15 comments · Fixed by HTTPArchive/dataform#8
Comments

rviscomi (Member) commented Apr 9, 2024

The custom_metrics field of the pages table is a JSON blob containing all 50+ custom metrics. Querying ANY custom metric is as expensive as querying ALL custom metrics. As of March 2024, querying over all custom metrics (desktop and mobile, root and secondary pages) processes 7.91 TB and takes about 4 minutes.

The reasoning for having all custom metrics in a big blob as opposed to a well-defined BigQuery struct was to avoid having to change the schema whenever custom metrics were added/removed. This provides simplicity and consistency for queries that process data over many months.

An alternative approach that both reduces query costs and minimizes schema changes would be to extract a few core custom metrics and make them available in a struct of smaller blobs. The core custom metrics would include ones like javascript.js, media.js, and performance.js. As a rule of thumb, custom metrics corresponding to individual chapters in the Web Almanac could be eligible for this core subset. All remaining custom metrics would be made available in a JSON blob named other.

So instead of a single custom_metrics field of type STRING, there would be a custom_metrics field of type STRUCT containing named fields corresponding to the core custom metrics.


As a proof of concept, here's a query that creates a scratchspace table with the performance custom metric extracted into its own field, with everything else in an other field:

CREATE TEMP FUNCTION GET_CUSTOM_METRICS(custom_metrics STRING)
RETURNS STRUCT<performance STRING, other STRING> LANGUAGE js AS '''
  const topLevelMetrics = new Set([
    'performance'
  ]);
  try {
    custom_metrics = JSON.parse(custom_metrics);
  } catch {
    return {};
  }

  if (!custom_metrics) {
    return {};
  }

  const performance = JSON.stringify(custom_metrics.performance);
  delete custom_metrics.performance;

  const other = JSON.stringify(custom_metrics);

  return {
    performance,
    other
  }
''';

CREATE OR REPLACE TABLE `httparchive.scratchspace.custom_metrics_struct`
PARTITION BY date
CLUSTER BY client, is_root_page, rank
AS SELECT
  * EXCEPT (custom_metrics),
  GET_CUSTOM_METRICS(custom_metrics) AS custom_metrics
FROM
  `httparchive.all.pages`
WHERE
  date = '2024-03-01'
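The UDF in the statement above is plain JavaScript, so its split logic can be tried outside BigQuery. Here's a standalone sketch run on a toy blob (the sample metric names and values are invented for illustration):

```javascript
// Standalone version of the GET_CUSTOM_METRICS UDF body above (toy input).
function getCustomMetrics(customMetrics) {
  try {
    customMetrics = JSON.parse(customMetrics);
  } catch {
    return {};
  }
  if (!customMetrics) {
    return {};
  }
  // Pull the core metric out into its own field...
  const performance = JSON.stringify(customMetrics.performance);
  delete customMetrics.performance;
  // ...and keep everything else in one smaller "other" blob.
  return { performance, other: JSON.stringify(customMetrics) };
}

const split = getCustomMetrics(
  '{"performance":{"lcp_elem_stats":{"nodeName":"IMG"}},"media":{"num_video_nodes":1}}'
);
```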

Running an example query over the existing 2024-03-01 dataset processes 4.35 TB in 4 min 33 sec.

Here's the relevant part of that example query showing how it would look using the new schema in the scratchspace table:

WITH lcp_stats AS (
  SELECT
    client,
    isLazyLoaded(JSON_EXTRACT(custom_metrics.performance, '$.lcp_elem_stats.attributes')) AS native_lazy,
    hasLazyHeuristics(JSON_EXTRACT(custom_metrics.performance, '$.lcp_elem_stats.attributes')) AS custom_lazy
  FROM
    `httparchive.scratchspace.custom_metrics_struct`
  WHERE
    date = '2024-03-01' AND
    is_root_page AND
    JSON_EXTRACT_SCALAR(custom_metrics.performance, '$.lcp_elem_stats.nodeName') = 'IMG'
)

This query returns the same result, but only processes 123.78 GB in 53 sec. Or, 3% of the amount of data processed in 19% of the time with no loss of quality.
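The quoted percentages can be sanity-checked with a quick calculation (numbers taken directly from the figures above):

```javascript
// Sanity check of the reported savings, using the figures quoted above.
const beforeGB = 4.35 * 1024;        // 4.35 TB processed by the original query
const afterGB = 123.78;              // GB processed against the struct table
const beforeSec = 4 * 60 + 33;       // 4 min 33 sec
const afterSec = 53;

const dataRatio = afterGB / beforeGB;   // roughly 0.03, i.e. ~3% of the data
const timeRatio = afterSec / beforeSec; // roughly 0.19, i.e. ~19% of the time
```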

rviscomi (Member, Author) commented Apr 9, 2024

I'd consider this, #189 (add rank to the requests table), and #149 (optimizing summary fields) to be the last schema changes before considering the new all dataset relatively stable.

max-ostapenko (Contributor) commented Sep 12, 2024

I see two ways we can optimize processing cost/performance of custom metrics:

  1. split the popular custom metrics,
  2. split the biggest ones.

I'm still looking into the most popular ones, but here are the biggest:

  1. css-variables 38.06%
  2. parsed_css 18.54%
  3. responsive_images 12.95%

Note: we can add a record child column with ALTER, but can't remove it (at least currently).
So we should split only the future-proof metrics as record fields. It's much easier with non-record columns.

Some insights into the metrics sizes

CREATE TEMP FUNCTION calculate_object_size(cm STRING)
RETURNS ARRAY<STRUCT<metric STRING, chars INT64>>
LANGUAGE js AS """
  let result = [];
  cm = JSON.parse(cm);
  for (const property in cm) {
    let length = JSON.stringify(cm[property]).length;
    result.push({ metric: property, chars: length });
  }
  return result;
""";

SELECT
  size.metric,
  APPROX_QUANTILES(size.chars, 2) AS chars_quantiles,
  SUM(size.chars) AS chars_sum,
  COUNT(DISTINCT page) AS cnt_pages
FROM `all.pages` TABLESAMPLE SYSTEM (3 PERCENT),
  UNNEST(calculate_object_size(custom_metrics)) AS size
WHERE date = '2024-07-01'
GROUP BY 1
ORDER BY chars_sum DESC
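The UDF in that query is plain JavaScript and can be tried standalone on a toy blob (the metric names below are invented for illustration):

```javascript
// Standalone version of the calculate_object_size UDF body above.
function calculateObjectSize(cm) {
  const result = [];
  cm = JSON.parse(cm);
  for (const property in cm) {
    // Serialized length in characters is a cheap proxy for storage size.
    result.push({ metric: property, chars: JSON.stringify(cm[property]).length });
  }
  return result;
}

const sizes = calculateObjectSize('{"performance":{"lcp":123},"responsive_images":[1,2,3]}');
```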

Questions:

  • @pmeenan isn't parsed_css split into a separate table? The content seems very similar, though the amount of data is not.
  • @tunetheweb is css-variables related to the parsed_css?

max-ostapenko (Contributor) commented
@tunetheweb I found an issue about the parsed_css cleanup, but was it not completed?

tunetheweb (Member) commented Sep 12, 2024

Hmmm, it appears some of this fails on SAFE.PARSE_JSON(custom_metrics) IS NOT NULL, so it was excluded from the cleanup :-(

Doing this in JavaScript seems to work better from a quick test (I guess it has a better JSON parser than BigQuery) but is an awful lot slower :-(

CREATE TEMP FUNCTION remove_parsed_css(cm STRING)
RETURNS STRING
LANGUAGE js AS """
  try {
    cm = JSON.parse(cm);
    delete cm.parsed_css;
    return JSON.stringify(cm);
  } catch {
    return ''
  }
""";

UPDATE
  `httparchive.all.pages`
SET
  custom_metrics = remove_parsed_css(custom_metrics)
WHERE
  date = '2024-08-01' AND
  custom_metrics LIKE '%parsed_css%' AND
  remove_parsed_css(custom_metrics) != '' AND
  client = 'desktop' AND
  is_root_page
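The remove_parsed_css UDF can likewise be checked as plain JavaScript before running the UPDATE. A quick standalone test on a toy blob:

```javascript
// Standalone version of the remove_parsed_css UDF body above.
function removeParsedCss(cm) {
  try {
    cm = JSON.parse(cm);
    delete cm.parsed_css;
    return JSON.stringify(cm);
  } catch {
    // Signal invalid JSON with an empty string, as the UDF does.
    return '';
  }
}

const cleaned = removeParsedCss('{"performance":{},"parsed_css":{"rules":[]}}');
```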

pmeenan (Member) commented Sep 12, 2024

> @pmeenan isn't parsed_css split into a separate table? The content seems very similar, though the amount of data is not.

parsed_css was removed from the custom metrics at the end of last month and should not be there in September's crawl.

tunetheweb (Member) commented Sep 12, 2024

OK parsed_css all removed now (minus 10 pages or so that have invalid JSON).

Your stats are super interesting!

> @tunetheweb is css-variables related to the parsed_css?

I think so. Maybe it makes sense to move that into the parsed_css table too and have a type column to differentiate them? Since this is > 50% of custom_metrics, removing it would really help slim the blob down to a more reasonable size. Then maybe we could keep the rest in there? Or we could move a few of the chunkier ones (responsive images, almanac, etc.) to their own columns and keep the rest as is.

pmeenan (Member) commented Sep 12, 2024 via email

tunetheweb (Member) commented
True dat! I don't know the CSS ones well enough, but I had a look at the responsive images one, and it doesn't look that unreasonable. But it does store quite a bit of metadata about every image on the page, and for pages with many images that adds up:

            {
                "hasSrc": true,
                "hasAlt": true,
                "isInPicture": false,
                "hasCustomDataAttributes": false,
                "hasWidth": false,
                "hasHeight": false,
                "url": "http://www.turistickelisty.sportovnilisty.cz/img/_antialias_a8a44905-cc85-435b-bdb8-8a333c2a7498_acee3968cd95feb2c683ac90ecad8b1d.png",
                "totalCandidates": 1,
                "altAttribute": "",
                "clientWidth": 950,
                "clientHeight": 200,
                "naturalWidth": 950,
                "naturalHeight": 200,
                "hasSrcset": false,
                "hasSizes": false,
                "currentSrcDensity": 1,
                "approximateResourceWidth": 950,
                "approximateResourceHeight": 200,
                "byteSize": 274637,
                "bitsPerPixel": 11.563663157894737,
                "mimeType": null,
                "computedSizingStyles": {
                    "width": "auto",
                    "height": "auto",
                    "maxWidth": "100%",
                    "maxHeight": "none",
                    "minWidth": "auto",
                    "minHeight": "auto"
                },
                "intrinsicOrExtrinsicSizing": {
                    "width": "both",
                    "height": "intrinsic"
                },
                "reservedLayoutDimensions": false
            },

pmeenan (Member) commented Sep 12, 2024 via email

pmeenan (Member) commented Sep 12, 2024 via email

tunetheweb (Member) commented
> Heck, even shortening the keys (and maybe keeping a comment in the custom metric code) could have a big impact, like "nW" for "naturalWidth", "aW" for "approximateResourceWidth", etc. Though to be backward compatible it would make the queries a bit more complicated.

Yup! JSON is way too chatty. Now that I know how to update JSON, we could look at shortening these down. Let's wait until Web Almanac calms down and do that.
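A minimal sketch of the key-shortening idea; the mapping below is hypothetical, not an agreed scheme:

```javascript
// Hypothetical key map; the real mapping would live next to the custom metric code.
const KEY_MAP = {
  naturalWidth: 'nW',
  naturalHeight: 'nH',
  approximateResourceWidth: 'aW',
  approximateResourceHeight: 'aH'
};

// Replace any mapped key with its short form; leave unmapped keys as-is.
function shortenKeys(obj) {
  return Object.fromEntries(
    Object.entries(obj).map(([key, value]) => [KEY_MAP[key] ?? key, value])
  );
}

const shortened = shortenKeys({ naturalWidth: 950, naturalHeight: 200, hasSrcset: false });
```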

> I'd probably also want to make sure the URLs for things don't include full data URIs. Maybe truncate anything data: to just that (or include a length).

Great shout. I checked and they ARE included. Fix in HTTPArchive/custom-metrics#143
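A sketch of the truncation idea (I haven't verified how the actual fix in that PR is implemented; this form, keeping only the scheme and the original length, is one possibility):

```javascript
// Hypothetical truncation: replace a data: URL with a stub recording its length.
function truncateDataUri(url) {
  if (typeof url === 'string' && url.startsWith('data:')) {
    return `data:[truncated, ${url.length} chars]`;
  }
  return url;
}

const truncated = truncateDataUri('data:image/png;base64,iVBORw0KGgo=');
```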

max-ostapenko (Contributor) commented
> Also worth revisiting the metrics themselves to be certain they need all that data.

Yes, we should be able to identify such raw data and warn when it is being stored in custom_metrics.

@tunetheweb It seems such data URLs are also a reason behind the huge size of the css-variables metric.

Example:

SELECT
  JSON_QUERY(custom_metrics, "$.css-variables.summary.--playkit-icon-quality-HD-active-url.set.0.value")
FROM `httparchive.sample_data.pages_10k`
WHERE
  client = 'mobile' AND
  page = "https://www.morgenpost.de/politik/article407241629/game-changer-fuer-die-ukraine-bahnt-sich-an-putin-droht.html"

max-ostapenko (Contributor) commented
After looking into the frequency of custom_metrics usage, it doesn't seem as though there is much usage outside of the Web Almanac.

These are the most popular as of now:

  1. ecommerce
  2. almanac (Deprecate the almanac custom metric custom-metrics#88)
  3. element_count
  4. markup
  5. wpt_bodies
  6. performance
  7. cookies

Log analytics query:

SELECT	
  metrics,
  --ARRAY_AGG(DISTINCT project IGNORE NULLS) AS projects,
  ARRAY_AGG(DISTINCT principal_email IGNORE NULLS) AS accounts,
  COUNT(DISTINCT insert_id) AS cnt
FROM (
  SELECT
    JSON_VALUE(COALESCE(proto_payload.audit_log.service_data.jobCompletedEvent.job.jobName.projectId, proto_payload.audit_log.service_data.jobInsertResponse.resource.jobName.projectId, proto_payload.audit_log.service_data.jobGetQueryResultsResponse.job.jobName.projectId)) AS project,
    proto_payload.audit_log.authentication_info.principal_email AS principal_email,
    insert_id,
    proto_payload
  FROM `httparchive.global._Default._Default`
  WHERE
    proto_payload.audit_log.service_name="bigquery.googleapis.com"
),
 UNNEST(REGEXP_EXTRACT_ALL(JSON_VALUE(proto_payload.audit_log.service_data.jobInsertRequest.resource.jobConfiguration.query.query), r"\(custom_metrics\,\ '\$\.([a-z0-9_]+)")) AS metrics
WHERE
   REGEXP_CONTAINS(JSON_VALUE(proto_payload.audit_log.service_data.jobInsertRequest.resource.jobConfiguration.query.query), "custom_metrics")
GROUP BY 1
ORDER BY cnt DESC

And we know of external analysts who use custom metrics, but we don't know exactly which metrics:

SELECT	
  project,
  principal_email,
  COUNT(DISTINCT insert_id) AS cnt
FROM (
  SELECT
    COALESCE(JSON_VALUE(COALESCE(proto_payload.audit_log.service_data.jobCompletedEvent.job.jobName.projectId, proto_payload.audit_log.service_data.jobInsertResponse.resource.jobName.projectId, proto_payload.audit_log.service_data.jobGetQueryResultsResponse.job.jobName.projectId)), REGEXP_EXTRACT(JSON_VALUE(proto_payload.audit_log.metadata.tableDataRead.jobName), '^projects/([^/]+)/jobs/')) AS project,
    proto_payload.audit_log.authentication_info.principal_email AS principal_email,
    insert_id,
    proto_payload
  FROM `httparchive.global._Default._Default`
  WHERE
    proto_payload.audit_log.service_name="bigquery.googleapis.com"
)
WHERE
  "custom_metrics" IN UNNEST(JSON_VALUE_ARRAY(proto_payload.audit_log.metadata.tableDataRead.fields)) AND
  project != 'httparchive'
GROUP BY 1,2
ORDER BY cnt DESC

Updated insights spreadsheet
We can get down to 30% of the column size if we review the metrics themselves (~900 GB for is_root_page AND client = 'mobile').

@rviscomi why did you consider javascript, media, and performance as 'core' and ready to be split?

eeeps commented Oct 1, 2024

As long as we're talking about the responsive images metrics specifically: IIRC there is some overlap between Images.js, media.js, responsive-images.js, and img-loading-attr.js. I have thought about trying to consolidate them but was worried about disrupting analysis over time.

max-ostapenko (Contributor) commented
@eeeps if you are working with image metrics data and have enhancement suggestions, now may be a good time to look into them.
We will be processing historical data to update the table schema, so metrics could be consolidated retrospectively.
