The new schema and cost concerns for users

So we have a new schema in the `all` dataset which basically has two tables:
- `pages`
- `requests`

Previously we have had separate schemas for each data type (pages, requests, lighthouse, response bodies, technologies) and also summary tables (summary_pages, summary_requests).

There is a LOT to like about the new schema, including:
- Easier to query across dates without having to use `_TABLE_SUFFIX`
- Do not need to join as everything in one place
- Contains secondary pages

And there are some good cost benefits:
- The `pages` table is partitioned on `date` and clustered on `client`, `is_root_page`, and `rank`. This means, for these columns, you only pay for the rows you query. This saves time and real dollars. In the old schema you paid the full amount for every table you queried even if you were only getting a few rows.
- The `requests` table is partitioned on `date` and clustered on `client`, `is_root_page`, `is_main_document`, and `type`, with same benefits as above.
- You MUST supply a date `WHERE` clause. This prevents querying the whole table, which could be VERY expensive. The query will not run without a date. You can, however, use a date range (e.g. `WHERE date > '1900-01-01`) for trend queries - though you shouldn't do that for really expensive queries.
- A number of pieces of data are available in each of the two tables, avoiding you having to query the expensive `payload` column for common things. This was basically available before with the `summary` tables, but has been greatly enhanced. For pages we now pull our rank, all the previous summary data and custom metrics into their own columns. This can also be enhanced in future to pull out more things (e.g. page-level CrUX data, lighthouse scores, or particular audits).
- For requests we pull out type, summary data as before, and JSON-based request headers and response headers.

So we want to migrate people to the new schema as, in general, it's easier to use, and costs less (in dollars AND time).

However, I have one concern with the new schema and everything being in the one table, as opposed to split out before. This makes the (VERY!) expensive `payload`, `lighthouse` and `reponse_bodies` data much easier to query. I don't think below are good defaults:

![image](https://user-images.githubusercontent.com/10931297/197278233-47ad04eb-005a-49c7-b185-7e8d0a85285e.png)

![image](https://user-images.githubusercontent.com/10931297/197278202-4027e78d-95a4-49bf-b5cf-58fe457f0ff9.png)

Coming from a more traditional RDMS background, it's quite common in my experience to run a `SELECT *` on a table to view it, and I worry BigQuery newbies could do this and end up with HUGE bills. BigQuery does have the `Schema` and `Preview` options on tables and these are much better than using `SELECT *` but, as I say, not everyone is a BigQuery user.

We can (and have) insisted on a partition field (`date`) but BigQuery does not allow us to insist on columns being explicitly given so we cannot prevent people running above sort of `SELECT *` queries.

We have a number of options to address this concern of mine:
- Do nothing. There was always the risk of someone doing a large `SELECT` even in the old schema and in many ways it's better now, even if in some ways maybe it's worse.
- Remove the expensive columns from the `all.pages` and `all.requests` tables and switch back to having separate `payload`, `lighthouse` and `response_bodies` tables. This then means either 1) Extra joins or 2) Duplicating data so loses some of the benefits of the new Schema. It would also be a change to our pipeline.
- Create smaller, summary Views on the `all` tables within the `all` schema (e.g. `all.summary_pages`, `all.summary_requests`), without those expensive columns and encourage their use, over the `all.pages` and `all.requests` tables, especially for beginners. But is this more confusing and will it lead to people writing the same query in two different ways?
- Create smaller, broken up Views in a separate schema (e.g. a `httparchive.query` schema with broken down tables: `httparchive.query.summary_pages`, `httparchive.query.pages`, `httparchive.query.lighthouse`, `httparchive.query.summary_requests`, `httparchive.query.response_bodies`) and encourage their use, with people free to use the `all` schema if they need to, to avoid joins. We could even not bother with the `response_bodies` table in this basic Schema and say that's only for experts. But is this more confusing and will it lead to people writing the same query in two different ways?

The Views can be created once and will automatically update so I don't think maintenance is an issue.

For example, in the `latest` schema we currently have three new view that automatically look at the latest month's data and also look at a subset of the data:

```sql
CREATE OR REPLACE VIEW `httparchive.latest.pages` AS (
  SELECT
    * EXCEPT (lighthouse)
  FROM
    `httparchive.all.pages`
  WHERE
    -- latest date only (this part not shown for brevity)
)
CREATE OR REPLACE VIEW `httparchive.latest.lighthouse` AS (
  SELECT
    date,
    client,
    page,
    is_root_page,
    root_page,
    rank,
    wptid,
    lighthouse AS report,
    metadata
  FROM
    `httparchive.all.pages`
  WHERE
    -- latest date only (this part not shown for brevity)
)
CREATE OR REPLACE VIEW `httparchive.latest.requests` AS (
  SELECT
    * EXCEPT (response_body)
  FROM
    `httparchive.all.requests`
  WHERE
    -- latest date only (this part not shown for brevity)
)
```

I'd be interested to hear views on this (@rviscomi I know you have some as we've discussed), and whether we need to do anything for this?



Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

The new schema and cost concerns for users #149

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

The new schema and cost concerns for users #149

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions