Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 5 additions & 2 deletions skills/cloud/bigquery-ai-ml/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,14 +5,14 @@ metadata:
description: >-
Leverages BigQuery's built-in machine learning and GenAI capabilities
for advanced data analytics. Use when you need to write SQL queries
that perform time-series forecasting, detect outliers, or leverage generative AI capabilities in BigQuery.
that perform time-series forecasting, detect outliers, find key drivers, or leverage generative AI capabilities in BigQuery.
---

# BigQuery AI & ML

BigQuery integrates with Vertex AI to provide powerful machine learning and
generative AI capabilities directly within SQL queries using built-in functions
like `AI.FORECAST`, `AI.DETECT_ANOMALIES`, and `AI.GENERATE`.
like `AI.FORECAST`, `AI.KEY_DRIVERS`, `AI.DETECT_ANOMALIES`, and `AI.GENERATE`.

## Reference Directory

Expand All @@ -25,6 +25,9 @@ like `AI.FORECAST`, `AI.DETECT_ANOMALIES`, and `AI.GENERATE`.
- [AI Generate](references/ai_generate.md): General-purpose text and
content generation using Gemini models.

- [AI Key Drivers](references/ai_key_drivers.md): Automatically identify
dimensional segments most responsible for driving changes in a metric.

## Related Skills

- [BigQuery Basics Skill](../bigquery-basics):
Expand Down
97 changes: 97 additions & 0 deletions skills/cloud/bigquery-ai-ml/references/ai_key_drivers.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,97 @@
# BigQuery AI.Key_Drivers

`AI.KEY_DRIVERS` automatically identifies the key dimensional segments most
responsible for driving changes in a specified metric between a defined interest
group and a reference group.

## Syntax Reference

```sql
SELECT
*
FROM
AI.KEY_DRIVERS(
{ TABLE TABLE | (QUERY_STATEMENT) },
metric_col => 'METRIC_COL',
dimension_cols => DIMENSION_COLS,
interest_label_col => 'INTEREST_LABEL_COL'
[, min_apriori_support => MIN_APRIORI_SUPPORT]
[, top_k => TOP_K]
[, enable_pruning => ENABLE_PRUNING]
)
```

### Input Arguments

Argument | Requirement | Type | Description
:------------------------ | :----------- | :-------------- | :----------
**`input_data`** | **Required** | | The source table or subquery containing the data to analyze.
**`metric_col`** | **Required** | `STRING` | Metric column name. Must be of type: INT64, NUMERIC, BIGNUMERIC, or FLOAT64.
**`interest_label_col`** | **Required** | `STRING` | Boolean column name: `TRUE` for interest group, `FALSE` for reference group.
**`dimension_cols`** | **Required** | `ARRAY<STRING>` | 1-12 dimension columns (INT64, BOOL, STRING); cannot be `metric_col` or `interest_label_col`.
**`min_apriori_support`** | Optional | `FLOAT64` | Minimum apriori support threshold [0, 1] for output segments. Default: 0.1. Cannot be used with `top_k`.
**`top_k`** | Optional | `INT64` | Return top k insights [1, 1M] by apriori support. If unset, uses `min_apriori_support=0.1`. Cannot be used with `min_apriori_support`.
**`enable_pruning`** | Optional | `BOOL` | If `TRUE` (default), redundant insights are pruned. If `FALSE`, all insights meeting thresholds are returned. Two segments are redundant if two conditions are met: 1) their metric values are equal 2) The dimensions and corresponding values of one row are a subset of the dimensions and corresponding values of the other. In this case, the row with more dimensions (the more descriptive row) is kept.

### Output Schema

Returns a `STRUCT` with the following fields:

Column Name | Type | Description
:----------------------------------- | :-------------- | :----------
**`drivers`** | `ARRAY<STRING>` | Provides a list of drivers, or dimension values of interest, which describes each of the segments.
**`metric_interest`** | `NUMERIC` | The sum of the metric_column for the data in the interest segment.
**`metric_reference`** | `NUMERIC` | The sum of the metric_column for data in the reference segment.
**`difference`** | `NUMERIC` | The difference between the interest and reference metric values for a segment.
**`relative_difference`** | `NUMERIC` | The relative change of a segment, calculated as the difference divided by the reference metric value.
**`unexpected_difference`** | `NUMERIC` | Measures deviation of segment from the rest of the population's growth. Calculated as: (segment relative_difference - complement relative_difference) * segment reference metric.
**`relative_unexpected_difference`** | `NUMERIC` | The unexpected_difference divided by the expected interest metric value for a segment.
**`contribution`** | `NUMERIC` | Contains the absolute value of the difference value: `ABS(difference)`.
**`apriori_support`** | `NUMERIC` | Segment size relative to the total population (filters small segments).

## Examples

### Identifying Key Drivers in 2024 H2 Liquor Sales

```sql
WITH InputData AS (
SELECT
sale_dollars,
city,
category_name,
vendor_name,
(date > '2024-07-01') AS IS_H2
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE EXTRACT(YEAR FROM DATE) = 2024
)
SELECT *
FROM AI.KEY_DRIVERS(
TABLE InputData,
metric_col => 'sale_dollars',
dimension_cols => ['city', 'vendor_name', 'category_name'],
interest_label_col => 'IS_H2',
min_apriori_support => 0
);
```

### Finding Top 5 Key Drivers in Bottle Costs

```sql
SELECT *
FROM AI.KEY_DRIVERS(
(
SELECT
state_bottle_cost,
city,
category_name,
(EXTRACT(MONTH FROM date) >= 7) AS is_h2
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE EXTRACT(YEAR FROM date) = 2024
),
metric_col => 'state_bottle_cost',
dimension_cols => ['city', 'category_name'],
interest_label_col => 'is_h2',
top_k => 5
);
```