129 changes: 129 additions & 0 deletions skills/cloud/datalineage-bigquery-asset-impact-analysis/SKILL.md
@@ -0,0 +1,129 @@
---
name: datalineage-bigquery-asset-impact-analysis
description: >-
Performs a deep impact analysis for a broken, stale, or modified BigQuery data asset.
Identifies all downstream tables, dashboards, and processes that will be affected (the blast radius).
---

# BigQuery Asset Impact Analysis

This skill guides the agent in performing a downstream impact analysis (blast
radius assessment) when a BigQuery table or view is reported as broken, stale,
missing, or when a user is planning maintenance and wants to know the
consequences of modifying or pausing updates to an asset.

It relies primarily on the **Google Cloud Data Lineage (Dataplex) MCP Server**
to discover relationships between assets.

## Prerequisites

This skill requires access to the Google Cloud Data Lineage API and an active
client connection to the Data Lineage MCP Server. For detailed connection
configurations and tool schemas, refer to [MCP Usage](references/mcp-usage.md).

## Analysis Workflow

### 1. Resolve the Asset Fully Qualified Name (FQN)

* Ensure you have the correct FQN format for the BigQuery asset:
    * *Format:* `bigquery:{project_id}.{dataset_id}.{table_or_view_id}`
    * *Example:* `bigquery:my-prod-project.analytics.orders`
* If the user provides a partial path (e.g., just `dataset.table`), attempt to
infer the active project from the environment or ask the user for
clarification.
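
The normalization described above can be sketched as a small helper. This is a minimal illustration, not part of the skill: the function name `to_bigquery_fqn`, the `default_project` parameter, and the loose ID pattern are assumptions made for the example.

```python
import re
from typing import Optional

# Assumption: a loose pattern for project, dataset, and table IDs.
# It is illustrative only and not the official BigQuery naming rule.
_ID = r"[A-Za-z0-9_\-]+"


def to_bigquery_fqn(path: str, default_project: Optional[str] = None) -> str:
    """Normalize a user-supplied path to bigquery:{project}.{dataset}.{table}."""
    raw = path.strip().strip("`")
    if raw.startswith("bigquery:"):
        raw = raw[len("bigquery:"):]
    # Also accept the bq CLI style "project:dataset.table".
    raw = raw.replace(":", ".", 1)
    parts = raw.split(".")
    if len(parts) == 2:
        # Partial path such as "dataset.table": a project must come from elsewhere.
        if not default_project:
            raise ValueError("project is missing; ask the user for clarification")
        parts = [default_project, *parts]
    if len(parts) != 3 or not all(re.fullmatch(_ID, p) for p in parts):
        raise ValueError(f"cannot resolve {path!r} to a BigQuery FQN")
    return "bigquery:" + ".".join(parts)
```

For instance, `to_bigquery_fqn("analytics.orders", "my-prod-project")` yields the example FQN shown above, while the same call without a project raises an error that signals the agent to ask the user.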

### 2. Determine Locations and Parent Path

Identify the locations to search and construct the Data Lineage API request:

* **Discover Asset Location**: Run the command `bq show --format=json
{project_id}:{dataset_id}` and extract the `location` field (e.g.,
`us-central1` or `us`). If location discovery fails due to permissions or
missing tools, prompt the user for the dataset's location.
* **Set Parent Path**: Set the `parent` path to match the location where your
Data Lineage MCP server is configured (e.g.,
`projects/{project_id}/locations/{mcp_server_location}`).
* **Configure Search Scope**: Include the discovered asset location in the
`locations` array of the payload (e.g., `["us-central1"]` or `["us",
"us-central1"]`).
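
A minimal sketch of this step follows, assuming the `bq` CLI is installed and authenticated. The helper names are hypothetical, and lowercasing the location is an assumption based on the lowercase examples above (`bq show` may report a multi-region as `US`).

```python
import json
import subprocess


def run_bq_show(project_id: str, dataset_id: str) -> str:
    """Run the bq command from this step and return its JSON output."""
    result = subprocess.run(
        ["bq", "show", "--format=json", f"{project_id}:{dataset_id}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def dataset_location(bq_show_json: str) -> str:
    """Extract the dataset location from `bq show --format=json` output."""
    info = json.loads(bq_show_json)
    location = info.get("location")
    if not location:
        raise ValueError("no location field; prompt the user for the dataset location")
    # Assumption: lineage locations are written in lowercase, as in the examples.
    return location.lower()


def lineage_scope(project_id, mcp_server_location, asset_locations):
    """Build the parent path and a de-duplicated locations list."""
    parent = f"projects/{project_id}/locations/{mcp_server_location}"
    locations = list(dict.fromkeys(loc.lower() for loc in asset_locations))
    return parent, locations
```

If `run_bq_show` raises (missing tool or insufficient permissions), the workflow falls back to asking the user for the location, as described above.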

### 3. Retrieve the Downstream Lineage Graph

Call the `DataLineageServer:search_lineage` tool to fetch downstream
relationships.

* **Direction**: Set to `DOWNSTREAM`.
* **Search Parameters**: Use `max_depth = 10` and `max_process_per_link = 5`
as robust defaults.
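
The arguments for the tool call might be assembled as below. This is a sketch under stated assumptions: `parent`, `locations`, `direction`, `max_depth`, and `max_process_per_link` come from this document, but the key that carries the asset FQN (`fully_qualified_name` here) is a placeholder, and the exact schema should be checked against the tool definition exposed by the MCP server.

```python
def build_search_arguments(fqn, parent, locations,
                           max_depth=10, max_process_per_link=5):
    """Assemble arguments for a downstream search_lineage call (illustrative)."""
    if not fqn.startswith("bigquery:"):
        raise ValueError("expected an FQN of the form bigquery:{project}.{dataset}.{table}")
    if max_depth < 1 or max_process_per_link < 1:
        raise ValueError("limits must be positive integers")
    return {
        "parent": parent,
        "locations": list(locations),
        # Hypothetical key name: confirm against the published tool schema.
        "fully_qualified_name": fqn,
        "direction": "DOWNSTREAM",
        "max_depth": max_depth,
        "max_process_per_link": max_process_per_link,
    }
```

Keeping the defaults in one place makes it straightforward to report them later in the analysis metadata and to rerun with wider limits on request.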

### 4. Identify the Blast Radius

Traverse the returned lineage links to build the impact graph:

* **Affected Assets**: The `target` of each link represents a downstream asset
that depends on your source asset.
* **Transform Processes**: Inspect the `processes` field on each link. This
identifies the ETL pipelines, BigQuery Views, or Scheduled Queries that
propagate the data.
* **Direct vs. Indirect Impact**:
    * **Direct Impact (Depth 1)**: Assets directly consuming the source asset.
      If a link has `dependency_type: EXACT_COPY`, mark the target as
      "Directly Stale / Identical Copy".
    * **Indirect Impact (Depth > 1)**: Assets further down the stream that
      will experience cascading stale data or failures.
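
The traversal above can be sketched as a breadth-first walk over the returned links. The link shape used here (dicts with `source`, `target`, `processes`, and an optional `dependency_type`) is an assumption for illustration; the actual response structure may differ and should be adapted accordingly.

```python
from collections import deque


def blast_radius(root, links):
    """Map each downstream FQN to its depth, processes, and impact type."""
    # Index links by their source asset for quick lookup.
    children = {}
    for link in links:
        children.setdefault(link["source"], []).append(link)

    impacted = {}
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        for link in children.get(node, []):
            target = link["target"]
            # Skip cycles back to the root and assets already reached;
            # breadth-first order means the first visit is the shallowest.
            if target == root or target in impacted:
                continue
            if depth > 0:
                impact = "Indirect"
            elif link.get("dependency_type") == "EXACT_COPY":
                impact = "Directly Stale / Identical Copy"
            else:
                impact = "Direct"
            impacted[target] = {
                "depth": depth + 1,
                "processes": list(link.get("processes", [])),
                "impact": impact,
            }
            queue.append((target, depth + 1))
    return impacted
```

This version records only the first link that reaches each asset. If several processes feed the same target, merging their `processes` lists would give a fuller picture.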

### 5. Summarize and Format the Output

Present your findings clearly to the user using the following structure:

1. **Executive Summary**: State the total number of downstream assets affected
and the maximum depth of the impact.
2. **Critical Path**: Highlight high-priority downstream assets (e.g., assets
containing "prod", "dashboard", "reporting", or "master" in their names).
3. **Blast Radius Table**: A clean markdown table listing the dependencies. You
MUST include all columns:

    | Downstream Asset | Transform Process | Depth | Impact Type |
    | :--- | :--- | :--- | :--- |
    | `bigquery:project.dataset.table` | `projects/p/locations/l/processes/proc` | 1 | Direct |
    | `bigquery:project.dataset.view` | `projects/p/locations/l/processes/view` | 2 | Indirect |

4. **Analysis Metadata**: Provide transparency on the parameters and boundaries
of your search so the user can choose to expand them:
    * **Locations Searched**: `{list_of_locations_queried}`
    * **Parent Location**: `{parent_path}`
    * **Depth Limit**: `{max_depth}`
    * **Process per Link Limit**: `{max_process_per_link}`
    * *Tip for User*: Let the user know they can request to rerun the analysis
      with expanded locations or larger depth limits.
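
As an illustration of the output structure above, the helpers below render the summary line, the critical-path filter, and the table. The row shape (dicts with `asset`, `processes`, `depth`, and `impact`) and the function names are assumptions for this sketch.

```python
# Keywords taken from the Critical Path guidance above.
CRITICAL_KEYWORDS = ("prod", "dashboard", "reporting", "master")


def is_critical(fqn):
    """Flag assets whose names suggest high priority."""
    name = fqn.lower()
    return any(keyword in name for keyword in CRITICAL_KEYWORDS)


def executive_summary(rows):
    """Summarize the asset count and the deepest level reached."""
    if not rows:
        return "No downstream lineage links are recorded in the queried locations."
    deepest = max(row["depth"] for row in rows)
    return f"{len(rows)} downstream asset(s) affected; maximum impact depth {deepest}."


def render_table(rows):
    """Render the blast radius table with all four required columns."""
    lines = [
        "| Downstream Asset | Transform Process | Depth | Impact Type |",
        "| :--- | :--- | :--- | :--- |",
    ]
    for row in sorted(rows, key=lambda r: (r["depth"], r["asset"])):
        processes = ", ".join(f"`{p}`" for p in row["processes"]) or "(none recorded)"
        lines.append(
            f"| `{row['asset']}` | {processes} | {row['depth']} | {row['impact']} |"
        )
    return "\n".join(lines)
```

Sorting by depth keeps direct dependencies at the top of the table, which matches the order in which users usually triage an incident.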

## Crucial Constraints & Guardrails

1. **Interpret Empty Responses Correctly**:
    * An empty response from `search_lineage` indicates that no lineage links
      are currently recorded for this asset in the queried locations. Do not
      assume the tool is broken or delayed; report the lack of dependencies
      clearly and proceed.
2. **Strictly Banned Bypasses**:
    * Always use the `DataLineageServer:search_lineage` MCP tool. Do not fall
      back to `gcloud alpha datalineage` or raw `curl` API calls; these fail
      in the agent sandbox because of authentication and mTLS restrictions.
3. **Verify Asset Existence First**:
    * If `bq show` reports that the source dataset or table does not exist,
      stop and report this directly to the user. Do not guess alternative
      table names unless the user explicitly instructs you to do so.
4. **No Output Shortcutting or Hallucinated Artifacts**:
    * Present the complete downstream blast radius table directly in your
      final response. Do not tell the user you have created a separate
      markdown file or artifact containing the details unless you have
      actually executed file-writing tools to create it.

## Reference Directory

- [MCP Usage](references/mcp-usage.md): Using the Google Cloud Data Lineage
remote MCP server and tool preferences.

## External Documentation

- [Google Cloud Dataplex Data Lineage Documentation](https://cloud.google.com/dataplex/docs/data-lineage)
- [Use the Data Lineage MCP server](https://docs.cloud.google.com/dataplex/docs/use-lineage-mcp)
- [Dataplex Data Lineage API Reference](https://cloud.google.com/dataplex/docs/reference/data-lineage/rest)
@@ -0,0 +1,40 @@
# Google Cloud Data Lineage MCP Usage

The Data Lineage service is supported by a remote Model Context Protocol (MCP)
server that provides structured tools for discovering relationships between data
assets.

## MCP Tools for Data Lineage

- **search_lineage**: Performs a breadth-first search (upstream or downstream)
to retrieve lineage links for an asset identified by its Fully Qualified
Name (FQN). Supports Column-Level Lineage (CLL).

### Tool Preference Hierarchy

* Default: Data Lineage MCP tool (`DataLineageServer:search_lineage`) > `bq`
CLI > `gcloud` CLI.
* Note: `gcloud` does NOT support lineage link searches in the standard or
beta tracks. Always prefer the MCP tool.

## Setup Instructions

To connect to the Data Lineage MCP server, see
[Use the Data Lineage MCP server](https://docs.cloud.google.com/dataplex/docs/use-lineage-mcp).

For client configuration, add the following block to your agent's MCP
configuration file (e.g., `mcp_config.json`):

```json
{
  "mcpServers": {
    "DataLineageServer": {
      "serverUrl": "https://datalineage.googleapis.com/mcp",
      "authProviderType": "google_credentials",
      "headers": {
        "x-goog-user-project": "<GCP_PROJECT_ID>"
      }
    }
  }
}
```