Skip to content

Feature Request: Comprehensive Metadata Discovery for Data Development Tools #118

@mullinsms

Description

@mullinsms

Summary

The ADBC BigQuery driver currently supports discovery of projects, datasets, tables (including views, materialized views, external tables, clones, snapshots), columns, and primary/foreign key constraints. However, BigQuery has additional object types and metadata that data development tools need — routines (functions and procedures), ML models, partition details, DDL, reverse foreign keys, and richer table/dataset properties. This issue requests expanding metadata discovery to cover these gaps.

We build a collaborative data operations platform and currently use the BigQuery JDBC driver supplemented by INFORMATION_SCHEMA queries and the Google Cloud BigQuery Java API to extract metadata. We are evaluating the ADBC driver as a modern Arrow-native replacement but cannot adopt it without coverage for the object categories listed below.

Each section describes what we need, why it matters, what metadata properties are required, and how the data can be obtained from BigQuery.


Table of Contents

  1. Stored Procedures
  2. Scalar Functions (UDFs)
  3. Table Functions
  4. Routine Parameters
  5. ML Models
  6. Partition Metadata
  7. DDL / CREATE Statement Retrieval
  8. Reverse Foreign Keys (Referenced By)
  9. Extended Table Statistics
  10. Dataset Properties
  11. Extended External Table Properties
  12. Extended Materialized View Properties
  13. Table Options (Labels, Expiration, Encryption)
  14. Row-Level Security Policies
  15. Column-Level Security (Policy Tags)
  16. Authorized Views & Datasets
  17. Search Indexes

1. Stored Procedures

Motivation

Stored procedures are a core part of BigQuery's programmability model. They support SQL and JavaScript, can accept IN/OUT/INOUT parameters, and are used extensively for ETL orchestration and data transformations. Data development tools need to enumerate them for catalog browsing, autocomplete, and documentation.

Required Metadata

Property Description Source Column
Routine catalog Project containing the procedure routine_catalog
Routine schema Dataset containing the procedure routine_schema
Routine name Procedure identifier routine_name
Routine type PROCEDURE routine_type
Routine body SQL or EXTERNAL routine_body
External language JAVASCRIPT (if external) external_language
DDL Complete CREATE PROCEDURE statement ddl
Description User-provided description From ROUTINE_OPTIONS
Is deterministic Whether the procedure is deterministic is_deterministic
Created Creation timestamp created
Last altered Last modification timestamp last_altered
Security type INVOKER or DEFINER security_type

How to Obtain

SELECT
    r.routine_catalog,
    r.routine_schema,
    r.routine_name,
    r.routine_type,
    r.routine_body,
    r.external_language,
    r.is_deterministic,
    r.security_type,
    r.created,
    r.last_altered,
    r.ddl
FROM
    `project`.`dataset`.INFORMATION_SCHEMA.ROUTINES r
WHERE
    r.routine_type = 'PROCEDURE'
ORDER BY
    r.routine_name;

Options (description, library URIs) are available from:

SELECT option_name, option_value
FROM `project`.`dataset`.INFORMATION_SCHEMA.ROUTINE_OPTIONS
WHERE specific_name = 'procedure_name';

2. Scalar Functions (UDFs)

Motivation

User-defined scalar functions are used extensively in BigQuery for reusable transformations. Tools need to discover them for autocomplete, signature help, and catalog browsing.

Required Metadata

Property Description Source Column
Routine catalog Project containing the function routine_catalog
Routine schema Dataset containing the function routine_schema
Routine name Function identifier routine_name
Routine type FUNCTION routine_type
Return data type Function return type data_type
Routine body SQL or EXTERNAL routine_body
External language JAVASCRIPT (if external) external_language
DDL Complete CREATE FUNCTION statement ddl
Description User-provided description From ROUTINE_OPTIONS
Is deterministic Whether the function is deterministic is_deterministic
Created Creation timestamp created
Last altered Last modification timestamp last_altered

How to Obtain

SELECT
    r.routine_catalog,
    r.routine_schema,
    r.routine_name,
    r.routine_type,
    r.data_type,
    r.routine_body,
    r.routine_definition,
    r.external_language,
    r.is_deterministic,
    r.created,
    r.last_altered,
    r.ddl
FROM
    `project`.`dataset`.INFORMATION_SCHEMA.ROUTINES r
WHERE
    r.routine_type = 'FUNCTION'
ORDER BY
    r.routine_name;

3. Table Functions

Motivation

Table-valued functions return tabular results and are used in FROM clauses. They need to be distinguished from scalar functions because they have a result column schema rather than a single return type.

Required Metadata

Same as scalar functions (section 2), plus:

Property Description
Return columns List of result column names and types

How to Obtain

-- Identify table functions
SELECT *
FROM `project`.`dataset`.INFORMATION_SCHEMA.ROUTINES r
WHERE r.routine_type = 'TABLE FUNCTION';

Return column schema is embedded in the ddl column and can also be inferred from INFORMATION_SCHEMA.PARAMETERS entries with ordinal_position = 0 or from the routine definition.


4. Routine Parameters

Motivation

Parameter metadata enables signature help, call template generation, and documentation. Without parameter names and types, tools can only show the routine name.

Required Metadata

Property Description Source Column
Routine name Parent routine specific_name
Parameter name Parameter identifier parameter_name
Ordinal position 1-based position ordinal_position
Parameter mode IN, OUT, INOUT parameter_mode
Data type BigQuery type name data_type
Is result Whether this describes the return type parameter_mode = 'OUT' and ordinal_position = 0

How to Obtain

SELECT
    p.specific_catalog,
    p.specific_schema,
    p.specific_name,
    p.ordinal_position,
    p.parameter_name,
    p.parameter_mode,
    p.data_type
FROM
    `project`.`dataset`.INFORMATION_SCHEMA.PARAMETERS p
WHERE
    p.ordinal_position > 0  -- exclude return type entry
ORDER BY
    p.specific_name,
    p.ordinal_position;

5. ML Models

Motivation

BigQuery ML is a core BigQuery feature that allows users to create, train, and deploy machine learning models using SQL. Models are first-class dataset objects. Tools need to enumerate them for catalog browsing and display model properties.

Required Metadata

Property Description Source
Model name Model identifier model_name
Model type LINEAR_REG, LOGISTIC_REG, KMEANS, BOOSTED_TREE, DNN_CLASSIFIER, ARIMA_PLUS, etc. model_type
Description User-provided description description
Created Creation timestamp creation_time
Last modified Last modification timestamp modified_time
Expiration Model expiration timestamp model_expiration_time
Labels Key-value labels labels
Feature columns Input feature column names and types feature_columns
Label columns Target label columns label_columns
Training runs List of training executions with timestamps training_runs
Location GCP region location
Encryption KMS key configuration encryption_configuration
Default trial ID Default trial for hyperparameter tuning default_trial_id

How to Obtain

Option A: INFORMATION_SCHEMA

SELECT
    model_name,
    model_type,
    schema_name,
    creation_time,
    modified_time,
    model_expiration_time,
    description,
    labels,
    feature_columns,
    label_columns,
    training_runs,
    location,
    encryption_configuration,
    default_trial_id
FROM
    `project`.`dataset`.INFORMATION_SCHEMA.ML_MODELS;

Option B: BigQuery API

GET https://bigquery.googleapis.com/bigquery/v2/projects/{project}/datasets/{dataset}/models/{model}

The API returns the full model resource including training runs, evaluation metrics, and hyperparameters.

Note

ML models are not traditional tables and don't fit naturally into the GetObjects table hierarchy. Consider either:

  1. Including them as a special table type (e.g., "MODEL")
  2. Providing a separate GetModels() method
  3. Extending GetObjects with a new sibling to db_schema_tables

6. Partition Metadata

Motivation

BigQuery tables are commonly partitioned by time or integer range. Users need per-partition statistics for:

  • Understanding data distribution and freshness
  • Writing partition-pruning queries
  • Monitoring storage costs (active vs long-term tiers)
  • Identifying stale or oversized partitions

Required Metadata

Property Description Source Column
Table name Parent table table_name
Partition ID Partition identifier (e.g., "20240101", "NULL", "UNPARTITIONED") partition_id
Total rows Row count in partition total_rows
Total logical bytes Logical storage size total_logical_bytes
Total billable bytes Billable storage size total_billable_bytes
Last modified time Last data modification last_modified_time
Storage tier ACTIVE or LONG_TERM storage_tier

How to Obtain

SELECT
    p.table_name,
    p.partition_id,
    p.total_rows,
    p.total_logical_bytes,
    p.total_billable_bytes,
    p.last_modified_time,
    p.storage_tier
FROM
    `project`.`dataset`.INFORMATION_SCHEMA.PARTITIONS p
WHERE
    p.table_name = 'table_name'
ORDER BY
    p.partition_id;

Aggregated Table Statistics from Partitions

SELECT
    p.table_name,
    COUNT(p.partition_id) AS partition_count,
    SUM(p.total_rows) AS total_rows,
    SUM(p.total_logical_bytes) AS total_logical_bytes,
    SUM(p.total_billable_bytes) AS total_billable_bytes,
    MAX(p.last_modified_time) AS last_modified_time
FROM
    `project`.`dataset`.INFORMATION_SCHEMA.PARTITIONS p
GROUP BY
    p.table_name;

7. DDL / CREATE Statement Retrieval

Motivation

DDL captures the complete definition of any BigQuery object — tables, views, materialized views, external tables, functions, procedures, and models. It is essential for:

  • Schema documentation and version control
  • Environment migration (dev to staging to production)
  • Object recreation and cloning
  • Understanding full object configuration (partitioning, clustering, options)

Required Metadata

Property Description
Object name Fully qualified object name
Object type TABLE, VIEW, MATERIALIZED VIEW, FUNCTION, PROCEDURE, etc.
DDL text Complete CREATE statement

How to Obtain

BigQuery uniquely provides DDL as a column in INFORMATION_SCHEMA:

-- Tables, views, materialized views, external tables
SELECT table_name, table_type, ddl
FROM `project`.`dataset`.INFORMATION_SCHEMA.TABLES;

-- Routines (functions, procedures)
SELECT routine_name, routine_type, ddl
FROM `project`.`dataset`.INFORMATION_SCHEMA.ROUTINES;

Suggested Approach

The simplest integration is to include the ddl column from existing INFORMATION_SCHEMA.TABLES and INFORMATION_SCHEMA.ROUTINES queries and expose it as an additional field in GetObjects (e.g., table_ddl in TABLE_SCHEMA).


8. Reverse Foreign Keys (Referenced By)

Motivation

When inspecting a table, users need to see not only which tables it references (foreign keys) but also which tables reference it (reverse foreign keys / "referenced by"). This is essential for impact analysis — understanding what downstream tables would be affected by changes.

Required Metadata

Property Description
Referencing table Table that holds the foreign key
FK constraint name Foreign key constraint name
Referencing columns Columns in the referencing table
Referenced columns Columns in this table (the target)

How to Obtain

SELECT
    constr.table_name AS referencing_table,
    constr.constraint_name,
    col.column_name AS referencing_column,
    pk_col.column_name AS referenced_column
FROM
    `project`.`dataset`.INFORMATION_SCHEMA.CONSTRAINT_COLUMN_USAGE u
    INNER JOIN `project`.`dataset`.INFORMATION_SCHEMA.TABLE_CONSTRAINTS constr
        ON constr.constraint_name = u.constraint_name
        AND constr.constraint_type = 'FOREIGN KEY'
    INNER JOIN `project`.`dataset`.INFORMATION_SCHEMA.KEY_COLUMN_USAGE col
        ON col.constraint_name = constr.constraint_name
    INNER JOIN `project`.`dataset`.INFORMATION_SCHEMA.KEY_COLUMN_USAGE pk_col
        ON pk_col.table_name = 'referenced_table_name'
        AND pk_col.ordinal_position = col.position_in_unique_constraint
WHERE
    TRIM(u.table_name) = 'referenced_table_name'
ORDER BY
    constr.constraint_name,
    col.ordinal_position;

Note

This uses the same INFORMATION_SCHEMA tables already queried for forward foreign keys. It could be exposed as an additional constraint type or as a separate field on the table.


9. Extended Table Statistics

Motivation

The driver currently exposes num_rows from table metadata. Tools need richer storage statistics for cost monitoring, capacity planning, and performance optimization.

Required Metadata

Property Description Source
Total rows Row count PARTITIONS aggregate or num_rows
Total logical bytes Logical storage size PARTITIONS aggregate
Total billable bytes Billable storage size PARTITIONS aggregate
Total physical bytes Physical storage including compression TABLE_STORAGE
Active logical bytes Recently modified data (<90 days) TABLE_STORAGE
Long-term logical bytes Data unmodified >90 days TABLE_STORAGE
Time travel physical bytes Time travel window storage TABLE_STORAGE
Partition count Number of partitions PARTITIONS aggregate
Last modified time Most recent data change PARTITIONS aggregate

How to Obtain

-- Aggregated from PARTITIONS (most accurate row counts)
SELECT
    table_name,
    COUNT(*) AS partition_count,
    SUM(total_rows) AS total_rows,
    SUM(total_logical_bytes) AS total_logical_bytes,
    SUM(total_billable_bytes) AS total_billable_bytes,
    MAX(last_modified_time) AS last_modified_time
FROM `project`.`dataset`.INFORMATION_SCHEMA.PARTITIONS
GROUP BY table_name;

-- Detailed storage breakdown
SELECT
    table_name,
    total_logical_bytes,
    active_logical_bytes,
    long_term_logical_bytes,
    total_physical_bytes,
    active_physical_bytes,
    long_term_physical_bytes,
    time_travel_physical_bytes
FROM `project`.`dataset`.INFORMATION_SCHEMA.TABLE_STORAGE;

ADBC Spec Alignment

The ADBC GetStatistics() API (since v1.1.0) defines standard statistic keys for row count (key 6), null count (key 5), distinct count (key 1), etc. Row count and byte sizes would map naturally to this API.


10. Dataset Properties

Motivation

Datasets (schemas) in BigQuery carry important configuration metadata — location, default expiration, labels, access controls, and encryption settings. Tools need this for environment documentation and administrative workflows.

Required Metadata

Property Description
Dataset name Dataset identifier
Description User-provided description
Location GCP region (e.g., US, EU, us-central1)
Creation time When the dataset was created
Last modified time Last configuration change
Default table expiration Auto-delete interval for new tables
Default partition expiration Auto-delete interval for partitions
Labels Key-value labels
Default collation Default collation for new string columns
Max time travel hours Time travel window (48-168 hours)
Storage billing model LOGICAL or PHYSICAL
Is case insensitive Case sensitivity setting

How to Obtain

Option A: INFORMATION_SCHEMA

SELECT *
FROM `project`.INFORMATION_SCHEMA.SCHEMATA
WHERE schema_name = 'dataset_name';

SELECT *
FROM `project`.INFORMATION_SCHEMA.SCHEMATA_OPTIONS
WHERE schema_name = 'dataset_name';

Option B: BigQuery API

GET https://bigquery.googleapis.com/bigquery/v2/projects/{project}/datasets/{dataset}

The API returns the complete dataset resource including access controls and encryption configuration.


11. Extended External Table Properties

Motivation

External tables in BigQuery connect to data stored outside BigQuery (Cloud Storage, Google Sheets, Bigtable, etc.). Tools need to display the source configuration so users understand where data comes from and how it's parsed.

Required Metadata

Property Description
Source format AVRO, CSV, DATASTORE_BACKUP, GOOGLE_SHEETS, JSON, ORC, PARQUET, ICEBERG, DELTA
Source URIs Array of Cloud Storage URIs or Sheet URLs
Compression GZIP, SNAPPY, etc.
Encoding UTF-8, ISO-8859-1
Field delimiter CSV field delimiter
Skip leading rows Number of header rows to skip
Quote character CSV quote character
Allow jagged rows Whether ragged CSV rows are allowed
Allow quoted newlines Whether newlines in quoted fields are allowed
Max bad records Error tolerance threshold
Null marker String representing NULL
Hive partition URI prefix Hive-style partition detection
Require hive partition filter Whether partition filter is required
Sheet range Google Sheets range (if applicable)
JSON extension GEOJSON, etc.
Enable logical types Whether Avro/Parquet logical types are used
Decimal target types How decimals are mapped

How to Obtain

SELECT
    t.table_name,
    o.option_name,
    o.option_value
FROM
    `project`.`dataset`.INFORMATION_SCHEMA.TABLES t
    JOIN `project`.`dataset`.INFORMATION_SCHEMA.TABLE_OPTIONS o
        ON o.table_name = t.table_name
WHERE
    t.table_type = 'EXTERNAL'
ORDER BY
    t.table_name, o.option_name;

Key option names include: format, uris, compression, encoding, field_delimiter, skip_leading_rows, quote, allow_jagged_rows, allow_quoted_newlines, max_bad_records, null_marker, hive_partition_uri_prefix, require_hive_partition_filter, sheet_range, json_extension, enable_logical_types, decimal_target_types.


12. Extended Materialized View Properties

Motivation

Materialized views in BigQuery auto-refresh and can significantly reduce query costs. Tools need to display refresh configuration and status.

Required Metadata

Property Description Source
View definition The SELECT query INFORMATION_SCHEMA.MATERIALIZED_VIEWS.definition
Enable refresh Whether auto-refresh is on TABLE_OPTIONS enable_refresh
Refresh interval Minutes between refreshes TABLE_OPTIONS refresh_interval_minutes
Last refresh time When the MV was last refreshed INFORMATION_SCHEMA.MATERIALIZED_VIEWS.last_refresh_time
Refresh watermark Freshness of the MV data INFORMATION_SCHEMA.MATERIALIZED_VIEWS.refresh_watermark

How to Obtain

SELECT
    table_name,
    definition,
    last_refresh_time,
    refresh_watermark
FROM
    `project`.`dataset`.INFORMATION_SCHEMA.MATERIALIZED_VIEWS;

Combined with TABLE_OPTIONS for refresh configuration.


13. Table Options (Labels, Expiration, Encryption)

Motivation

BigQuery tables carry rich configuration via options. Tools need to display these for documentation, compliance auditing, and operational awareness.

Required Metadata

Property Description Option Name
Description Table description description
Expiration timestamp When the table auto-deletes expiration_timestamp
Partition expiration days When partitions auto-delete partition_expiration_days
Require partition filter Whether queries must filter on partition require_partition_filter
KMS key name Cloud KMS encryption key kms_key_name
Friendly name Display name friendly_name
Labels Key-value labels labels
Default rounding mode Rounding for NUMERIC/BIGNUMERIC default_rounding_mode
Max staleness BI Engine staleness tolerance max_staleness

How to Obtain

SELECT
    table_name,
    option_name,
    option_type,
    option_value
FROM
    `project`.`dataset`.INFORMATION_SCHEMA.TABLE_OPTIONS
WHERE
    table_name = 'table_name';

14. Row-Level Security Policies

Motivation

BigQuery supports row-level access policies that filter rows based on the querying user's identity. Tools that display security configuration need to surface these policies.

Required Metadata

Property Description Source Column
Policy name Row policy identifier row_access_policy_name
Table name Table the policy applies to table_name
Filter predicate SQL expression controlling access filter_predicate
Grantees Users/groups/domains granted access grantee_list
Created Creation timestamp creation_time

How to Obtain

SELECT
    row_access_policy_name,
    table_catalog,
    table_schema,
    table_name,
    filter_predicate,
    grantee_list,
    creation_time
FROM
    `project`.`dataset`.INFORMATION_SCHEMA.ROW_ACCESS_POLICIES;

15. Column-Level Security (Policy Tags)

Motivation

BigQuery supports column-level security via Data Catalog policy tags. Columns can be tagged with policies that restrict access. Tools should indicate which columns have security policies applied.

Required Metadata

Property Description Source Column
Table name Parent table table_name
Column name Protected column column_name
Policy tag Data Catalog policy tag resource name policy_tag

How to Obtain

SELECT
    table_name,
    column_name,
    field_path,
    policy_tag
FROM
    `project`.`dataset`.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS
WHERE
    policy_tag IS NOT NULL;

16. Authorized Views & Datasets

Motivation

BigQuery supports authorized views and authorized datasets — views/datasets that can access data in other datasets regardless of the querying user's permissions. This is a key security pattern. Tools should indicate which views are authorized.

Required Metadata

For authorized views on a dataset:

Property Description
Source dataset The dataset granting access
Authorized view Project.dataset.view that has been authorized

For authorized datasets:

Property Description
Source dataset The dataset granting access
Authorized dataset Project.dataset that has been authorized

How to Obtain

This metadata is available via the BigQuery API on the dataset resource's access field:

GET https://bigquery.googleapis.com/bigquery/v2/projects/{project}/datasets/{dataset}

The response includes an access array with entries of type view (for authorized views) and dataset (for authorized datasets).


17. Search Indexes

Motivation

BigQuery supports search indexes for efficient text search across STRING, JSON, and ARRAY columns. Tools should display which tables have search indexes and their configuration.

Required Metadata

Property Description Source Column
Index name Search index identifier index_name
Table name Indexed table table_name
Index columns Columns included in the index index_columns
Index type SEARCH index_type
Index status ACTIVE, PENDING, DISABLED, etc. index_status
Created Creation timestamp creation_time
Last modified Last modification timestamp last_modified_time
Coverage percentage Percentage of table data indexed coverage_percentage

How to Obtain

SELECT
    table_name,
    index_name,
    index_columns,
    index_type,
    index_status,
    creation_time,
    last_modified_time,
    coverage_percentage
FROM
    `project`.`dataset`.INFORMATION_SCHEMA.SEARCH_INDEXES;

Summary Priority Table

# Category BigQuery Source Impact
1 Stored Procedures INFORMATION_SCHEMA.ROUTINES High — core SQL development
2 Scalar Functions (UDFs) INFORMATION_SCHEMA.ROUTINES High — autocomplete, documentation
3 Table Functions INFORMATION_SCHEMA.ROUTINES High — SQL development
4 Routine Parameters INFORMATION_SCHEMA.PARAMETERS High — signature help
5 ML Models INFORMATION_SCHEMA.ML_MODELS / BigQuery API Medium — BQML users
6 Partition Metadata INFORMATION_SCHEMA.PARTITIONS High — cost & performance
7 DDL Retrieval INFORMATION_SCHEMA.TABLES.ddl / ROUTINES.ddl High — migration, documentation
8 Reverse Foreign Keys INFORMATION_SCHEMA.CONSTRAINT_COLUMN_USAGE Medium — impact analysis
9 Extended Table Statistics INFORMATION_SCHEMA.PARTITIONS / TABLE_STORAGE Medium — cost monitoring
10 Dataset Properties INFORMATION_SCHEMA.SCHEMATA / BigQuery API Medium — admin context
11 External Table Properties INFORMATION_SCHEMA.TABLE_OPTIONS Medium — data source config
12 Materialized View Properties INFORMATION_SCHEMA.MATERIALIZED_VIEWS Medium — refresh status
13 Table Options INFORMATION_SCHEMA.TABLE_OPTIONS Medium — configuration display
14 Row-Level Security Policies INFORMATION_SCHEMA.ROW_ACCESS_POLICIES Low — security audit
15 Column-Level Security INFORMATION_SCHEMA.COLUMN_FIELD_PATHS Low — security audit
16 Authorized Views & Datasets BigQuery API (dataset access field) Low — security audit
17 Search Indexes INFORMATION_SCHEMA.SEARCH_INDEXES Low — index awareness

We're happy to provide further details on any of these categories.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions