Know the cost of a study before you run it.
Preflight is a deterministic, zero-LLM SQL pre-execution analyzer. Give it a SQL query plus a lightweight metadata catalog (and, optionally, a read-only database connection) and it produces a structured report — data lineage, scale/cost estimate, execution risk, bottlenecks, optimization recommendations, live query-plan facts, and a confidence score — without ever executing the query.
It was built for clinical research data warehouses (OMOP, Epic Clarity/Caboodle, PCORnet), where a single generated query can scan billions of rows, but it works on any SQL.
A clinical data scientist is drafting an AKI-cohort extract for the shared, nightly-refreshed OMOP warehouse. Before touching it, they triage all their draft queries in one shot — offline, in ~2 seconds, with zero warehouse load:
The naive first draft would have full-scanned a 1.5-billion-row table. Preflight flags it
BLOCK and says exactly how to fix it — no query ever runs:
| Draft | Verdict | Est. output | Runtime |
|---|---|---|---|
SELECT * FROM measurement (naive) |
🔴 BLOCK | 1,500,000,000 | EXTREME |
| concept + Q1-date scoped extract | 🟢 GO (caution) | ~8,000 (live-plan confirmed) | FAST |
~187,000× fewer rows — seen before execution. No EDW slot consumed, no waiting for the nightly refresh, no DBA escalation: just a go/no-go and a concrete fix list. (Offline, the catalog gives a deliberately conservative estimate; attaching a live plan tightens it and raises confidence from MEDIUM → HIGH.)
- Deterministic & zero-LLM. Every result comes from rules, catalog metadata, and (optionally) the database's own query planner — no model in the path, nothing to hallucinate. Enforced by a guard test.
- Never executes your query. Live analysis uses only
EXPLAIN-class statements (PostgresEXPLAIN (FORMAT JSON), DuckDBEXPLAIN, SQL ServerSET SHOWPLAN_XML ON) — the estimated plan, never the actual run. - Offline-first. The core analysis needs no database connection at all — just the SQL and a catalog YAML. Useful when the warehouse is mid-ETL or otherwise unavailable.
- Schema-agnostic & dialect-pluggable. Schemas are plain YAML data; SQL dialects come from sqlglot; connectors implement one small interface.
python -m venv .venv && . .venv/bin/activate
pip install -r requirements.txt # sqlglot, duckdb, pydantic, pyyaml
# optional live connectors:
# pip install "psycopg[binary]" # Postgres
# pip install pyodbc # SQL Server (needs an ODBC driver)# Offline (catalog-only) — no database needed:
python -m preflight.cli check my_query.sql --catalog omop --dialect duckdb
# With a live estimated plan, using the bundled synthetic OMOP DuckDB (no server):
python -m preflight.cli check fixtures/queries/crrt_flowsheet.sql \
--catalog omop --target omop --duckdb-fixture
# JSON output:
python -m preflight.cli check my_query.sql --catalog omop --format jsonAn interactive, color terminal view for vetting queries and making a fast go / no-go call:
Single-query triage
Batch worklist (a folder of queries, sorted worst-risk-first)
# Single query — a triage panel with a GO / REVIEW / BLOCK verdict:
python -m preflight.cli tui my_query.sql --catalog clarity
# A whole folder — a worklist of all *.sql, sorted worst-risk-first; pick one to drill in:
python -m preflight.cli tui ./queries/ --catalog omop --dialect duckdb
# Attach a live plan (any read-only connector) and add --no-input for scripted/CI use:
python -m preflight.cli tui report.sql --catalog clarity --sqlserver-dsn "..." --no-inputVerdicts map from risk: LOW → GO, MEDIUM → GO (caution), HIGH → REVIEW, CRITICAL → BLOCK
(an EXTREME runtime escalates a borderline case). Requires rich (in requirements.txt).
| Section | What it tells you |
|---|---|
| Study Summary | execution target, tables touched, domains, query shape |
| Data Lineage | tables, join edges, cardinality transitions (fan-out), filters |
| Scale Estimation | patients / encounters / events / output rows; runtime category (FAST…EXTREME) |
| Execution Risk | LOW · MEDIUM · HIGH · CRITICAL, with reasons |
| Bottleneck Analysis | tables ranked by estimated row contribution |
| Optimization Recommendations | structured, actionable suggestions |
| Query Plan (live) | when a connector is attached: scan/seek/join ops, est. rows, index usage, missing-index hints |
| Confidence | LOW (sparse catalog) · MEDIUM (full catalog) · HIGH (catalog + live plan) |
from preflight import run_preflight
from preflight.contracts import GeneratedSQL
from preflight.catalog.loader import load_catalog
report = run_preflight(
GeneratedSQL(query=sql_text, dialect="duckdb", target="omop"),
load_catalog("omop"),
connector=None, # or a read-only connector (see below)
)
print(report.risk_level, report.runtime_category, report.confidence)
for b in report.bottlenecks:
print(b.component, b.contribution_pct)Render a report with preflight.report.render.render_text(report) / render_json(report).
| Database | How | CLI flag |
|---|---|---|
| DuckDB | in-process EXPLAIN |
--duckdb-path file.duckdb (or --duckdb-fixture for the synthetic OMOP demo) |
| Postgres | EXPLAIN (FORMAT JSON) |
--postgres-dsn "postgresql://user@host/db" |
| SQL Server | SET SHOWPLAN_XML ON (estimated plan) |
--sqlserver-dsn "Driver={ODBC Driver 18 for SQL Server};Server=...;Database=...;UID=...;PWD=..." |
preflight check report.sql --catalog clarity \
--sqlserver-dsn "Driver={ODBC Driver 18 for SQL Server};Server=...;Database=Clarity;UID=...;PWD=..."Connectors only issue plan/EXPLAIN statements — they never run the analyzed query and never read row data.
Bundled seed catalogs: omop, epic, clarity (Epic Clarity-derived EDW),
caboodle (Epic dimensional), pcornet. A catalog is just a YAML file describing tables
(category, volume, risk, row_estimate), joins, and column selectivity.
The bootstrapper introspects your database's system catalogs (table list + row-count estimates — no data, no query execution) and writes a catalog YAML draft:
# DuckDB / Postgres now; SQL Server (Epic) needs `pip install pyodbc`
preflight catalog-bootstrap \
--sqlserver-dsn "Driver={ODBC Driver 18 for SQL Server};Server=...;Database=Clarity;..." \
--schema-name clarity --heuristic epic --stats-as-of 2026-06-01
# writes ~/.preflight/catalogs/clarity.yaml (review the categories/risk before relying on it)Catalogs in $PREFLIGHT_CATALOG_DIR (default ~/.preflight/catalogs/) override the bundled seeds
and are found automatically by name. Run the bootstrapper once per warehouse refresh; day-to-day
check runs fully offline against the cached catalog.
Put defaults in a (gitignored) .env so you don't repeat flags or expose credentials on the
command line:
PREFLIGHT_CATALOG=clarity
PREFLIGHT_DIALECT=tsql
PREFLIGHT_CATALOG_DIR=~/.preflight/catalogs
PREFLIGHT_SQLSERVER_DSN=Driver={ODBC Driver 18 for SQL Server};Server=...;Database=Clarity;...
Resolution precedence: CLI flag → environment / .env → catalog default_dialect → built-in default.
- Add a schema: drop a YAML file in
preflight/catalog/schemas/<name>.yaml(or$PREFLIGHT_CATALOG_DIR). No code change. - Add a connector: implement the
Connectorprotocol inpreflight/connector/base.py— a singleanalyze(sql) -> PlanFacts, EXPLAIN-class only, never executing the query.
- No query execution; connectors use estimated-plan statements only.
- No LLM and no network clients on the core path (guarded by a test).
- Plan predicate literals are redacted before they appear in any report (so values that could be identifiers never leak into output).
- Credentials belong in
.env/environment, not on the command line. - See
docs/privacy-and-data-handling-audit.mdfor the full data-handling audit.
. .venv/bin/activate && python -m pytestLive database tests are skipped unless PREFLIGHT_PG_DSN / PREFLIGHT_SQLSERVER_DSN are set.
preflight/ core package (parser, catalog, lineage, estimate, risk, …)
catalog/ YAML schema catalogs + loader + bootstrapper
connector/ read-only DuckDB / Postgres / SQL Server connectors
report/ text + JSON renderers
fixtures/ synthetic OMOP DuckDB + sample queries (no real data)
docs/ design specs, plans, and the privacy audit
tests/ pytest suite
Preflight is a deferred feature of PSDL Workbench, developed standalone here so it can be used
and tested on its own. Design specs and implementation plans live under docs/superpowers/.