Metadata-driven framework for ingesting data into Databricks using Lakehouse Federation. Supports the following ingestion patterns:
- Full: ingests entire table
- Incremental: ingests incrementally using watermarks
- Partitioned: spreads ingestion across many small queries, run N at a time. Used for large tables. See diagram below.
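As a rough, hedged sketch of the partitioned pattern (not the project's actual code, which lives in the deployed main.py module), splitting a numeric partition column into equal-width range queries might look like this; the table, column, and bucket count are illustrative:

```python
# Illustrative sketch only: split a numeric partition column into equal-width
# range queries that can then be run N at a time.

def build_partition_queries(table: str, column: str,
                            min_val: int, max_val: int,
                            num_buckets: int) -> list[str]:
    """Return one range-bounded SELECT per bucket covering [min_val, max_val]."""
    step = (max_val - min_val) // num_buckets + 1
    queries = []
    for i in range(num_buckets):
        lower = min_val + i * step
        upper = lower + step
        queries.append(
            f"SELECT * FROM {table} WHERE {column} >= {lower} AND {column} < {upper}"
        )
    return queries

# Hypothetical example: 4 queries covering order_id values 1..1,000,000.
for q in build_partition_queries("source_catalog.sales.orders", "order_id", 1, 1_000_000, 4):
    print(q)
```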
The following sources are currently supported:
- SQL Server
- Oracle
- PostgreSQL
- Redshift
- Synapse
Follow the Lakehouse Federation instructions to create a connection and foreign catalog.
Install the Databricks CLI from https://docs.databricks.com/dev-tools/cli/install.html
Choose one of the following authentication methods: a Personal Access Token (PAT) or OAuth.
Generate a Personal Access Token:
- Log into your Databricks workspace
- Click on your username in the top-right corner
- Select User Settings → Developer → Access tokens
- Click Generate new token
- Give it a name (e.g., "Local Development") and set expiration
- Copy the generated token
Configure the CLI with your PAT:

databricks configure --token --profile DEFAULT

You'll be prompted for:
- Databricks Host: https://your-workspace.cloud.databricks.com
- Token: paste your generated token

This will update the DEFAULT profile in ~/.databrickscfg.
Configure OAuth:
databricks auth login --host https://your-workspace.cloud.databricks.com --profile PROD

This will:
- Open your browser for authentication
- Create a profile in ~/.databrickscfg
- Store OAuth credentials securely
Check your configuration:
# List all profiles
cat ~/.databrickscfg

Your ~/.databrickscfg should look like:
[DEFAULT]
host = https://your-workspace.cloud.databricks.com
token = dapi123abc...
[DEV]
host = https://dev-workspace.cloud.databricks.com
token = dapi456def...
[PROD]
host = https://prod-workspace.cloud.databricks.com
auth_type = databricks-cli

Create and activate a Python virtual environment to manage dependencies:
# Create virtual environment on macOS/Linux
# See link above for Windows documentation
$ python3 -m venv .venv
# Activate virtual environment
$ source .venv/bin/activate
# Install required Python packages
$ pip install -r requirements-dev.txt

Update the variables in databricks.yml to match your environment.
- workspace.host: Your Databricks workspace URL
- cluster_id: ID of your cluster for production deployment. For development, the bundle will look up the ID based on the specified name (e.g., Shared Cluster).
- warehouse_id: ID of your SQL warehouse for production deployment. For development, the bundle will look up the ID based on the specified name (e.g., Shared Serverless).
- concurrency: Concurrency for each task. Can be overridden during deployment.
Example configuration for dev target:
targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://your-workspace.cloud.databricks.com
    variables:
      cluster_id: your_cluster_id
      warehouse_id: your_warehouse_id
      concurrency: 16

The solution is driven by metadata stored in a control table. In this table you can specify sources and sinks, loading behavior (full, incremental, partitioned), etc.
- Create the control table using the _create_control_table notebook.
- Merge metadata into the control table. See the load_metadata_tpcds notebook for an example.
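As a rough illustration of the merge step, a single metadata row could be merged from a notebook as sketched below. The control table name and column names here are assumptions for illustration only; follow the schema created by the _create_control_table notebook and the load_metadata_tpcds notebook for the real layout.

```python
# Hedged sketch: the table name (main.lakefed.control_table) and columns
# (source_catalog, source_schema, source_table, load_type, watermark_column,
# partition_column) are assumptions, not the project's actual schema.
spark.sql("""
    MERGE INTO main.lakefed.control_table AS t
    USING (
        SELECT
            'fed_catalog'     AS source_catalog,
            'tpcds'           AS source_schema,
            'store_sales'     AS source_table,
            'partitioned'     AS load_type,
            NULL              AS watermark_column,
            'ss_sold_date_sk' AS partition_column
    ) AS s
    ON  t.source_catalog = s.source_catalog
    AND t.source_schema  = s.source_schema
    AND t.source_table   = s.source_table
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```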
Some sources require additional configuration in order to retrieve table sizes for partitioned ingestion:
Oracle
Ingesting from Oracle requires permission to read the sys.dba_segments table. This is to obtain the source table size.
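As a hedged illustration only (the foreign catalog name, owner, and segment name below are placeholders, not values from this project), the size lookup could be expressed through Lakehouse Federation roughly like this:

```python
# Illustrative sketch: with read access to sys.dba_segments, sum the segment
# bytes for the source table. "oracle_fed" and the filter values are placeholders.
size_df = spark.sql("""
    SELECT SUM(bytes) AS table_bytes
    FROM oracle_fed.sys.dba_segments
    WHERE owner = 'SALES' AND segment_name = 'ORDERS'
""")
size_df.show()
```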
PostgreSQL
The number of queries used for ingestion is determined in part by the size of the source table. Since Lakehouse Federation doesn't currently support PostgreSQL object size functions (e.g., pg_table_size), you need to create a view in the source database or use JDBC pushdown. Creating a view in the source database is strongly recommended.
- Database view - create a view in the source database using the statement below. Leave the jdbc_config_file job parameter blank, and the view will be queried using Lakehouse Federation.
create or replace view public.vw_pg_table_size
as
select
table_schema,
table_name,
pg_table_size(quote_ident(table_name)) as pg_table_size,
pg_size_pretty(pg_table_size(quote_ident(table_name))) as pg_table_size_pretty
from information_schema.tables
where table_schema not in ('pg_catalog', 'information_schema')
and table_type = 'BASE TABLE';

- JDBC pushdown - create a config file like config/postgresql_jdbc.json. Use the path to the file as the value for the jdbc_config_file job parameter. Secrets must be used for JDBC credentials. See notebooks/manage_secrets.ipynb for reference.
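A hedged sketch of how the two options might look from a notebook follows; the foreign catalog name, JDBC URL, and secret scope/keys are placeholders rather than values defined by this project.

```python
# Option 1: query the helper view through Lakehouse Federation
# (leave the jdbc_config_file job parameter blank). "pg_catalog_fed" is a placeholder.
sizes_fed = spark.sql("""
    SELECT table_schema, table_name, pg_table_size
    FROM pg_catalog_fed.public.vw_pg_table_size
""")

# Option 2: push the size query down over JDBC, with credentials stored in
# Databricks Secrets (see notebooks/manage_secrets.ipynb). Scope/key names are placeholders.
sizes_jdbc = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://host:5432/db")
    .option("user", dbutils.secrets.get("lakefed", "pg_user"))
    .option("password", dbutils.secrets.get("lakefed", "pg_password"))
    .option("query", """
        SELECT table_schema, table_name,
               pg_table_size(quote_ident(table_name)) AS pg_table_size
        FROM information_schema.tables
        WHERE table_type = 'BASE TABLE'
    """)
    .load()
)
```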
- Run the lakefed_ingest_controller job, providing the desired task_collection as a parameter.
- The lakefed_ingest_controller job will run all non-partitioned tasks, followed by all partitioned tasks. Non-partitioned tasks run concurrently, and partitioned tasks run sequentially. This is because partitioned tasks spawn concurrent queries of their own, and we want to maintain a consistent level of concurrency at the controller job (and source system) scope.
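To illustrate why the controller serializes partitioned tasks, here is a minimal sketch, with a hypothetical run_query helper, of how a single partitioned task might fan its range queries out at a fixed concurrency:

```python
# Illustrative sketch: bounded fan-out of partition queries.
# run_query is a hypothetical stand-in for executing one ingestion query;
# `concurrency` mirrors the bundle's concurrency variable.
from concurrent.futures import ThreadPoolExecutor

def run_query(sql: str) -> None:
    # Placeholder: execute one range-bounded query and append to the sink.
    print(f"running: {sql}")

partition_queries = ["SELECT ... WHERE id >= 0 AND id < 1000",
                     "SELECT ... WHERE id >= 1000 AND id < 2000"]
concurrency = 16

# At most `concurrency` queries hit the source at any moment, which is why the
# controller runs partitioned tasks one after another instead of stacking them.
with ThreadPoolExecutor(max_workers=concurrency) as pool:
    list(pool.map(run_query, partition_queries))
```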
- Use a partition column with a relatively even distribution. If the partition column is also used in an index, that is even better.
- Use a small all-purpose cluster if you have partitioned ingestion tasks. This cluster is used only for configuring partitions (not heavy data processing), and we don't want to wait for a job cluster to spin up for each partitioned ingestion task.
- Does not handle skew. The solution works best when the partition column has an even distribution.
- Does not provide atomicity. Individual queries are not executed as a single transaction. One could fail while the rest succeed, or the source table could be altered before all ingestion queries are completed.
$ databricks bundle deploy --target dev --profile DEFAULT

Note: Since "dev" is specified as the default target in databricks.yml, you can omit the --target dev parameter. Similarly, --profile DEFAULT can be omitted if you only have one profile configured for your workspace.
This deploys everything that's defined for this project, including:
- Three jobs prefixed with lakefed_ingest_
- main.py module for the partitioned ingest job
- All associated resources
You can find the deployed job by opening your workspace and clicking on Workflows.
$ databricks bundle deploy --target prod --profile PROD

$ databricks bundle run --target prod --profile PROD

For an enhanced development experience, consider installing:
- Databricks extension for Visual Studio Code: https://docs.databricks.com/dev-tools/vscode-ext.html
For comprehensive documentation on:
- Databricks Asset Bundles: https://docs.databricks.com/dev-tools/bundles/index.html
- CI/CD configuration: https://docs.databricks.com/dev-tools/bundles/index.html
- assets/: Images for README
- config/: Config for PostgreSQL JDBC pushdown
- notebooks/: Notebooks showing how to load metadata and work with Databricks Secrets
- resources/: Databricks Asset Bundle resource definitions
- src/: Source files including notebooks, SQL files, and Python modules
- databricks.yml: Main bundle configuration file
Follow the instructions above in the "Set up Python Virtual Environment" section.
Databricks Connect is required to run some of the unit tests.
- Install dependent packages:
$ pip install -r requirements-dev.txt
- Run unit tests with pytest
$ pytest
If you run into this error:
ERROR tests/main_test.py - Exception: Cluster id or serverless are required but were not specified.
Add the cluster_id to your .databrickscfg file
[DEFAULT]
host = https://your-workspace.cloud.databricks.com
cluster_id = XXXX-XXXXXX-XXXXXXXX
auth_type = databricks-cli
Databricks support doesn't cover this content. For questions or bugs, please open a GitHub issue and the team will help on a best effort basis.
© 2025 Databricks, Inc. All rights reserved. The source in this notebook is provided subject to the Databricks License. All included or referenced third party libraries are subject to the licenses set forth below.
| library | description | license | source |
|---|---|---|---|
| pytest | Testing framework | MIT | GitHub |
| setuptools | Build system | MIT | GitHub |
| wheel | CLI for manipulating wheel files | MIT | GitHub |
| jsonschema | JSON schema validation | MIT | GitHub |
