A tool for migrating Airflow DAGs to Databricks Asset Bundles (DABs).

WFX streamlines the migration process from Airflow to Databricks by automatically converting Apache Airflow DAG configurations into Databricks Asset Bundle YAML files. The tool preserves the DAG's structure, dependencies, and parameters while adapting them to the Databricks workflow format.

wfx/
├── configs/                                      # Cluster configuration
│   └── clusters/                                 # Databricks cluster specifications
│       ├── dev.json                              # Development cluster config example
│       └── prod.json                             # Production cluster config example
├── inputs/
│   ├── airflow_dags/                             # Source Airflow DAG files
│   │   └── data_processing/                      # Example Airflow DAG directory
│   │       └── workflow_setting.py               # Sample Airflow DAG
│   ├── dbx_workflows/                            # Workflow API compatible JSON
│   │   └── data_processing_tasks_config.json     # Generated task config json
│   └── mappings/                                 # Task mappings from Airflow to Databricks
│       └── data_processing/                      # Example DAG mapping directory
│           └── task_list_all.csv                 # CSV mapping old Airflow tasks to new Databricks tasks
├── resources/
│   └── jobs/                                     # Output DABs YAML files
│       └── sample_databricks_etl.yml             # Generated Databricks workflow yml
├── src/
│   └── wfx/
│       ├── constants/                            # Global settings
│       │   ├── __init__.py
│       │   └── settings.py                       # Path configurations and constants
│       ├── core/                                 # Core conversion logic
│       │   ├── __init__.py
│       │   └── converter.py                      # DABs converter
│       └── processors/                           # Input processors
│           ├── __init__.py
│           ├── airflow.py                        # Airflow DAG parser
│           └── gsheet.py                         # Mapping file reader
└── notebooks/                                    # Jupyter notebooks for interactive use
    └── playbook.ipynb                            # Playbook / Sample conversion notebook

- Task Dependency Preservation: Maintains the exact dependency structure of Airflow DAGs
- Parameter Mapping: Converts Airflow task parameters to Databricks notebook parameters
- Flexible Configuration: Supports customization via mapping files (input task name, output task name, filepaths)
- Support for Multiple Task Types: Handles various Databricks task types (notebook, Python, SQL, etc.)
- CLI Interface: Easy-to-use command line tool + Python playbook (notebook)
- Python 3.11+
- Poetry (for dependency management)

- Clone the repository:
    git clone https://github.com/afaqueahmad7117/wfx.git
    cd wfx
- Install dependencies:
    poetry install
- Activate the Poetry environment:
    poetry env activate
    # After running this command, something like the line below will be printed; run it to activate the env
    # source /Users/.../<env-name>/bin/activate
- Verify installation:
    dabconvert --help

The tool is used as a command-line utility:
dabconvert -i ALL_DAGs_DIR \
    -d DAG_NAME_DIR \
    -o OUTPUT_DIR \
    -w OUTPUT_WORKFLOW_NAME \
    -c CONFIG_PATH

Make sure you've activated the Poetry environment (see the activation step above) before running the command.

| Short Form | Long Form | Description |
|---|---|---|
| -i | --input-dir | Directory containing Airflow DAG definitions |
| -d | --dag-name | Name of the DAG to convert |
| -o | --output-dir | Directory to write the Databricks workflow YAML |
| -w | --workflow-name | Name of the output workflow file |
| -c | --config | Path to the mapping file (CSV format) |

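
To make the flag table concrete, here is a minimal sketch of a parser that accepts the same short and long options. It is not the tool's actual implementation: the use of `argparse`, the `build_parser` helper, and the choice to mark every flag as required are assumptions made for illustration only.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented dabconvert flags."""
    parser = argparse.ArgumentParser(
        prog="dabconvert",
        description="Convert an Airflow DAG into a Databricks Asset Bundle YAML file.",
    )
    # Marking every flag as required is an assumption made for this sketch.
    parser.add_argument("-i", "--input-dir", required=True,
                        help="Directory containing Airflow DAG definitions")
    parser.add_argument("-d", "--dag-name", required=True,
                        help="Name of the DAG to convert")
    parser.add_argument("-o", "--output-dir", required=True,
                        help="Directory to write the Databricks workflow YAML")
    parser.add_argument("-w", "--workflow-name", required=True,
                        help="Name of the output workflow file")
    parser.add_argument("-c", "--config", required=True,
                        help="Path to the mapping file (CSV format)")
    return parser


# Parse the same arguments used in the README's short-form example.
args = build_parser().parse_args([
    "-i", "inputs/airflow_dags",
    "-d", "data_processing",
    "-o", "resources/jobs",
    "-w", "sample_databricks_etl",
    "-c", "inputs/mappings/data_processing/task_list_all.csv",
])
print(args.input_dir, args.dag_name, args.workflow_name)
```

argparse turns each long option into an attribute with underscores, so `--input-dir` is read back as `args.input_dir`.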
Example with short form arguments:
dabconvert -i inputs/airflow_dags -d data_processing \
    -o resources/jobs -w sample_databricks_etl \
    -c inputs/mappings/data_processing/task_list_all.csv

The mapping file should be a CSV with the following columns:
CURRENT_WORKFLOW_NAME,OLD_TASK_NAME,NEW_TASK_NAME,NEW_GIT_PATH,DATABRICKS_WORKFLOW_NAME
data_processing,validate_input_data,data_validation_task,/repos/data_team/etl/validation,sample_databricks_etl
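
As a rough illustration of how such a mapping file can be consumed, the sketch below reads the sample rows with Python's standard `csv` module and indexes them by the old Airflow task name. The helper `load_task_mapping` is hypothetical and is not taken from the repository; only the column names and the sample row come from the example above.

```python
import csv
import io

# Sample content copied from the mapping example above.
MAPPING_CSV = """\
CURRENT_WORKFLOW_NAME,OLD_TASK_NAME,NEW_TASK_NAME,NEW_GIT_PATH,DATABRICKS_WORKFLOW_NAME
data_processing,validate_input_data,data_validation_task,/repos/data_team/etl/validation,sample_databricks_etl
"""


def load_task_mapping(csv_file, dag_name):
    """Return {old_task_name: row_dict} for the rows that belong to dag_name."""
    reader = csv.DictReader(csv_file)
    return {
        row["OLD_TASK_NAME"]: row
        for row in reader
        if row["CURRENT_WORKFLOW_NAME"] == dag_name
    }


# An in-memory file keeps the sketch self-contained; a real run would
# pass an open handle to the CSV file instead.
mapping = load_task_mapping(io.StringIO(MAPPING_CSV), "data_processing")
entry = mapping["validate_input_data"]
print(entry["NEW_TASK_NAME"], entry["NEW_GIT_PATH"])
```

Filtering on CURRENT_WORKFLOW_NAME lets one mapping file hold rows for several DAGs, which is one plausible reading of why that column exists.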
The repository comes with a complete working example that you can use as a reference.
- Input Airflow DAG: inputs/airflow_dags/data_processing/workflow_setting.py
- Task Mapping File: inputs/mappings/data_processing/task_list_all.csv
- Generated JSON: inputs/dbx_workflows/data_processing_tasks_config.json
- Output DABs YAML: resources/jobs/sample_databricks_etl.yml

validate_input_data = BashOperator(
    task_id="validate_input_data",
    bash_command='echo "Checking input data..." && sleep 5',
    dag=dag,
)

process_data = PythonOperator(
    task_id="process_data", python_callable=process_data, provide_context=True, dag=dag
)

export_processed_data = BashOperator(
    task_id="export_processed_data",
    bash_command='echo "Exporting processed data..." && sleep 5',
    dag=dag,
)

generate_report = BashOperator(
    task_id="generate_report",
    bash_command='echo "Generating report..." && sleep 5',
    dag=dag,
)

validate_input_data >> process_data >> [export_processed_data, generate_report]

resources:
  jobs:
    sample_databricks_etl:
      name: sample_databricks_etl
      tasks:
        - task_key: data_validation_task
          depends_on: []
          notebook_task:
            notebook_path: /repos/data_team/etl/validation
            source: WORKSPACE
        - task_key: data_processing_task
          depends_on:
            - task_key: data_validation_task
          notebook_task:
            notebook_path: /repos/data_team/etl/processing
            source: WORKSPACE
        - task_key: data_export_task
          depends_on:
            - task_key: data_processing_task
          notebook_task:
            notebook_path: /repos/data_team/etl/export
            source: WORKSPACE
        - task_key: report_generation_task
          depends_on:
            - task_key: data_processing_task
          notebook_task:
            notebook_path: /repos/data_team/etl/reporting
            source: WORKSPACE

The repository includes a sample Jupyter notebook that demonstrates the conversion process step by step. You can use it for interactive development and testing:
cd notebooks
jupyter notebook playbook.ipynb

This notebook shows:
- How to set up the input Airflow DAG and paths
- How to specify all the parameters (input DAG path, DAG name, workflow name, DAB YAML output path)
- How to generate the Databricks workflow JSON (Workflow API compatible)
- How to convert it to DAB-compatible YAML format
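
The last two steps can be sketched in plain Python. This is not the notebook's or the converter's actual code: the helper names `build_job` and `wrap_as_bundle` are hypothetical, and the upstream-dependency dictionary is written out by hand from the sample DAG rather than parsed from it. The task keys and notebook paths are the ones shown in the sample output.

```python
import json

# Upstream dependencies of the sample DAG, written out by hand:
# validate_input_data >> process_data >> [export_processed_data, generate_report]
UPSTREAM = {
    "validate_input_data": [],
    "process_data": ["validate_input_data"],
    "export_processed_data": ["process_data"],
    "generate_report": ["process_data"],
}

# Old Airflow task id -> (new task key, notebook path), as in the sample output.
TASK_MAP = {
    "validate_input_data": ("data_validation_task", "/repos/data_team/etl/validation"),
    "process_data": ("data_processing_task", "/repos/data_team/etl/processing"),
    "export_processed_data": ("data_export_task", "/repos/data_team/etl/export"),
    "generate_report": ("report_generation_task", "/repos/data_team/etl/reporting"),
}


def build_job(workflow_name, upstream, task_map):
    """Build a job dictionary whose tasks keep the DAG's dependency edges."""
    tasks = []
    for old_id, parents in upstream.items():
        new_key, notebook_path = task_map[old_id]
        tasks.append({
            "task_key": new_key,
            # Each upstream Airflow task becomes a depends_on entry under its new key.
            "depends_on": [{"task_key": task_map[parent][0]} for parent in parents],
            "notebook_task": {"notebook_path": notebook_path, "source": "WORKSPACE"},
        })
    return {"name": workflow_name, "tasks": tasks}


def wrap_as_bundle(workflow_name, job):
    """Nest the job under resources.jobs, matching the sample YAML layout."""
    return {"resources": {"jobs": {workflow_name: job}}}


job = build_job("sample_databricks_etl", UPSTREAM, TASK_MAP)
bundle = wrap_as_bundle("sample_databricks_etl", job)
print(json.dumps(bundle, indent=2))
```

The sketch prints JSON to stay within the standard library. Serializing `bundle` with a YAML library instead (for example PyYAML's `yaml.safe_dump`, if it is installed) would give a file with the same nesting as the sample output.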

- Fork the repository
- Create a feature branch: git checkout -b feature/my-feature
- Commit your changes: git commit -m 'Add my feature'
- Push to the branch: git push origin feature/my-feature
- Submit a pull request