WFX - Workflow Migration Utility For Databricks

A tool for migrating Airflow DAGs to Databricks Asset Bundles (DABs).

Overview

WFX streamlines the migration process from Airflow to Databricks by automatically converting Apache Airflow DAG configurations into Databricks Asset Bundle YAML files. The tool preserves the DAG's structure, dependencies, and parameters while adapting them to the Databricks workflow format.

Project Structure

```
wfx/
├── configs/                          # Cluster configuration
│   └── clusters/                     # Databricks cluster specifications
│       ├── dev.json                  # Development cluster config example
│       └── prod.json                 # Production cluster config example
├── inputs/
│   ├── airflow_dags/                 # Source Airflow DAG files
│   │   └── data_processing/          # Example Airflow DAG directory
│   │       └── workflow_setting.py   # Sample Airflow DAG
│   ├── dbx_workflows/                # Workflow API compatible JSON
│   │   └── data_processing_tasks_config.json  # Generated task config JSON
│   └── mappings/                     # Task mappings from Airflow to Databricks
│       └── data_processing/          # Example DAG mapping directory
│           └── task_list_all.csv     # CSV mapping old Airflow tasks to new Databricks tasks
├── resources/
│   └── jobs/                         # Output DABs YAML files
│       └── sample_databricks_etl.yml # Generated Databricks workflow YAML
├── src/
│   └── wfx/
│       ├── constants/                # Global settings
│       │   ├── __init__.py
│       │   └── settings.py           # Path configurations and constants
│       ├── core/                     # Core conversion logic
│       │   ├── __init__.py
│       │   └── converter.py          # DABs converter
│       └── processors/               # Input processors
│           ├── __init__.py
│           ├── airflow.py            # Airflow DAG parser
│           └── gsheet.py             # Mapping file reader
└── notebooks/                        # Jupyter notebooks for interactive use
    └── playbook.ipynb                # Playbook / Sample conversion notebook
```

Features

Task Dependency Preservation: Maintains the exact dependency structure of Airflow DAGs
Parameter Mapping: Converts Airflow task parameters to Databricks notebook parameters
Flexible Configuration: Supports customization via mapping files (input task name, output task name, file paths)
Support for Multiple Task Types: Handles various Databricks task types (notebook, Python, SQL, etc.)
CLI: Easy-to-use command-line tool, plus a Python playbook notebook for interactive use

Installation

Prerequisites

Python 3.11+
Poetry (for dependency management)

Setup

Clone the repository:

```
git clone https://github.com/afaqueahmad7117/wfx.git
cd wfx
```

Install dependencies:
```
poetry install
```

Activate the Poetry environment:

```
poetry env activate
# This prints an activation command similar to the one below; run that command to activate the environment:
# source /Users/.../<env-name>/bin/activate
```

Verify installation:
```
dabconvert --help
```

Usage

Command Line

The tool is used as a command-line utility:

```
dabconvert -i ALL_DAGs_DIR \
           -d DAG_NAME_DIR \
           -o OUTPUT_DIR \
           -w OUTPUT_WORKFLOW_NAME \
           -c CONFIG_PATH
```

Make sure you've activated the Poetry environment (see Setup) before running the command.

Command Line Arguments

| Short Form | Long Form | Description |
|---|---|---|
| `-i` | `--input-dir` | Directory containing Airflow DAG definitions |
| `-d` | `--dag-name` | Name of the DAG to convert |
| `-o` | `--output-dir` | Directory to write the Databricks workflow YAML |
| `-w` | `--workflow-name` | Name of the output workflow file |
| `-c` | `--config` | Path to the mapping file (CSV format) |

Example with short form arguments:

```
dabconvert -i inputs/airflow_dags -d data_processing \
           -o resources/jobs -w sample_databricks_etl \
           -c inputs/mappings/data_processing/task_list_all.csv
```
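
To make the flag layout concrete, here is a minimal sketch of how the documented options could be declared with Python's `argparse`. It is an illustration only: the function name `build_parser`, the help strings, and the choice to mark every flag as required are assumptions, and the actual `dabconvert` entry point may be implemented differently.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags documented in the table above."""
    parser = argparse.ArgumentParser(
        prog="dabconvert",
        description="Convert an Airflow DAG into a Databricks Asset Bundle YAML file.",
    )
    # Marking every flag as required is an assumption, not confirmed by the tool.
    parser.add_argument("-i", "--input-dir", required=True,
                        help="Directory containing Airflow DAG definitions")
    parser.add_argument("-d", "--dag-name", required=True,
                        help="Name of the DAG to convert")
    parser.add_argument("-o", "--output-dir", required=True,
                        help="Directory to write the Databricks workflow YAML")
    parser.add_argument("-w", "--workflow-name", required=True,
                        help="Name of the output workflow file")
    parser.add_argument("-c", "--config", required=True,
                        help="Path to the mapping file (CSV format)")
    return parser


# Parse the same arguments as the short-form example.
args = build_parser().parse_args([
    "-i", "inputs/airflow_dags",
    "-d", "data_processing",
    "-o", "resources/jobs",
    "-w", "sample_databricks_etl",
    "-c", "inputs/mappings/data_processing/task_list_all.csv",
])
print(args.dag_name, args.workflow_name)
```

With this layout, `argparse` exposes each long option as an attribute with dashes replaced by underscores (for example `args.input_dir`).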

Mapping File Format

The mapping file should be a CSV with the following columns:

```
CURRENT_WORKFLOW_NAME,OLD_TASK_NAME,NEW_TASK_NAME,NEW_GIT_PATH,DATABRICKS_WORKFLOW_NAME
data_processing,validate_input_data,data_validation_task,/repos/data_team/etl/validation,sample_databricks_etl
```
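
As a rough illustration of how a mapping file in this format could be consumed, the sketch below reads the CSV with the standard library and builds a lookup from old Airflow task names to their Databricks counterparts. The function `load_task_mapping` and the shape of the returned dictionary are hypothetical; they are not taken from the WFX source.

```python
import csv
import io

# Sample content copied from the mapping file format shown above.
SAMPLE_CSV = """\
CURRENT_WORKFLOW_NAME,OLD_TASK_NAME,NEW_TASK_NAME,NEW_GIT_PATH,DATABRICKS_WORKFLOW_NAME
data_processing,validate_input_data,data_validation_task,/repos/data_team/etl/validation,sample_databricks_etl
"""


def load_task_mapping(csv_file, dag_name):
    """Return {old_task_name: {...}} for the rows that belong to dag_name."""
    mapping = {}
    for row in csv.DictReader(csv_file):
        # Skip rows that describe a different Airflow DAG.
        if row["CURRENT_WORKFLOW_NAME"] != dag_name:
            continue
        mapping[row["OLD_TASK_NAME"]] = {
            "task_key": row["NEW_TASK_NAME"],
            "notebook_path": row["NEW_GIT_PATH"],
            "workflow": row["DATABRICKS_WORKFLOW_NAME"],
        }
    return mapping


mapping = load_task_mapping(io.StringIO(SAMPLE_CSV), "data_processing")
print(mapping["validate_input_data"]["task_key"])  # data_validation_task
```

To read a real file instead of the inline sample, pass an open file handle, for example `open(path, newline="")`.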

Example

The repository comes with a complete working example that you can use as a reference.

Sample Files Location

Input Airflow DAG: inputs/airflow_dags/data_processing/workflow_setting.py
Task Mapping File: inputs/mappings/data_processing/task_list_all.csv
Generated JSON: inputs/dbx_workflows/data_processing_tasks_config.json
Output DABs YAML: resources/jobs/sample_databricks_etl.yml

Input Airflow DAG Example

```
validate_input_data = BashOperator(
    task_id="validate_input_data",
    bash_command='echo "Checking input data..." && sleep 5',
    dag=dag,
)

process_data = PythonOperator(
    task_id="process_data", python_callable=process_data, provide_context=True, dag=dag
)

export_processed_data = BashOperator(
    task_id="export_processed_data",
    bash_command='echo "Exporting processed data..." && sleep 5',
    dag=dag,
)

generate_report = BashOperator(
    task_id="generate_report",
    bash_command='echo "Generating report..." && sleep 5',
    dag=dag,
)

validate_input_data >> process_data >> [export_processed_data, generate_report]
```

Output Databricks Workflow YAML Example

```
resources:
  jobs:
    sample_databricks_etl:
      name: sample_databricks_etl
      tasks:
      - task_key: data_validation_task
        depends_on: []
        notebook_task:
          notebook_path: /repos/data_team/etl/validation
          source: WORKSPACE
      - task_key: data_processing_task
        depends_on:
        - task_key: data_validation_task
        notebook_task:
          notebook_path: /repos/data_team/etl/processing
          source: WORKSPACE
      - task_key: data_export_task
        depends_on:
        - task_key: data_processing_task
        notebook_task:
          notebook_path: /repos/data_team/etl/export
          source: WORKSPACE
      - task_key: report_generation_task
        depends_on:
        - task_key: data_processing_task
        notebook_task:
          notebook_path: /repos/data_team/etl/reporting
          source: WORKSPACE
```
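
The relationship between the Airflow dependency chain and the `depends_on` entries above can be sketched in a few lines of Python. This is not the WFX converter itself: `build_tasks`, `EDGES`, and `TASK_MAP` are hypothetical names, and the old-to-new pairing for the last three tasks is inferred from the sample output rather than quoted from the mapping file.

```python
# Upstream -> downstream edges of the sample DAG, written with Airflow task IDs.
EDGES = [
    ("validate_input_data", "process_data"),
    ("process_data", "export_processed_data"),
    ("process_data", "generate_report"),
]

# Old task ID -> (new task key, notebook path), as a mapping file would supply.
TASK_MAP = {
    "validate_input_data": ("data_validation_task", "/repos/data_team/etl/validation"),
    "process_data": ("data_processing_task", "/repos/data_team/etl/processing"),
    "export_processed_data": ("data_export_task", "/repos/data_team/etl/export"),
    "generate_report": ("report_generation_task", "/repos/data_team/etl/reporting"),
}


def build_tasks(edges, task_map):
    """Translate Airflow edges into Databricks-style task dictionaries."""
    # Collect the upstream tasks of every task, keyed by the old Airflow ID.
    upstream = {old_id: [] for old_id in task_map}
    for parent, child in edges:
        upstream[child].append(parent)

    tasks = []
    for old_id, (task_key, notebook_path) in task_map.items():
        tasks.append({
            "task_key": task_key,
            # Rename each upstream task to its new Databricks task key.
            "depends_on": [{"task_key": task_map[p][0]} for p in upstream[old_id]],
            "notebook_task": {"notebook_path": notebook_path, "source": "WORKSPACE"},
        })
    return tasks


tasks = build_tasks(EDGES, TASK_MAP)
for task in tasks:
    print(task["task_key"], [dep["task_key"] for dep in task["depends_on"]])
```

Each Airflow edge becomes one `depends_on` entry on the downstream task, which is why both `data_export_task` and `report_generation_task` list `data_processing_task` as their only dependency.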

Interactive Development

The repository includes a sample Jupyter notebook that demonstrates the conversion process step by step. You can use it for interactive development and testing:

```
cd notebooks
jupyter notebook playbook.ipynb
```

This notebook shows:

How to set up the input Airflow DAG and paths
How to specify all the parameters (input DAG path, DAG name, workflow name, DAB YAML output path)
How to generate the Databricks workflow JSON (Workflow API compatible)
How to convert it to DAB-compatible YAML format
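
The last step, going from a Workflow API style job definition to the bundle layout, mostly amounts to nesting the job settings under `resources.jobs.<workflow name>`. The sketch below shows that nesting under stated assumptions: `to_bundle` is a hypothetical helper, the input dictionary is a hand-written stand-in for the generated JSON, and the result is printed as JSON to keep the example dependency-free (a YAML library such as PyYAML would be needed to write the final `.yml` file).

```python
import json


def to_bundle(workflow_name, job_settings):
    """Nest job settings under resources.jobs.<workflow_name>."""
    # Put the name first, then carry over every other setting unchanged.
    job = {"name": workflow_name}
    job.update({k: v for k, v in job_settings.items() if k != "name"})
    return {"resources": {"jobs": {workflow_name: job}}}


# Hand-written stand-in for the generated Workflow API compatible JSON.
job_settings = {
    "tasks": [
        {
            "task_key": "data_validation_task",
            "depends_on": [],
            "notebook_task": {
                "notebook_path": "/repos/data_team/etl/validation",
                "source": "WORKSPACE",
            },
        }
    ]
}

bundle = to_bundle("sample_databricks_etl", job_settings)
print(json.dumps(bundle, indent=2))
```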

Contributing

Fork the repository
Create a feature branch: git checkout -b feature/my-feature
Commit your changes: git commit -m 'Add my feature'
Push to the branch: git push origin feature/my-feature
Submit a pull request
