Conversation

askumar27 (Contributor) commented Dec 9, 2025

feat(ingestion): Add Azure Data Factory connector

📋 Summary

Add a new metadata ingestion connector for Azure Data Factory (ADF) that extracts pipelines, activities, datasets, lineage, and execution history into DataHub.

🎯 Motivation

Azure Data Factory is a widely-used cloud ETL/ELT service for data integration and orchestration. Organizations using ADF need visibility into their data pipelines, lineage, and execution history within DataHub to:

  • Discover and understand data pipelines across their Azure infrastructure
  • Track data lineage from source to destination systems
  • Monitor pipeline execution history and status
  • Enable governance and compliance for data workflows

This connector fills a gap in DataHub's Azure ecosystem coverage, complementing existing Azure connectors (Azure AD, Azure Blob Storage).

🔧 Changes Overview

New Features

  • Azure Data Factory Source Connector - Full metadata extraction from ADF

    • Data Factories → DataHub Containers
    • Pipelines → DataHub DataFlows
    • Activities → DataHub DataJobs (Copy, Data Flow, Lookup, ExecutePipeline, etc.)
    • Datasets → DataHub Datasets (mapped by linked service type)
    • Pipeline Runs → DataHub DataProcessInstances (optional)
  • Table-Level Lineage Extraction

    • Copy Activity input/output dataset mapping
    • Data Flow source/sink extraction
    • Lookup Activity dataset references
    • Platform mapping for 40+ linked service types (Snowflake, SQL, BigQuery, S3, Synapse, etc.); an illustrative excerpt follows this list
  • Pipeline-to-Pipeline Lineage

    • ExecutePipeline activities create lineage to child pipeline's first activity
    • Enables tracing orchestration hierarchies across nested pipelines
    • Direction: ExecutePipeline → child pipeline's first activity
  • Data Flow Script Extraction

    • Stores ADF Data Flow transformation scripts in dataTransformLogic aspect
    • Enables viewing transformation logic in DataHub UI
  • Execution History (Optional)

    • Pipeline run status, duration, and timestamps
    • Trigger information
    • Run parameters
    • Configurable lookback period (1-90 days)
  • Unified Azure Authentication Module

    • Reusable AzureCredentialConfig class for future Azure connectors
    • Supports: Service Principal, Managed Identity, Azure CLI, DefaultAzureCredential
  • Stateful Ingestion

    • Stale entity removal support
    • Tracks previously ingested entities
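
The linked-service platform mapping referenced above translates ADF type names into DataHub platform identifiers. A hypothetical excerpt (keys and values assumed from the examples in this PR; the real map covers 40+ types):

```python
# Illustrative subset only -- entries assumed from the examples in this PR,
# not the connector's actual map.
LINKED_SERVICE_PLATFORM_MAP: dict[str, str] = {
    "AzureSqlDatabase": "mssql",
    "Snowflake": "snowflake",
    "GoogleBigQuery": "bigquery",
    "AmazonS3": "s3",
    "AzureBlobStorage": "abs",  # Azure storage types consolidated under "abs"
}
```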

Dependencies

  • azure-identity>=1.21.0 - Azure authentication
  • azure-mgmt-datafactory>=9.0.0 - ADF Management SDK

🏗️ Architecture/Design Notes

SDK V2 Implementation

This connector uses DataHub SDK V2 (datahub.sdk.Container, datahub.sdk.DataFlow, datahub.sdk.DataJob), following the pattern established by modern connectors. Benefits:

  • Type-safe entity creation
  • Automatic aspect emission (dataPlatformInstance, status, browsePathsV2)
  • Proper browse path hierarchy via parent_container
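
A minimal sketch of this pattern, following the public SDK V2 examples (factory, pipeline, and activity names here are hypothetical; the connector emits these entities through the ingestion framework rather than constructing them ad hoc):

```python
from datahub.sdk import DataFlow, DataJob

# Pipeline -> DataFlow; the "{factory}.{pipeline}" naming mirrors the URN strategy below
flow = DataFlow(
    platform="azure-data-factory",
    name="myfactory.my_pipeline",  # hypothetical factory/pipeline names
    description="Example ADF pipeline",
)

# Activity -> DataJob, attached to its parent flow
job = DataJob(name="CopyFromBlobToSql", flow=flow)  # hypothetical activity name

# Aspects such as status and browsePathsV2 are attached automatically
print(flow.urn, job.urn)
```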

Entity Hierarchy

```
Azure Data Factory (Platform)
└── Data Factory (Container)
    └── Pipeline (DataFlow)
        └── Activity (DataJob)
            └── Pipeline Run (DataProcessInstance) [optional]
```

URN Strategy

Pipeline URNs include factory name for uniqueness across multiple factories:

```
urn:li:dataFlow:(azure-data-factory,{factory_name}.{pipeline_name},{env})
urn:li:dataJob:(urn:li:dataFlow:(azure-data-factory,{factory_name}.{pipeline_name},{env}),{activity_name})
```

This ensures uniqueness when multiple factories have identically-named pipelines.
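
For illustration, the same URNs can be produced with DataHub's builder helpers (example names hypothetical):

```python
from datahub.emitter.mce_builder import make_data_flow_urn, make_data_job_urn

factory_name, pipeline_name, activity_name = "myfactory", "my_pipeline", "CopyData"

flow_urn = make_data_flow_urn("azure-data-factory", f"{factory_name}.{pipeline_name}", "PROD")
# urn:li:dataFlow:(azure-data-factory,myfactory.my_pipeline,PROD)

job_urn = make_data_job_urn("azure-data-factory", f"{factory_name}.{pipeline_name}", activity_name, "PROD")
# urn:li:dataJob:(urn:li:dataFlow:(azure-data-factory,myfactory.my_pipeline,PROD),CopyData)
```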

Lineage Types

| Lineage Type | Source | Target | Mechanism |
|---|---|---|---|
| Dataset → DataJob | Copy Activity inputs | DataJob inlets | `dataJobInputOutput.inputDatasets` |
| DataJob → Dataset | Copy Activity outputs | DataJob outlets | `dataJobInputOutput.outputDatasets` |
| DataJob → DataJob | ExecutePipeline | Child's first activity | `dataJobInputOutput.inputDatajobs` |
| Data Flow Sources | ExecuteDataFlow | Source datasets | Parsed from Data Flow definition |
| Data Flow Sinks | ExecuteDataFlow | Sink datasets | Parsed from Data Flow definition |
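
The first three mechanisms in the table all land in the same aspect. A sketch of emitting it for a hypothetical Copy Activity (dataset names illustrative):

```python
from datahub.emitter.mce_builder import make_data_job_urn, make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.metadata.schema_classes import DataJobInputOutputClass

job_urn = make_data_job_urn("azure-data-factory", "myfactory.my_pipeline", "CopyData", "PROD")

# Hypothetical Copy Activity: reads a blob, writes an Azure SQL table
io_aspect = DataJobInputOutputClass(
    inputDatasets=[make_dataset_urn("abs", "landing/orders.csv", "PROD")],
    outputDatasets=[make_dataset_urn("mssql", "mydb.dbo.orders", "PROD")],
    inputDatajobs=[],  # filled for ExecutePipeline -> child's-first-activity edges
)
mcp = MetadataChangeProposalWrapper(entityUrn=job_urn, aspect=io_aspect)
```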

Reusable Azure Auth

The AzureCredentialConfig class in azure/azure_auth.py is designed to be reused by future Azure connectors (e.g., Azure Synapse, Azure Purview).
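
A minimal sketch of what such a class could look like, assuming pydantic-style validation (field names taken from the recipe examples below; `get_credential` is a hypothetical method name):

```python
from typing import Literal, Optional

from azure.identity import (
    AzureCliCredential,
    ClientSecretCredential,
    DefaultAzureCredential,
    ManagedIdentityCredential,
)
from pydantic import BaseModel


class AzureCredentialConfig(BaseModel):
    authentication_method: Literal[
        "service_principal", "managed_identity", "cli", "default"
    ] = "default"
    client_id: Optional[str] = None
    client_secret: Optional[str] = None
    tenant_id: Optional[str] = None

    def get_credential(self):
        # Pick the azure-identity credential matching the configured method
        if self.authentication_method == "service_principal":
            return ClientSecretCredential(
                tenant_id=self.tenant_id,
                client_id=self.client_id,
                client_secret=self.client_secret,
            )
        if self.authentication_method == "managed_identity":
            return ManagedIdentityCredential(client_id=self.client_id)
        if self.authentication_method == "cli":
            return AzureCliCredential()
        return DefaultAzureCredential()
```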

🧪 Testing

Unit Tests (35 tests)

  • Configuration validation
  • Linked service platform mapping (40+ service types)
  • Activity subtype mapping (20+ activity types)
  • Table/file name extraction logic
  • Run status mapping
  • Resource group extraction

Integration Tests (16 tests, 10 golden files)

| Test Scenario | Description | Golden File |
|---|---|---|
| Basic ingestion | Factories, pipelines, activities | adf_basic_golden.json |
| Execution history | Pipeline runs and status | adf_with_runs_golden.json |
| Platform instance | Multi-instance configuration | adf_platform_instance_golden.json |
| Nested pipelines | ExecutePipeline with children | adf_nested_golden.json |
| ForEach loops | Iterative processing | adf_foreach_golden.json |
| Branching | If-Condition/Switch activities | adf_branching_golden.json |
| Data Flows | Mapping Data Flow lineage | adf_dataflow_golden.json |
| Multi-source ETL | SQL→Blob→Synapse chain | adf_multisource_golden.json |
| Diverse activities | 9+ activity types | adf_diverse_golden.json |
| Mixed dependencies | Pipeline + Dataset lineage | adf_mixed_deps_golden.json |

Test Coverage

  • All tests passing: 51/51 (35 unit + 16 integration)
  • Golden files validate complete MCP output
  • Mocked Azure API responses based on real SDK structures

📊 Impact Assessment

| Aspect | Assessment |
|---|---|
| Affected Components | New connector only; no changes to existing code |
| Breaking Changes | None |
| Performance Impact | N/A (new feature) |
| Risk Level | Low; isolated new connector with comprehensive tests |

🚀 Deployment Notes

Installation

```bash
pip install 'acryl-datahub[azure-data-factory]'
```

Required Configuration

```yaml
source:
  type: azure-data-factory
  config:
    subscription_id: ${AZURE_SUBSCRIPTION_ID}
    credential:
      authentication_method: service_principal
      client_id: ${AZURE_CLIENT_ID}
      client_secret: ${AZURE_CLIENT_SECRET}
      tenant_id: ${AZURE_TENANT_ID}
```

Azure Permissions Required

  • Reader role on Data Factory resources (minimum)
  • Data Factory Contributor for execution history

Platform Instance Usage

Use platform_instance when you have multiple ADF deployments that need to be distinguished:

```yaml
# Scenario: Same pipeline names across dev/prod factories
source:
  type: azure-data-factory
  config:
    subscription_id: ${AZURE_SUBSCRIPTION_ID}
    platform_instance: "production"  # Distinguishes from the dev instance
```

📁 Files Changed

| Path | Description |
|---|---|
| setup.py | Plugin registration and dependencies |
| source/azure/azure_auth.py | Reusable Azure authentication config |
| source/azure_data_factory/adf_source.py | Main connector implementation |
| source/azure_data_factory/adf_config.py | Configuration classes |
| source/azure_data_factory/adf_client.py | Azure API client wrapper |
| source/azure_data_factory/adf_models.py | Pydantic models for API responses |
| source/azure_data_factory/adf_report.py | Ingestion report metrics |
| docs/sources/azure_data_factory/ | User documentation |
| tests/unit/azure_data_factory/ | Unit tests (35 tests) |
| tests/integration/azure_data_factory/ | Integration tests (16 tests, 10 golden files) |

📚 Documentation

✅ Checklist

  • Code follows DataHub SDK V2 patterns
  • Unit tests added (35 tests)
  • Integration tests with golden files (16 tests, 10 golden files)
  • Documentation created
  • Linting passes (ruff, mypy)
  • No breaking changes to existing code
  • Pipeline-to-pipeline lineage implemented
  • Dataset lineage implemented
  • Data Flow script extraction implemented
  • Mixed dependencies test validates both lineage types

🔗 References

…ta ingestion

- Implemented a new connector to extract metadata from Azure Data Factory, including Data Factories, Pipelines, Activities, and Dataset lineage.
- Added support for multiple authentication methods: Service Principal, Managed Identity, Azure CLI, and DefaultAzureCredential.
- Introduced configuration options for filtering factories and pipelines, as well as options for including execution history and lineage extraction.
- Created comprehensive documentation and example recipes for easy setup and usage.
- Added integration and unit tests to ensure functionality and reliability of the connector.
- Added support for Azure Data Factory logos and updated constants for platform identification.
- Implemented pipeline-to-pipeline lineage tracking for ExecutePipeline activities, enabling better visibility of dependencies in the DataHub UI.
- Updated documentation to reflect new features and improved metadata ingestion capabilities.
- Refactored code for better clarity and maintainability, including type definitions for ADF API responses.
- Adjusted test cases to ensure accuracy with the new changes.
alwaysmeticulous bot commented Dec 9, 2025

✅ Meticulous spotted 0 visual differences across 982 screens tested.

Meticulous evaluated ~8 hours of user flows against your PR. Last updated for commit c6943ee.

codecov bot commented Dec 9, 2025

Bundle Report

Changes will increase total bundle size by 92.83kB (0.32%) ⬆️. This is within the configured threshold ✅

Detailed changes
| Bundle name | Size | Change |
|---|---|---|
| datahub-react-web-esm | 28.85MB | 92.83kB (0.32%) ⬆️ |

Affected Assets, Files, and Routes:


Assets Changed:

| Asset Name | Size Change | Total Size | Change (%) |
|---|---|---|---|
| assets/index-*.js | 92.83kB | 19.23MB | 0.49% |

Files in assets/index-*.js:

  • ./src/app/ingest/source/builder/constants.ts → Total Size: 7.08kB

  • ./src/app/ingestV2/source/builder/constants.ts → Total Size: 6.22kB

  • ./src/images/azuredatafactorylogo.svg → Total Size: 1.8kB

…integration

- Implemented support for mixed pipeline and dataset dependencies in Azure Data Factory, allowing for both pipeline-to-pipeline and dataset lineage tracking.
- Updated documentation to reflect new features and improved clarity on lineage extraction.
- Added integration tests to validate the handling of mixed dependencies, ensuring accurate lineage representation in the DataHub UI.
- Refactored existing tests to accommodate new scenarios and ensure comprehensive coverage of ADF functionalities.
… recipes for Azure Data Factory connector

- Introduced detailed documentation for the Azure Data Factory connector, covering metadata extraction, prerequisites, and configuration options.
- Added example recipes to facilitate quick setup and usage of the connector.
- Documented various authentication methods and their configurations, enhancing user guidance.
- Included information on lineage extraction capabilities and entity mapping for better understanding of the integration.
…n tests

- Replace X | Y union syntax with Optional[X] for Python 3.9 compatibility
- Add isinstance checks before accessing source.report for proper type narrowing
- Add missing type annotation for tmp_path parameter
- Add azure-data-factory to full_test_dev_requirements in setup.py
- Ensures azure.mgmt.datafactory is installed during test runs
- Fixes ModuleNotFoundError in unit/integration tests
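
For context, the compatibility change in the first bullet looks like this (illustrative; the helper is hypothetical):

```python
from typing import Optional

# `-> str | None` requires Python 3.10+; Optional[str] also works on Python 3.9.
def first_or_none(items: list) -> Optional[str]:
    return items[0] if items else None
```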
Comment on lines +23 to +26
| **Service Principal** | Production environments | `authentication_method: service_principal` |
| **Managed Identity** | Azure-hosted deployments (VMs, AKS, App Service) | `authentication_method: managed_identity` |
| **Azure CLI** | Local development | `authentication_method: cli` (run `az login` first) |
| **DefaultAzureCredential** | Flexible environments | `authentication_method: default` |
Contributor:

From what I see, only service principal config validation is tested but not the auth itself.
Is there any way we could properly test all authentication mechanisms?

Contributor Author:

Correct - we test configuration validation only (required fields, valid combinations). Testing actual authentication would require:

  1. Live Azure credentials - will be added to connector tests
  2. Mock Azure AD token endpoints (complex) - can give it a shot

Contributor:

maybe this is easier to cover in connector-tests

Contributor Author:

yes, that's the plan :)

```python
MAX_PARAMETER_VALUE_LENGTH = 100  # Truncate long parameter values

# Mapping of ADF linked service types to DataHub platforms
LINKED_SERVICE_PLATFORM_MAP: dict[str, str] = {
```
Contributor:

is there any public doc with all existing platform keys that we can reference here?

Contributor Author:

Updated code comments: added a link to metadata-service/configuration/src/main/resources/bootstrap_mcps/data-platforms.yaml, which is the canonical source of platform identifiers.

Contributor:

I meant public doc for ADF linked services list 😅

Contributor Author:

Not sure if there is one 🤔

sgomezvillamor (Contributor) commented Dec 11, 2025:

Metrics and logs look really good.

As an extra bonus, I just missed:

  • some more fine-grained API call tracking (counters and time)
  • some info/debug logs along the code
  • lineage_edges_extracted could be split for different types of lineage

Contributor Author:

  • Added api_call_counts_by_type: Dict[str, int] for granular tracking
  • Added total_api_response_time_seconds: float for timing
  • Split lineage into dataset_lineage_extracted, pipeline_lineage_extracted, dataflow_lineage_extracted
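
A sketch of how those fields might sit on the connector's report (class name hypothetical; field names taken from the bullets above):

```python
from dataclasses import dataclass, field
from typing import Dict

from datahub.ingestion.api.source import SourceReport


@dataclass
class ADFSourceReport(SourceReport):
    # Per-API-call counters and cumulative timing
    api_call_counts_by_type: Dict[str, int] = field(default_factory=dict)
    total_api_response_time_seconds: float = 0.0
    # Lineage counters split by type
    dataset_lineage_extracted: int = 0
    pipeline_lineage_extracted: int = 0
    dataflow_lineage_extracted: int = 0
```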

Contributor:

super. thanks!

What about "some info/debug logs along the code"? Yesterday I just found a couple of info logs.

Contributor Author:

Added some more :) trying to make sure there's no duplicate info from the CLI report

…king

- Updated linked service mappings to consolidate Azure storage types under a single identifier (`abs`).
- Improved configuration options to enable column lineage and execution history extraction by default.
- Enhanced lineage reporting to differentiate between dataset, pipeline, and dataflow lineage types.
- Refactored API call tracking for better granularity and added support for timing metrics.
- Updated documentation to clarify naming rules, uniqueness handling, and case sensitivity in Azure Data Factory.
- Adjusted integration tests to reflect changes in platform mappings and lineage extraction logic.
…neage caching

- Removed default inclusion of datasets, linked services, and triggers from the Azure Data Factory configuration.
- Updated lineage caching logic to rely on a single `include_lineage` option for better clarity and efficiency.
- Adjusted related documentation to reflect the changes in configuration and caching behavior.
…ced metadata tracking

- Added functionality to emit activity runs as DataProcessInstance entities linked to DataJobs, improving the granularity of execution history.
- Introduced a new method `_emit_activity_runs` to handle the extraction and mapping of activity run properties, including status, duration, and error handling.
- Updated integration tests to validate the extraction of activity runs and their properties, ensuring accurate representation in the DataHub UI.
- Enhanced unit tests to cover activity run property extraction and URN mapping, ensuring robustness in handling various scenarios.
…stion process

- Added detailed logging to track the start of ingestion, resource group filtering, lineage resource fetching, and pipeline extraction for better observability.
- Updated execution history logging to clarify the fetching process for factory execution history.

Labels: ingestion (PR or Issue related to the ingestion of metadata), pending-submitter-merge