Conversation

askumar27 (Contributor) commented Dec 9, 2025

feat(ingestion): Add Azure Data Factory connector

📋 Summary

Add a new metadata ingestion connector for Azure Data Factory (ADF) that extracts pipelines, activities, datasets, lineage, and execution history into DataHub.

🎯 Motivation

Azure Data Factory is a widely-used cloud ETL/ELT service for data integration and orchestration. Organizations using ADF need visibility into their data pipelines, lineage, and execution history within DataHub to:

  • Discover and understand data pipelines across their Azure infrastructure
  • Track data lineage from source to destination systems
  • Monitor pipeline execution history and status
  • Enable governance and compliance for data workflows

This connector fills a gap in DataHub's Azure ecosystem coverage, complementing existing Azure connectors (Azure AD, Azure Blob Storage).

🔧 Changes Overview

New Features

  • Azure Data Factory Source Connector - Full metadata extraction from ADF

    • Data Factories → DataHub Containers
    • Pipelines → DataHub DataFlows
    • Activities → DataHub DataJobs (Copy, Data Flow, Lookup, ExecutePipeline, etc.)
    • Datasets → DataHub Datasets (mapped by linked service type)
    • Pipeline Runs → DataHub DataProcessInstances (optional)
  • Table-Level Lineage Extraction

    • Copy Activity input/output dataset mapping
    • Data Flow source/sink extraction
    • Lookup Activity dataset references
    • Platform mapping for 40+ linked service types (Snowflake, SQL, BigQuery, S3, Synapse, etc.); an illustrative excerpt follows this list
  • Pipeline-to-Pipeline Lineage

    • ExecutePipeline activities create lineage to child pipeline's first activity
    • Enables tracing orchestration hierarchies across nested pipelines
    • Direction: ExecutePipeline → child pipeline's first activity
  • Data Flow Script Extraction

    • Stores ADF Data Flow transformation scripts in dataTransformLogic aspect
    • Enables viewing transformation logic in DataHub UI
  • Execution History (Optional)

    • Pipeline run status, duration, and timestamps
    • Trigger information
    • Run parameters
    • Configurable lookback period (1-90 days)
  • Unified Azure Authentication Module

    • Reusable AzureCredentialConfig class for future Azure connectors
    • Supports: Service Principal, Managed Identity, Azure CLI, DefaultAzureCredential
  • Stateful Ingestion

    • Stale entity removal support
    • Tracks previously ingested entities
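
The linked-service platform mapping referenced above translates ADF type names into DataHub platform identifiers. A hypothetical excerpt (keys and values assumed from the examples in this PR; the real map covers 40+ types):

```python
# Illustrative subset only -- entries assumed from the examples in this PR,
# not the connector's actual map.
LINKED_SERVICE_PLATFORM_MAP: dict[str, str] = {
    "AzureSqlDatabase": "mssql",
    "Snowflake": "snowflake",
    "GoogleBigQuery": "bigquery",
    "AmazonS3": "s3",
    "AzureBlobStorage": "abs",  # Azure storage types consolidated under "abs"
}
```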

Dependencies

  • azure-identity>=1.21.0 - Azure authentication
  • azure-mgmt-datafactory>=9.0.0 - ADF Management SDK

🏗️ Architecture/Design Notes

SDK V2 Implementation

This connector uses DataHub SDK V2 (datahub.sdk.Container, datahub.sdk.DataFlow, datahub.sdk.DataJob), following the pattern established by modern connectors. Benefits:

  • Type-safe entity creation
  • Automatic aspect emission (dataPlatformInstance, status, browsePathsV2)
  • Proper browse path hierarchy via parent_container
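
A minimal sketch of this pattern, following the public SDK V2 examples (factory, pipeline, and activity names here are hypothetical; the connector emits these entities through the ingestion framework rather than constructing them ad hoc):

```python
from datahub.sdk import DataFlow, DataJob

# Pipeline -> DataFlow; the "{factory}.{pipeline}" naming mirrors the URN strategy below
flow = DataFlow(
    platform="azure-data-factory",
    name="myfactory.my_pipeline",  # hypothetical factory/pipeline names
    description="Example ADF pipeline",
)

# Activity -> DataJob, attached to its parent flow
job = DataJob(name="CopyFromBlobToSql", flow=flow)  # hypothetical activity name

# Aspects such as status and browsePathsV2 are attached automatically
print(flow.urn, job.urn)
```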

Entity Hierarchy

```
Azure Data Factory (Platform)
└── Data Factory (Container)
    └── Pipeline (DataFlow)
        └── Activity (DataJob)
            └── Pipeline Run (DataProcessInstance) [optional]
```

URN Strategy

Pipeline URNs include factory name for uniqueness across multiple factories:

```
urn:li:dataFlow:(azure-data-factory,{factory_name}.{pipeline_name},{env})
urn:li:dataJob:(urn:li:dataFlow:(azure-data-factory,{factory_name}.{pipeline_name},{env}),{activity_name})
```

This ensures uniqueness when multiple factories have identically-named pipelines.
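
For illustration, the same URNs can be produced with DataHub's builder helpers (example names hypothetical):

```python
from datahub.emitter.mce_builder import make_data_flow_urn, make_data_job_urn

factory_name, pipeline_name, activity_name = "myfactory", "my_pipeline", "CopyData"

flow_urn = make_data_flow_urn("azure-data-factory", f"{factory_name}.{pipeline_name}", "PROD")
# urn:li:dataFlow:(azure-data-factory,myfactory.my_pipeline,PROD)

job_urn = make_data_job_urn("azure-data-factory", f"{factory_name}.{pipeline_name}", activity_name, "PROD")
# urn:li:dataJob:(urn:li:dataFlow:(azure-data-factory,myfactory.my_pipeline,PROD),CopyData)
```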

Lineage Types

| Lineage Type | Source | Target | Mechanism |
|---|---|---|---|
| Dataset → DataJob | Copy Activity inputs | DataJob inlets | `dataJobInputOutput.inputDatasets` |
| DataJob → Dataset | Copy Activity outputs | DataJob outlets | `dataJobInputOutput.outputDatasets` |
| DataJob → DataJob | ExecutePipeline | Child's first activity | `dataJobInputOutput.inputDatajobs` |
| Data Flow Sources | ExecuteDataFlow | Source datasets | Parsed from Data Flow definition |
| Data Flow Sinks | ExecuteDataFlow | Sink datasets | Parsed from Data Flow definition |
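
The first three mechanisms in the table all land in the same aspect. A sketch of emitting it for a hypothetical Copy Activity (dataset names illustrative):

```python
from datahub.emitter.mce_builder import make_data_job_urn, make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.metadata.schema_classes import DataJobInputOutputClass

job_urn = make_data_job_urn("azure-data-factory", "myfactory.my_pipeline", "CopyData", "PROD")

# Hypothetical Copy Activity: reads a blob, writes an Azure SQL table
io_aspect = DataJobInputOutputClass(
    inputDatasets=[make_dataset_urn("abs", "landing/orders.csv", "PROD")],
    outputDatasets=[make_dataset_urn("mssql", "mydb.dbo.orders", "PROD")],
    inputDatajobs=[],  # filled for ExecutePipeline -> child's-first-activity edges
)
mcp = MetadataChangeProposalWrapper(entityUrn=job_urn, aspect=io_aspect)
```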

Reusable Azure Auth

The AzureCredentialConfig class in azure/azure_auth.py is designed to be reused by future Azure connectors (e.g., Azure Synapse, Azure Purview).
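
A minimal sketch of what such a class could look like, assuming pydantic-style validation (field names taken from the recipe examples below; `get_credential` is a hypothetical method name):

```python
from typing import Literal, Optional

from azure.identity import (
    AzureCliCredential,
    ClientSecretCredential,
    DefaultAzureCredential,
    ManagedIdentityCredential,
)
from pydantic import BaseModel


class AzureCredentialConfig(BaseModel):
    authentication_method: Literal[
        "service_principal", "managed_identity", "cli", "default"
    ] = "default"
    client_id: Optional[str] = None
    client_secret: Optional[str] = None
    tenant_id: Optional[str] = None

    def get_credential(self):
        # Pick the azure-identity credential matching the configured method
        if self.authentication_method == "service_principal":
            return ClientSecretCredential(
                tenant_id=self.tenant_id,
                client_id=self.client_id,
                client_secret=self.client_secret,
            )
        if self.authentication_method == "managed_identity":
            return ManagedIdentityCredential(client_id=self.client_id)
        if self.authentication_method == "cli":
            return AzureCliCredential()
        return DefaultAzureCredential()
```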

🧪 Testing

Unit Tests (35 tests)

  • Configuration validation
  • Linked service platform mapping (40+ service types)
  • Activity subtype mapping (20+ activity types)
  • Table/file name extraction logic
  • Run status mapping
  • Resource group extraction

Integration Tests (16 tests, 10 golden files)

| Test Scenario | Description | Golden File |
|---|---|---|
| Basic ingestion | Factories, pipelines, activities | adf_basic_golden.json |
| Execution history | Pipeline runs and status | adf_with_runs_golden.json |
| Platform instance | Multi-instance configuration | adf_platform_instance_golden.json |
| Nested pipelines | ExecutePipeline with children | adf_nested_golden.json |
| ForEach loops | Iterative processing | adf_foreach_golden.json |
| Branching | If-Condition/Switch activities | adf_branching_golden.json |
| Data Flows | Mapping Data Flow lineage | adf_dataflow_golden.json |
| Multi-source ETL | SQL→Blob→Synapse chain | adf_multisource_golden.json |
| Diverse activities | 9+ activity types | adf_diverse_golden.json |
| Mixed dependencies | Pipeline + Dataset lineage | adf_mixed_deps_golden.json |

Test Coverage

  • All tests passing: 51/51 (35 unit + 16 integration)
  • Golden files validate complete MCP output
  • Mocked Azure API responses based on real SDK structures

📊 Impact Assessment

| Aspect | Assessment |
|---|---|
| Affected Components | New connector only; no changes to existing code |
| Breaking Changes | None |
| Performance Impact | N/A (new feature) |
| Risk Level | Low; isolated new connector with comprehensive tests |

🚀 Deployment Notes

Installation

```bash
pip install 'acryl-datahub[azure-data-factory]'
```

Required Configuration

```yaml
source:
  type: azure-data-factory
  config:
    subscription_id: ${AZURE_SUBSCRIPTION_ID}
    credential:
      authentication_method: service_principal
      client_id: ${AZURE_CLIENT_ID}
      client_secret: ${AZURE_CLIENT_SECRET}
      tenant_id: ${AZURE_TENANT_ID}
```

Azure Permissions Required

  • Reader role on Data Factory resources (minimum)
  • Data Factory Contributor for execution history

Platform Instance Usage

Use platform_instance when you have multiple ADF deployments that need to be distinguished:

```yaml
# Scenario: Same pipeline names across dev/prod factories
source:
  type: azure-data-factory
  config:
    subscription_id: ${AZURE_SUBSCRIPTION_ID}
    platform_instance: "production"  # Distinguishes from the dev instance
```

📁 Files Changed

| Path | Description |
|---|---|
| setup.py | Plugin registration and dependencies |
| source/azure/azure_auth.py | Reusable Azure authentication config |
| source/azure_data_factory/adf_source.py | Main connector implementation |
| source/azure_data_factory/adf_config.py | Configuration classes |
| source/azure_data_factory/adf_client.py | Azure API client wrapper |
| source/azure_data_factory/adf_models.py | Pydantic models for API responses |
| source/azure_data_factory/adf_report.py | Ingestion report metrics |
| docs/sources/azure_data_factory/ | User documentation |
| tests/unit/azure_data_factory/ | Unit tests (35 tests) |
| tests/integration/azure_data_factory/ | Integration tests (16 tests, 10 golden files) |

📚 Documentation

✅ Checklist

  • Code follows DataHub SDK V2 patterns
  • Unit tests added (35 tests)
  • Integration tests with golden files (16 tests, 10 golden files)
  • Documentation created
  • Linting passes (ruff, mypy)
  • No breaking changes to existing code
  • Pipeline-to-pipeline lineage implemented
  • Dataset lineage implemented
  • Data Flow script extraction implemented
  • Mixed dependencies test validates both lineage types

🔗 References

…ta ingestion

- Implemented a new connector to extract metadata from Azure Data Factory, including Data Factories, Pipelines, Activities, and Dataset lineage.
- Added support for multiple authentication methods: Service Principal, Managed Identity, Azure CLI, and DefaultAzureCredential.
- Introduced configuration options for filtering factories and pipelines, as well as options for including execution history and lineage extraction.
- Created comprehensive documentation and example recipes for easy setup and usage.
- Added integration and unit tests to ensure functionality and reliability of the connector.
- Added support for Azure Data Factory logos and updated constants for platform identification.
- Implemented pipeline-to-pipeline lineage tracking for ExecutePipeline activities, enabling better visibility of dependencies in the DataHub UI.
- Updated documentation to reflect new features and improved metadata ingestion capabilities.
- Refactored code for better clarity and maintainability, including type definitions for ADF API responses.
- Adjusted test cases to ensure accuracy with the new changes.
alwaysmeticulous bot commented Dec 9, 2025

✅ Meticulous spotted 0 visual differences across 982 screens tested.

Meticulous evaluated ~8 hours of user flows against your PR. Last updated for commit c6943ee.

codecov bot commented Dec 9, 2025

Bundle Report

Changes will increase total bundle size by 92.83kB (0.32%) ⬆️. This is within the configured threshold ✅

Detailed changes
| Bundle name | Size | Change |
|---|---|---|
| datahub-react-web-esm | 28.85MB | 92.83kB (0.32%) ⬆️ |

Affected Assets, Files, and Routes:


Assets Changed:

| Asset Name | Size Change | Total Size | Change (%) |
|---|---|---|---|
| assets/index-*.js | 92.83kB | 19.23MB | 0.49% |

Files in assets/index-*.js:

  • ./src/app/ingest/source/builder/constants.ts → Total Size: 7.08kB

  • ./src/app/ingestV2/source/builder/constants.ts → Total Size: 6.22kB

  • ./src/images/azuredatafactorylogo.svg → Total Size: 1.8kB

…integration

- Implemented support for mixed pipeline and dataset dependencies in Azure Data Factory, allowing for both pipeline-to-pipeline and dataset lineage tracking.
- Updated documentation to reflect new features and improved clarity on lineage extraction.
- Added integration tests to validate the handling of mixed dependencies, ensuring accurate lineage representation in the DataHub UI.
- Refactored existing tests to accommodate new scenarios and ensure comprehensive coverage of ADF functionalities.
… recipes for Azure Data Factory connector

- Introduced detailed documentation for the Azure Data Factory connector, covering metadata extraction, prerequisites, and configuration options.
- Added example recipes to facilitate quick setup and usage of the connector.
- Documented various authentication methods and their configurations, enhancing user guidance.
- Included information on lineage extraction capabilities and entity mapping for better understanding of the integration.
…n tests

- Replace X | Y union syntax with Optional[X] for Python 3.9 compatibility
- Add isinstance checks before accessing source.report for proper type narrowing
- Add missing type annotation for tmp_path parameter
- Add azure-data-factory to full_test_dev_requirements in setup.py
- Ensures azure.mgmt.datafactory is installed during test runs
- Fixes ModuleNotFoundError in unit/integration tests
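
For context, the compatibility change in the first bullet looks like this (illustrative; the helper is hypothetical):

```python
from typing import Optional

# `-> str | None` requires Python 3.10+; Optional[str] also works on Python 3.9.
def first_or_none(items: list) -> Optional[str]:
    return items[0] if items else None
```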
Comment on lines +23 to +26
| **Service Principal** | Production environments | `authentication_method: service_principal` |
| **Managed Identity** | Azure-hosted deployments (VMs, AKS, App Service) | `authentication_method: managed_identity` |
| **Azure CLI** | Local development | `authentication_method: cli` (run `az login` first) |
| **DefaultAzureCredential** | Flexible environments | `authentication_method: default` |
Contributor:

From what I see, only service principal config validation is tested but not the auth itself.
Is there any way we could properly test all authentication mechanisms?

Contributor Author:

Correct - we test configuration validation only (required fields, valid combinations). Testing actual authentication would require:

  1. Live Azure credentials - will be added to connector tests
  2. Mock Azure AD token endpoints (complex) - can give it a shot

Contributor:

maybe this is easier to cover in connector-tests

Contributor Author:

yes, that's the plan :)

```python
MAX_PARAMETER_VALUE_LENGTH = 100  # Truncate long parameter values

# Mapping of ADF linked service types to DataHub platforms
LINKED_SERVICE_PLATFORM_MAP: dict[str, str] = {
```
Contributor:

is there any public doc with all existing platform keys that we can reference here?

Contributor Author:

Updated code comments: added a link to metadata-service/configuration/src/main/resources/bootstrap_mcps/data-platforms.yaml, which is the canonical source of platform identifiers.

Contributor:

I meant public doc for ADF linked services list 😅

Contributor Author:

Not sure if there is one 🤔

sgomezvillamor (Contributor) commented Dec 11, 2025:

Metrics and logs look really good.

As an extra bonus, I just missed:

  • some more fine-grained API call tracking (counters and time)
  • some info/debug logs along the code
  • lineage_edges_extracted could be split for different types of lineage

Contributor Author:

  • Added api_call_counts_by_type: Dict[str, int] for granular tracking
  • Added total_api_response_time_seconds: float for timing
  • Split lineage into dataset_lineage_extracted, pipeline_lineage_extracted, dataflow_lineage_extracted
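
A sketch of how those fields might sit on the connector's report (class name hypothetical; field names taken from the bullets above):

```python
from dataclasses import dataclass, field
from typing import Dict

from datahub.ingestion.api.source import SourceReport


@dataclass
class ADFSourceReport(SourceReport):
    # Per-API-call counters and cumulative timing
    api_call_counts_by_type: Dict[str, int] = field(default_factory=dict)
    total_api_response_time_seconds: float = 0.0
    # Lineage counters split by type
    dataset_lineage_extracted: int = 0
    pipeline_lineage_extracted: int = 0
    dataflow_lineage_extracted: int = 0
```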

Contributor:

super. thanks!

What about "some info/debug logs along the code"? Yesterday I just found a couple of info logs.

Contributor Author:

Added some more :) trying to make sure there's no duplicate info from the CLI report

…king

- Updated linked service mappings to consolidate Azure storage types under a single identifier (`abs`).
- Improved configuration options to enable column lineage and execution history extraction by default.
- Enhanced lineage reporting to differentiate between dataset, pipeline, and dataflow lineage types.
- Refactored API call tracking for better granularity and added support for timing metrics.
- Updated documentation to clarify naming rules, uniqueness handling, and case sensitivity in Azure Data Factory.
- Adjusted integration tests to reflect changes in platform mappings and lineage extraction logic.
…neage caching

- Removed default inclusion of datasets, linked services, and triggers from the Azure Data Factory configuration.
- Updated lineage caching logic to rely on a single `include_lineage` option for better clarity and efficiency.
- Adjusted related documentation to reflect the changes in configuration and caching behavior.
…ced metadata tracking

- Added functionality to emit activity runs as DataProcessInstance entities linked to DataJobs, improving the granularity of execution history.
- Introduced a new method `_emit_activity_runs` to handle the extraction and mapping of activity run properties, including status, duration, and error handling.
- Updated integration tests to validate the extraction of activity runs and their properties, ensuring accurate representation in the DataHub UI.
- Enhanced unit tests to cover activity run property extraction and URN mapping, ensuring robustness in handling various scenarios.
…stion process

- Added detailed logging to track the start of ingestion, resource group filtering, lineage resource fetching, and pipeline extraction for better observability.
- Updated execution history logging to clarify the fetching process for factory execution history.

Labels: ingestion (PR or Issue related to the ingestion of metadata), pending-submitter-merge