feat(azure-data-factory): add Azure Data Factory connector #15499
base: master
Conversation
…ta ingestion
- Implemented a new connector to extract metadata from Azure Data Factory, including Data Factories, Pipelines, Activities, and Dataset lineage.
- Added support for multiple authentication methods: Service Principal, Managed Identity, Azure CLI, and DefaultAzureCredential.
- Introduced configuration options for filtering factories and pipelines, as well as options for including execution history and lineage extraction.
- Created comprehensive documentation and example recipes for easy setup and usage.
- Added integration and unit tests to ensure functionality and reliability of the connector.
- Added support for Azure Data Factory logos and updated constants for platform identification.
- Implemented pipeline-to-pipeline lineage tracking for ExecutePipeline activities, enabling better visibility of dependencies in the DataHub UI.
- Updated documentation to reflect new features and improved metadata ingestion capabilities.
- Refactored code for better clarity and maintainability, including type definitions for ADF API responses.
- Adjusted test cases to ensure accuracy with the new changes.
✅ Meticulous spotted 0 visual differences across 982 screens tested. Meticulous evaluated ~8 hours of user flows against your PR. Last updated for commit c6943ee.
Bundle Report
Changes will increase total bundle size by 92.83kB (0.32%) ⬆️. This is within the configured threshold ✅
Affected bundle: datahub-react-web-esm
…integration
- Implemented support for mixed pipeline and dataset dependencies in Azure Data Factory, allowing for both pipeline-to-pipeline and dataset lineage tracking.
- Updated documentation to reflect new features and improved clarity on lineage extraction.
- Added integration tests to validate the handling of mixed dependencies, ensuring accurate lineage representation in the DataHub UI.
- Refactored existing tests to accommodate new scenarios and ensure comprehensive coverage of ADF functionalities.
… recipes for Azure Data Factory connector
- Introduced detailed documentation for the Azure Data Factory connector, covering metadata extraction, prerequisites, and configuration options.
- Added example recipes to facilitate quick setup and usage of the connector.
- Documented various authentication methods and their configurations, enhancing user guidance.
- Included information on lineage extraction capabilities and entity mapping for better understanding of the integration.
…n tests
- Replace `X | Y` union syntax with `Optional[X]` for Python 3.9 compatibility
- Add `isinstance` checks before accessing `source.report` for proper type narrowing
- Add missing type annotation for `tmp_path` parameter
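The two typing changes named in this commit can be illustrated with a small sketch. The names `SourceReport`, `AdfReport`, `pipelines_scanned`, and `describe_report` are hypothetical stand-ins for illustration, not the PR's actual code.

```python
from typing import Optional


class SourceReport:
    """Hypothetical base report type (stand-in for illustration)."""


class AdfReport(SourceReport):
    """Hypothetical connector-specific report with one extra counter."""

    def __init__(self) -> None:
        self.pipelines_scanned = 0


def describe_report(report: Optional[SourceReport] = None) -> str:
    # Optional[X] works on Python 3.9. An evaluated `X | None` annotation
    # needs Python 3.10+ (PEP 604).
    if report is None:
        return "no report"
    # isinstance narrows the static type, so the subclass-only attribute
    # below passes type checking.
    if isinstance(report, AdfReport):
        return f"pipelines scanned: {report.pipelines_scanned}"
    return "generic report"
```

Without the `isinstance` check, a type checker would only know the value is a `SourceReport` and would flag the access to `pipelines_scanned`.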
- Add `azure-data-factory` to `full_test_dev_requirements` in `setup.py`
- Ensures `azure.mgmt.datafactory` is installed during test runs
- Fixes `ModuleNotFoundError` in unit/integration tests
metadata-ingestion/docs/sources/azure-data-factory/azure-data-factory_pre.md
| **Service Principal** | Production environments | `authentication_method: service_principal` |
| **Managed Identity** | Azure-hosted deployments (VMs, AKS, App Service) | `authentication_method: managed_identity` |
| **Azure CLI** | Local development | `authentication_method: cli` (run `az login` first) |
| **DefaultAzureCredential** | Flexible environments | `authentication_method: default` |
From what I see, only service principal config validation is tested but not the auth itself.
Is there any way we could properly test all authentication mechanisms?
Correct - we test configuration validation only (required fields, valid combinations). Testing actual authentication would require:
- Live Azure credentials - will be added to connector tests
- Mock Azure AD token endpoints (complex) - can give it a shot
maybe this is easier to cover in connector-tests
yes, that's the plan :)
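For readers following this thread, the kind of configuration validation being discussed (required fields, valid combinations) could look roughly like the sketch below. The four method names come from the docs table above. The class name `AuthConfigSketch` and the field names `client_id`, `client_secret`, and `tenant_id` are assumptions for illustration and may not match the PR's `AzureCredentialConfig`.

```python
from dataclasses import dataclass
from typing import Optional

# Method names as listed in the docs table; everything else is assumed.
VALID_METHODS = {"service_principal", "managed_identity", "cli", "default"}


@dataclass
class AuthConfigSketch:
    """Hypothetical config object; not the PR's actual class."""

    authentication_method: str = "default"
    client_id: Optional[str] = None
    client_secret: Optional[str] = None
    tenant_id: Optional[str] = None

    def validate(self) -> None:
        if self.authentication_method not in VALID_METHODS:
            raise ValueError(
                f"Unknown authentication_method: {self.authentication_method}"
            )
        # A service principal needs all three identifiers to request a token.
        if self.authentication_method == "service_principal":
            required = ("client_id", "client_secret", "tenant_id")
            missing = [name for name in required if not getattr(self, name)]
            if missing:
                raise ValueError(
                    "service_principal requires: " + ", ".join(missing)
                )
```

A check like this can be unit tested without any Azure access, which is why it is the part covered here, while exercising real credentials is left to the connector tests.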
metadata-ingestion/src/datahub/ingestion/source/azure_data_factory/adf_config.py
metadata-ingestion/src/datahub/ingestion/source/azure_data_factory/adf_models.py
MAX_PARAMETER_VALUE_LENGTH = 100  # Truncate long parameter values

# Mapping of ADF linked service types to DataHub platforms
LINKED_SERVICE_PLATFORM_MAP: dict[str, str] = {
is there any public doc with all existing platform keys that we can reference here?
Updated the code comments: added a link to `metadata-service/configuration/src/main/resources/bootstrap_mcps/data-platforms.yaml`, which is the canonical source of platform identifiers.
I meant public doc for ADF linked services list 😅
Not sure if there is one 🤔
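To make the two constants from the diff concrete, here is a minimal sketch of how such a mapping and length limit might be used. The map entries are illustrative assumptions, not the PR's actual contents (the real dictionary is cut off in the snippet above); only the consolidation of Azure storage types under `abs` is stated elsewhere in this PR. The helper names and the `...` truncation suffix are also hypothetical.

```python
from typing import Dict, Optional

MAX_PARAMETER_VALUE_LENGTH = 100  # same limit as in the diff

# Illustrative entries only; the PR's real map is not shown in full.
LINKED_SERVICE_PLATFORM_MAP_SKETCH: Dict[str, str] = {
    "AzureBlobStorage": "abs",
    "AzureBlobFS": "abs",
    "AzureSqlDatabase": "mssql",
    "Snowflake": "snowflake",
}


def resolve_platform(linked_service_type: str) -> Optional[str]:
    """Return the DataHub platform id, or None for an unmapped type."""
    return LINKED_SERVICE_PLATFORM_MAP_SKETCH.get(linked_service_type)


def truncate_parameter_value(value: str) -> str:
    """Cut long values at the limit; the '...' suffix is an assumption."""
    if len(value) <= MAX_PARAMETER_VALUE_LENGTH:
        return value
    return value[:MAX_PARAMETER_VALUE_LENGTH] + "..."
```

Returning `None` for unknown types lets the caller decide whether to skip lineage for that dataset or report a warning.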
Metrics and logs look really good.
As an extra bonus, I just missed:
- some more fine-grained API call tracking (counters and time)
- some info/debug logs along the code
- `lineage_edges_extracted` could be split for different types of lineage
- Added `api_call_counts_by_type: Dict[str, int]` for granular tracking
- Added `total_api_response_time_seconds: float` for timing
- Split lineage into `dataset_lineage_extracted`, `pipeline_lineage_extracted`, `dataflow_lineage_extracted`
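A minimal sketch of a report carrying these fields is shown below. The field names and the two annotated types come from the comment above. The class name `AdfReportSketch`, the `int` type of the lineage counters, and the `track_api_call` helper are assumptions for illustration.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Dict, Iterator


@dataclass
class AdfReportSketch:
    """Hypothetical report; only the field names come from the PR thread."""

    api_call_counts_by_type: Dict[str, int] = field(default_factory=dict)
    total_api_response_time_seconds: float = 0.0
    dataset_lineage_extracted: int = 0
    pipeline_lineage_extracted: int = 0
    dataflow_lineage_extracted: int = 0

    @contextmanager
    def track_api_call(self, call_type: str) -> Iterator[None]:
        """Count one call of `call_type` and add its elapsed time."""
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the call even if the wrapped request raised.
            self.api_call_counts_by_type[call_type] = (
                self.api_call_counts_by_type.get(call_type, 0) + 1
            )
            self.total_api_response_time_seconds += time.perf_counter() - start
```

Wrapping each client request in `with report.track_api_call("pipelines.list"):` keeps the counting and timing in one place instead of scattering it across call sites.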
Super, thanks!
What about
> some info/debug logs along the code

Yesterday I only found a couple of info logs.
Added some more :) trying to make sure they don't duplicate info from the CLI report
metadata-ingestion/src/datahub/ingestion/source/azure_data_factory/adf_source.py
…king
- Updated linked service mappings to consolidate Azure storage types under a single identifier (`abs`).
- Improved configuration options to enable column lineage and execution history extraction by default.
- Enhanced lineage reporting to differentiate between dataset, pipeline, and dataflow lineage types.
- Refactored API call tracking for better granularity and added support for timing metrics.
- Updated documentation to clarify naming rules, uniqueness handling, and case sensitivity in Azure Data Factory.
- Adjusted integration tests to reflect changes in platform mappings and lineage extraction logic.
…neage caching
- Removed default inclusion of datasets, linked services, and triggers from the Azure Data Factory configuration.
- Updated lineage caching logic to rely on a single `include_lineage` option for better clarity and efficiency.
- Adjusted related documentation to reflect the changes in configuration and caching behavior.
…ced metadata tracking
- Added functionality to emit activity runs as DataProcessInstance entities linked to DataJobs, improving the granularity of execution history.
- Introduced a new method `_emit_activity_runs` to handle the extraction and mapping of activity run properties, including status, duration, and error handling.
- Updated integration tests to validate the extraction of activity runs and their properties, ensuring accurate representation in the DataHub UI.
- Enhanced unit tests to cover activity run property extraction and URN mapping, ensuring robustness in handling various scenarios.
…stion process
- Added detailed logging to track the start of ingestion, resource group filtering, lineage resource fetching, and pipeline extraction for better observability.
- Updated execution history logging to clarify the fetching process for factory execution history.
feat(ingestion): Add Azure Data Factory connector
📋 Summary
Add a new metadata ingestion connector for Azure Data Factory (ADF) that extracts pipelines, activities, datasets, lineage, and execution history into DataHub.
🎯 Motivation
Azure Data Factory is a widely-used cloud ETL/ELT service for data integration and orchestration. Organizations using ADF need visibility into their data pipelines, lineage, and execution history within DataHub to:
This connector fills a gap in DataHub's Azure ecosystem coverage, complementing existing Azure connectors (Azure AD, Azure Blob Storage).
🔧 Changes Overview
New Features
Azure Data Factory Source Connector - Full metadata extraction from ADF
Table-Level Lineage Extraction
Pipeline-to-Pipeline Lineage
- `ExecutePipeline` → `ChildFirstActivity`
Data Flow Script Extraction
- `dataTransformLogic` aspect
Execution History (Optional)
Unified Azure Authentication Module
- `AzureCredentialConfig` class for future Azure connectors
Stateful Ingestion
Dependencies
- `azure-identity>=1.21.0` - Azure authentication
- `azure-mgmt-datafactory>=9.0.0` - ADF Management SDK
🏗️ Architecture/Design Notes
SDK V2 Implementation
This connector uses DataHub SDK V2 (`datahub.sdk.Container`, `datahub.sdk.DataFlow`, `datahub.sdk.DataJob`), following the pattern established by modern connectors. Benefits:
- … (`dataPlatformInstance`, `status`, `browsePathsV2`)
- … `parent_container`
Entity Hierarchy
URN Strategy
Pipeline URNs include factory name for uniqueness across multiple factories:
This ensures uniqueness when multiple factories have identically-named pipelines.
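A rough sketch of the idea, assuming the flow id is formed as `<factory>.<pipeline>` and the platform id is `azure-data-factory`. Both are assumptions, as is the helper name `make_pipeline_flow_urn`. Only the general `urn:li:dataFlow:(orchestrator,flow_id,cluster)` shape is standard DataHub.

```python
def make_pipeline_flow_urn(
    factory_name: str,
    pipeline_name: str,
    env: str = "PROD",
    platform: str = "azure-data-factory",  # assumed platform id
) -> str:
    """Build a dataFlow URN whose flow id is prefixed with the factory name."""
    # Assumed separator: the PR's actual format is not shown on this page.
    flow_id = f"{factory_name}.{pipeline_name}"
    return f"urn:li:dataFlow:({platform},{flow_id},{env})"


# Two factories with an identically named pipeline get distinct URNs.
urn_a = make_pipeline_flow_urn("factory-a", "daily_load")
urn_b = make_pipeline_flow_urn("factory-b", "daily_load")
```

Without the factory prefix, both pipelines would collapse into a single entity in DataHub.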
Lineage Types
- `dataJobInputOutput.inputDatasets`
- `dataJobInputOutput.outputDatasets`
- `dataJobInputOutput.inputDatajobs`
Reusable Azure Auth
The `AzureCredentialConfig` class in `azure/azure_auth.py` is designed to be reused by future Azure connectors (e.g., Azure Synapse, Azure Purview).
🧪 Testing
Unit Tests (35 tests)
Integration Tests (16 tests, 10 golden files)
- `adf_basic_golden.json`
- `adf_with_runs_golden.json`
- `adf_platform_instance_golden.json`
- `adf_nested_golden.json`
- `adf_foreach_golden.json`
- `adf_branching_golden.json`
- `adf_dataflow_golden.json`
- `adf_multisource_golden.json`
- `adf_diverse_golden.json`
- `adf_mixed_deps_golden.json`
Test Coverage
📊 Impact Assessment
🚀 Deployment Notes
Installation
pip install 'acryl-datahub[azure-data-factory]'
Required Configuration
Azure Permissions Required
Platform Instance Usage
Use `platform_instance` when you have multiple ADF deployments that need to be distinguished:
📁 Files Changed
- `setup.py`
- `source/azure/azure_auth.py`
- `source/azure_data_factory/adf_source.py`
- `source/azure_data_factory/adf_config.py`
- `source/azure_data_factory/adf_client.py`
- `source/azure_data_factory/adf_models.py`
- `source/azure_data_factory/adf_report.py`
- `docs/sources/azure_data_factory/`
- `tests/unit/azure_data_factory/`
- `tests/integration/azure_data_factory/`
📚 Documentation
✅ Checklist
🔗 References