This repository contains the code and CloudFormation templates to deploy the solution described in the blog post. The post walks you through the technical implementation details for building federated data governance and analysis platforms using AWS. Whether you are a healthcare organization, technology provider, or systems integrator, this post aims to equip you with the technical knowledge to build a secure data platform for secondary usage of health data in alignment with the EHDS Regulation. At the core of this implementation is Amazon DataZone, a data management service that offers a broad set of fine-grained access controls and data governance configurations out of the box. As illustrated in Figure 1, this enables secure data sharing and collaboration across multiple AWS accounts, teams, and departments.
The EHDS Regulation aims to provide a technology-agnostic framework that enables data sharing across systems. Although connections to external and third-party sources are supported through AWS Glue, integration across different technology stacks requires more customized approaches and is out of scope for this post. Hence, this post outlines how to build a secure data platform for secondary usage of health data across AWS environments.
The EHDS envisions a structured approach to health data management across EU Member States, involving various stakeholders such as ministries of health, regional healthcare agencies, research institutions, healthcare providers, and pharmaceutical companies. The implementation specifics may vary by Member State.
AWS services can support each data journey stage for implementing the EHDS. Initial data handling (extraction, collection, streaming) uses Amazon S3, AWS Glue, and Amazon EventBridge for GDPR-compliant storage and event-driven processing. Data aggregation and transformation rely on AWS Glue and Amazon SageMaker for standardization and pseudonymization, while Amazon DataZone and AWS Identity and Access Management (IAM) enable secure cross-environment trust association and access management.
For secondary use, Amazon Athena and Amazon DataZone facilitate efficient data discovery and querying. DataZone features support provider interactions, governance, and metadata management, while AWS CloudFormation ensures scalable and consistent GDPR-compliant environments through infrastructure as code (IaC). AWS CloudTrail provides comprehensive access monitoring and audit capabilities, with DataZone maintaining data quality tracking and lineage.
In this blog post, we provide two scenarios for implementing this architecture:
NOTE: The architecture and technical implementation are intended for demonstration purposes only.
This represents the primary governance account setup, where an Amazon DataZone domain serves as the foundation for centralized data management. This scenario is better suited for development and testing of the Amazon DataZone capabilities. It creates a comprehensive data management infrastructure.
This represents an extension of the main setup into a secondary AWS account. It maintains similar infrastructure components but operates in a separate account. This enables cross-account data sharing and allows for distributed data management while maintaining centralized discovery and access control.
We will refer to the primary account configured in the single account deployment option as the governance account and the second account configured in the multi-account deployment as the producer account.
The solution enhances the Amazon DataZone domain's centralized data governance with automated workflows for data transformation and publishing. When data is uploaded to an S3 bucket, it triggers an automation flow that crawls the data with an AWS Glue Crawler, adds it to the AWS Glue Data Catalog, and automatically triggers a Data Source run. This allows for the data to be made available for publishing as soon as it is added to an S3 bucket.
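The following is a minimal sketch of how such an automation Lambda function could look. It is not the exact code deployed by the CloudFormation template; the crawler name, domain ID, and data source ID are hypothetical placeholders:

```python
import time
import boto3

glue = boto3.client("glue")
datazone = boto3.client("datazone")

# Hypothetical placeholders -- the real names/IDs are created by the CloudFormation template.
CRAWLER_NAME = "ehds-assets-crawler"
DOMAIN_ID = "dzd_exampledomainid"
DATA_SOURCE_ID = "exampledatasourceid"


def handler(event, context):
    """Invoked by the S3 upload event: crawl the new data, then refresh the DataZone data source."""
    # Register the newly uploaded objects in the AWS Glue Data Catalog.
    glue.start_crawler(Name=CRAWLER_NAME)

    # Poll until the crawler is idle again (simplified; a Step Functions wait state would be more robust).
    while glue.get_crawler(Name=CRAWLER_NAME)["Crawler"]["State"] != "READY":
        time.sleep(15)

    # Trigger a data source run so the new table becomes available for publishing in the project.
    datazone.start_data_source_run(
        domainIdentifier=DOMAIN_ID,
        dataSourceIdentifier=DATA_SOURCE_ID,
    )
```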
We have prepared two CloudFormation templates that provision and configure the necessary AWS services. These templates provide two deployment options, one for each scenario. The instructions are separated into two implementations, depending on your preference and whether you can deploy to multiple accounts.
The following prerequisites are necessary to deploy this solution:
- Access to at least one AWS account
- Administrative IAM permissions for AWS Identity and Access Management (IAM), Amazon DataZone, AWS Lake Formation, Amazon S3, AWS Lambda, and AWS Glue
Note: While the second deployment option uses two AWS accounts, neither deployment option requires an AWS Organizations setup. By default, all data at rest is encrypted using AWS-managed keys through AWS Key Management Service (KMS). You have the flexibility to use customer-managed keys if you prefer more control over your encryption settings.
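If you opt for customer-managed keys, one option is to switch a bucket's default encryption to your own key after deployment. The following is a minimal sketch under that assumption; the bucket name and key ARN are placeholders:

```python
import boto3

s3 = boto3.client("s3")

# Placeholders -- substitute your bucket name and the ARN of your customer-managed KMS key.
BUCKET = "<your-ehds-assets-bucket>"
KMS_KEY_ARN = "arn:aws:kms:<region>:<account-id>:key/<key-id>"

# Set the bucket's default encryption to SSE-KMS with the customer-managed key.
s3.put_bucket_encryption(
    Bucket=BUCKET,
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": KMS_KEY_ARN,
                },
                "BucketKeyEnabled": True,
            }
        ]
    },
)
```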
This section will walk you through the steps to configure the governance account.
- Clone the repository on GitHub using git. You can also manually download the CloudFormation templates from GitHub.
git clone https://github.com/aws-samples/a-technical-guide-to-building-a-data-platform-for-secure-health-data-use.git
- In the governance account, go to the CloudFormation console within your chosen Region. Under the Stacks section, choose the Create Stack dropdown and choose With new resources (standard).
- Leave the prerequisites section with the default settings. Under the Specify template section, choose Upload a template file. Upload the mainaccount.yml file.
- Give a Stack name of your preference.
- Choose Next. Leave everything default and choose the I acknowledge that AWS CloudFormation might create IAM resources checkbox.
- Choose Next. Review the settings and choose Submit.
After a few minutes, the deployment is complete and you should see an Amazon DataZone domain created. You can also find the data portal's URL in the Outputs section of the CloudFormation stack.
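If you prefer a scripted approach, you can also read the portal URL from the stack outputs with the AWS SDK. The following is a minimal sketch; replace the stack name with the one you chose:

```python
import boto3

cfn = boto3.client("cloudformation")

# Placeholder -- use the stack name you chose in the previous steps.
STACK_NAME = "<your-stack-name>"

# Print all stack outputs, including the DataZone data portal URL.
outputs = cfn.describe_stacks(StackName=STACK_NAME)["Stacks"][0]["Outputs"]
for output in outputs:
    print(output["OutputKey"], "=", output["OutputValue"])
```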
The single account deployment is now complete. You can test the solution by moving to the section: Upload and publish data.
If you would like to deploy a multi-account environment, follow this next section for setting up and configuring the multi-account deployment.
This section will walk you through the steps to configure the producer account. You need to have completed the single account deployment steps to follow these instructions.
- In the governance account, navigate to the Amazon SageMaker platform. On the dashboard, choose View existing domains for Amazon DataZone and choose the domain with the prefix: DataZoneDomain-
- On the domain page, scroll down and on the first tab named “Account associations” choose Request association, as shown in the following figure.
- A new page named Associate account opens. Enter the AWS Account ID of the data producer account that you want to associate.
- Under RAM Policy, make sure to choose AWSRAMPermissionDataZonePortalReadWrite. Choose Request Association, as shown in the following figure.
You have now requested to associate the producer account to the domain. This allows members of the associated account to access the domain’s data portal and start publishing data.
- Still in the governance account within the Amazon DataZone domain page, go to the tab User management, as shown in the following figure.
- Choose IAM Users from the dropdown. Copy the ARN of the role that contains DataZoneDomainExecutionRole. You need it for a further step, so paste it in a text editor for now.
Next, you create necessary resources and permissions to run the CloudFormation stack in the producer account.
- Log in to your producer account, ensuring you are in the AWS Region where you deployed the Amazon DataZone domain in the governance account.
- On the Amazon SageMaker Platform console, choose View requests. You will see the association request from the governance account. Choose the request and select Review Request.
- A new page opens named Accept & configure AWS association. Choose Accept new permissions.
- Once the association has been created, choose the associated domain. Copy the IAM role shown under the Data portal URL into your text editor, as shown here.
- Scroll down to the Default Blueprints section. Select the Default Data Lake option and choose Enable. On the next page, enable the option for hybrid mode in the Data location registration section. Leave everything else as default and choose Enable Blueprint.
Next you create an IAM role for a Lambda function allowing it to perform actions on the Amazon DataZone domain.
- Still in the producer account, visit the IAM console. Choose Roles, then Create role.
- Make sure AWS service is chosen. In the Service or use case dropdown, choose Lambda.
- Choose Next. Don’t add any permissions, and choose Next again.
- Give a name to this role, such as Lambda-role. Choose Create role.
- After creation, search for your role and select it. It should not have any permissions for now. Choose Add permissions and Create inline policy.
- Choose the JSON editor view and paste in the following policy. Make sure to replace the ARN with the Amazon DataZone domain execution role that you copied in Step 6.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "sts:AssumeRole",
            "Resource": "<DataZone-domain-execution-role-arn>"
        }
    ]
}
- Give the Policy a name and choose Create policy. Lastly, copy the ARN of this newly created IAM role for your Lambda function and paste it into your text editor.
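The role and inline policy created above can also be scripted from the producer account. The following is a minimal sketch; the role and policy names are illustrative, and the execution role ARN is the one you copied earlier:

```python
import json
import boto3

iam = boto3.client("iam")

# Illustrative names; substitute the DataZone domain execution role ARN you copied earlier.
ROLE_NAME = "Lambda-role"
POLICY_NAME = "AssumeDataZoneExecutionRole"
DATAZONE_EXECUTION_ROLE_ARN = "<DataZone-domain-execution-role-arn>"

# Trust policy allowing AWS Lambda to assume this role.
lambda_trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "lambda.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

role = iam.create_role(
    RoleName=ROLE_NAME,
    AssumeRolePolicyDocument=json.dumps(lambda_trust_policy),
)

# Inline policy allowing the Lambda function to assume the DataZone domain execution role.
iam.put_role_policy(
    RoleName=ROLE_NAME,
    PolicyName=POLICY_NAME,
    PolicyDocument=json.dumps(
        {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Action": "sts:AssumeRole",
                    "Resource": DATAZONE_EXECUTION_ROLE_ARN,
                }
            ],
        }
    ),
)

# Keep this ARN -- it is used for the trust policy update and the LambdaRoleArn stack parameter.
print(role["Role"]["Arn"])
```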
You must make sure the producer account can access the domain, as the portal’s URL is still greyed out.
- Log back in to your governance account and visit the Amazon DataZone console. Go to the Amazon DataZone domain, and under the User management tab, choose Add, then Add IAM Users, as shown in the following figure.
- Next, choose the IAM Account option, Associated account, and paste the IAM Role ARN copied from the Amazon DataZone Data portal dashboard in Step 10. Choose Add then Add user(s).
- Visit the IAM console, choose Roles, and search for the role with -DataZoneDomainExecutionRole in its name (the role whose ARN you copied in Step 6).
- Choose the tab: Trust relationships and Edit trust policy.
- Under Principal, add the following: "AWS": "<Lambda-role-arn>", where <Lambda-role-arn> is the IAM Role ARN you copied in Step 17. Your trust policy should now look like the following:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": [
                    "datazone.amazonaws.com",
                    "lakeformation.amazonaws.com",
                    "lambda.amazonaws.com"
                ],
                "AWS": "<Lambda-role-arn>"
            },
            "Action": [
                "sts:AssumeRole",
                "sts:TagSession"
            ]
        }
    ]
}
- Choose Update policy.
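Equivalently, the trust policy update can be scripted from the governance account. The following is a minimal sketch; the role name and Lambda role ARN are placeholders:

```python
import json
import boto3

iam = boto3.client("iam")

# Placeholders: the domain execution role's name and the Lambda role ARN from the producer account.
EXECUTION_ROLE_NAME = "<name-of-DataZoneDomainExecutionRole>"
LAMBDA_ROLE_ARN = "<Lambda-role-arn>"

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": [
                    "datazone.amazonaws.com",
                    "lakeformation.amazonaws.com",
                    "lambda.amazonaws.com",
                ],
                "AWS": LAMBDA_ROLE_ARN,
            },
            "Action": ["sts:AssumeRole", "sts:TagSession"],
        }
    ],
}

# Overwrite the role's trust policy with the updated document.
iam.update_assume_role_policy(
    RoleName=EXECUTION_ROLE_NAME,
    PolicyDocument=json.dumps(trust_policy),
)
```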
You are now ready to deploy the stack in the producer account! In the producer account, go to CloudFormation and follow the same steps as for the main account deployment. Make sure to deploy in the same Region as the domain. This time, upload the secondaccount.yml template. This template needs the following parameters (a scripted alternative is sketched after the parameter list):
- AssumeRoleArn: The ARN of the DataZoneDomainExecution role copied in Step 6.
- DataZoneDomainId: The ID of the Amazon DataZone domain. You can find it in the Amazon DataZone console. The ARN of the domain is structured as follows: arn:aws:datazone:REGION:ACCOUNT_ID:domain/DOMAIN_ID
- HomeRegion: The Region of the Amazon DataZone domain.
- LambdaRoleArn: The ARN of the Lambda role copied in Step 17.
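The following is a minimal sketch of that scripted alternative, assuming the secondaccount.yml template is available locally; all parameter values are placeholders for the ARNs and IDs collected in the previous steps:

```python
import boto3

cfn = boto3.client("cloudformation")

# Read the producer account template from the cloned repository.
with open("secondaccount.yml") as f:
    template_body = f.read()

cfn.create_stack(
    StackName="ehds-producer-stack",  # illustrative name
    TemplateBody=template_body,
    # Acknowledge that the stack creates IAM resources.
    Capabilities=["CAPABILITY_IAM", "CAPABILITY_NAMED_IAM"],
    Parameters=[
        {"ParameterKey": "AssumeRoleArn", "ParameterValue": "<DataZone-domain-execution-role-arn>"},
        {"ParameterKey": "DataZoneDomainId", "ParameterValue": "<domain-id>"},
        {"ParameterKey": "HomeRegion", "ParameterValue": "<region>"},
        {"ParameterKey": "LambdaRoleArn", "ParameterValue": "<Lambda-role-arn>"},
    ],
)
```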
Wait for the stack deployment to complete. The stack deploys the previously described automation components. You can now start testing the solution!
If you are following the single account setup, then complete the following steps within your governance account. For the multi-account setup, complete the following steps in the producer (second) account:
- In the Console search bar, enter "S3" to navigate to the Amazon S3 section of the Console.
- Choose the bucket with the suffix ehds-assets.
- Choose Upload and choose Add folder. Choose the data inside the medical_assets folder. Choose Upload. (Alternatively, the upload can be scripted, as shown below.)
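The following is a minimal sketch of that scripted upload; the bucket name is a placeholder for the ehds-assets bucket created by the stack, and medical_assets is the folder from the cloned repository:

```python
from pathlib import Path

import boto3

s3 = boto3.client("s3")

# Placeholder -- the bucket created by the stack ends with "ehds-assets".
BUCKET = "<your-bucket-name-ending-in-ehds-assets>"
LOCAL_FOLDER = Path("medical_assets")

# Upload every file in the folder, keeping the folder name as the object key prefix.
for path in LOCAL_FOLDER.rglob("*"):
    if path.is_file():
        s3.upload_file(str(path), BUCKET, path.as_posix())
        print(f"Uploaded {path}")
```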
After uploading the data, navigate back to the Amazon SageMaker platform and to the DataZone associated domains dashboard.
- In the Amazon DataZone domain, choose the link Data portal URL. This opens the Amazon DataZone domain’s Data portal.
- A project has been pre-created for you. Choose the dropdown next to Select Project and choose the Project with the prefix EHDS-.
- In the Project’s dashboard page, go to Data Sources. Choose the DataSource named ehds-assets.
- The data source run has successfully picked up the AWS Glue table created from the data you uploaded to the ehds-assets S3 bucket. Choose the asset link listed under Asset Name, as shown in the following figure.
- On the asset's summary page, you see details such as metadata, lineage, and data quality. The data source was pre-configured to auto-publish the data asset.
In this section, we view the catalog capability of Amazon DataZone, which allows you to search published data assets and subscribe to them.
- On the top bar of the Amazon DataZone Platform, choose the Catalog icon, as shown in the following figure.
- Within the Catalog view, choose the Amazon DataZone Domain from the left menu bar and view the data asset that has been published to the Domain's metadata catalog.
You can further filter by attributes such as data and asset type, owning project, source Region and account, and domain unit.
If you have followed the multi-account deployment, then you can filter by source account. Choose the account ID of the account from which you have uploaded and published the data assets. Then, you can query the data with Amazon Athena.
- Subscribe to the data asset by choosing the Subscribe button.
- On the subscription page, enter a reason for subscribing, leave everything else as default, and choose Request. The approval process has been automated in this solution, so your subscription request is auto-approved. By default, the owner of the data asset would need to manually approve each request.
- Move back to the project overview page by choosing the project name on the top menu bar.
- Within the project you have analytical tools listed on the right side of the overview page. Choose the option Query data (Amazon Athena), as shown in the following figure.
- Choose Open Amazon Athena in the pop-up message.
Within the Athena Console, follow these steps to query the data to which you have subscribed:
- From the left option bar, choose the Database with the suffix _sub_db. This lists the data assets to which you have subscribed.
- Run SQL queries to view the data to which you have subscribed.
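For example, a query against the subscribed data can be run through the Athena SDK as in the following sketch; the database, table, and query result location are hypothetical placeholders:

```python
import boto3

athena = boto3.client("athena")

# Hypothetical placeholders -- use the database with the _sub_db suffix and your subscribed table name.
DATABASE = "<project>_sub_db"
QUERY = "SELECT * FROM <subscribed_table> LIMIT 10;"
OUTPUT_LOCATION = "s3://<your-athena-query-results-bucket>/"

# Start the query; results land in the configured S3 output location.
athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": DATABASE},
    ResultConfiguration={"OutputLocation": OUTPUT_LOCATION},
)
```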
By following these steps you have successfully uploaded, published, and subscribed to data assets, and queried the subscribed data using Athena.
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.