The EGA Submissions repository serves as a comprehensive solution for orchestrating submissions to the European Genome-Phenome Archive (EGA). It encompasses three distinct Workflow Description Language (WDL) scripts, each responsible for a crucial step in the submission process:
- File Transfer WDL: This script is designed for submitting data files to the EGA. It manages the efficient and secure transfer of genomic data to the archive, ensuring a seamless submission process.
- Experiment and Run Registration WDL: Responsible for registering the experiment and run details associated with the submitted data. This step ensures accurate tracking and metadata association, enhancing the organization of genomic information within the EGA.
- Dataset Finalization WDL: This script handles the finalization of datasets, ensuring completeness and integrity. It encompasses any necessary post-submission steps, providing a comprehensive end-to-end solution for EGA submissions.
The workflow execution follows a sequential order:
Step 1: File Transfer
- Initiates the secure transfer of genomic data files to the EGA.

Step 2: Experiment and Run Registration
- Registers essential metadata related to the experiment and run, facilitating proper organization within the EGA.

Step 3: Dataset Finalization
- Completes the submission process by finalizing datasets and performing any post-submission tasks.
This repository equips users with the necessary tools and resources to streamline the submission of genomic data to the EGA, ensuring a well-organized and efficient workflow from file transfer to dataset finalization.
- Workflows: This directory houses all the WDLs responsible for sample submission (e.g., EGAFileTransfer.wdl, RegisterEGAExperimentAndRun.wdl, and EGADatasetFinalizeSubmission.wdl).
- Scripts: This directory contains all the Python code responsible for delivering samples and registering metadata using the EGA APIs.
There are two ways to deploy changes to this repository, depending on where the changes are made.
To push Python code changes to GCR, you'll need the gcloud CLI (Command Line Interface) installed. See directions here.
All Python code is contained in a Docker image, which is stored in Google Container Registry. The Terra workflows then pull this Docker image in the WDLs.
With every change to the Python code, you will need to rebuild and push the Docker image. You will need write access to the Container Registry in the `sc-ega-submissions` project. Reach out to Sam Bryant if you need these permissions.
- If you're not already logged in via gcloud, run `gcloud auth login` first and log in with your Broad account. See directions here for installing the gcloud CLI. See directions here for authenticating with gcloud.
- If you've not already configured Docker to work with the gcloud CLI, run `gcloud auth configure-docker us-east1-docker.pkg.dev` to ensure authentication. See here for more information.
To rebuild and push the Docker images, run the docker_build.sh script from the root of the repository:
./docker_build.sh
Sometimes it is helpful to view the contents of the Docker image. To do this, pull the image and open an interactive shell in a container:
docker pull us-east1-docker.pkg.dev/sc-ega-submissions/ega-submission-scripts/python-scripts:0.0.1-<tag>
docker run -it us-east1-docker.pkg.dev/sc-ega-submissions/ega-submission-scripts/python-scripts:0.0.1-<tag>
Once you've built and pushed your Docker image, you'll have to find the Docker tag and update all the WDL workflows to use the new tag. You'll see the new Docker tag in the output of the console when you build the image.
Alternatively, you can navigate to GCR via the Google console here. Select the image name (in this case `ega-submission-scripts`), select `python-scripts`, and copy the latest tag. See the screenshot below. In this example, the tag you'll want to copy is `0.0.1-1709154068`.
Once you have the new tag, update the tag in ALL the WDL files where there is a Docker image in the runtime attributes. The part you'll want to replace is everything after the `:`.
An example of a runtime attribute that would need to be changed is here.
For example, the old tag would be changed to the new tag as follows:
OLD: us-east1-docker.pkg.dev/sc-ega-submissions/ega-submission-scripts/python-scripts:0.0.1-1708546220
NEW: us-east1-docker.pkg.dev/sc-ega-submissions/ega-submission-scripts/python-scripts:0.0.1-1709154068
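If you have several WDLs to update, a small helper script can do the replacement in one pass. This is a minimal sketch, not part of the repository, assuming the WDLs live under the workflows/ directory and that you pass the old and new tags on the command line:

```python
#!/usr/bin/env python3
"""Illustrative helper: swap the Docker image tag in every WDL under workflows/."""
import sys
from pathlib import Path

# Image path taken from the runtime attributes shown above.
IMAGE = "us-east1-docker.pkg.dev/sc-ega-submissions/ega-submission-scripts/python-scripts"

def main(old_tag: str, new_tag: str) -> None:
    for wdl in Path("workflows").glob("*.wdl"):
        text = wdl.read_text()
        updated = text.replace(f"{IMAGE}:{old_tag}", f"{IMAGE}:{new_tag}")
        if updated != text:
            wdl.write_text(updated)
            print(f"Updated {wdl}")

if __name__ == "__main__":
    # e.g. python update_docker_tag.py 0.0.1-1708546220 0.0.1-1709154068
    main(sys.argv[1], sys.argv[2])
```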
Once you've made this change, push the changes to the GitHub repository. See Workflow Updates for how to implement changes in WDL files.
Making any changes inside the workflows directory (i.e. to any WDL files) is straightforward. Simply push your changes to the GitHub repository. Since we are using GitHub Apps, these changes will be automatically loaded into Terra. If you'd like to test a feature branch, select your feature branch when running the workflow in Terra (see screenshot below). Alternatively, if you've merged your changes into `main`, ensure that `main` is selected as the branch (this should be the default branch).
- Register your EGA account and receive credentials (i.e. a username/inbox and password combination)
- Submit a request to receive a "Submitter" role status
- Ensure your Broad email is added as a member to the `SC-EGA-SUBMISSIONS` Terra group (the group email is [email protected]). Reach out to Sam Bryant to be added to this group.
- Create an initial "parent" Submission (once your request for the "Submitter" role is approved, you'll see an option in the Submitter portal to "Create a Submission")
- Submit at least one Study (see detailed instructions here)
- Upload a cohort of Samples
- Register your DAC/Policy
- Fill out this form, including the same samples from your Study in the TSV that you upload to the form. Filling out this form will create a new Terra workspace, import all the necessary metadata into the data tables, and import all necessary workflows. All workflows will be pre-configured with inputs, so little needs to be changed manually.
- Once the ticket generated by the form is auto-resolved, your workspace is ready to go.
- Open your Terra workspace and navigate to the "Workflows" tab.
- The first workflow to be run is the `egaFileTransfer` workflow, which will encrypt and upload your bam/cram files to the EGA.
  - All workflow inputs and outputs are pre-configured here, and nothing has to be changed.
  - Ensure that `Run workflow(s) with inputs defined by data table` is selected.
  - Ensure that the root entity type selected is `sample`.
  - Ensure that `Use call caching` is UNSELECTED.
  - Ensure that `Delete intermediate outputs` is selected.
  - Now click `SELECT DATA`, and from the little arrow at the top left box, select `All`, click `OK`, then `RUN ANALYSIS` to submit your run(s).
  - This workflow runs at the sample level (i.e. if you have 100 samples in your workspace, 100 analyses will be submitted).
  - Though encrypting and uploading the files does not take very long, the EGA validation of files may take a while. It's best to submit this workflow, ensure it finishes, and wait at least a day before moving on to the next workflow.
- The second workflow to be run is the `registerEGAExperimentAndRuns` workflow, which will register an experiment and run for each sample in the EGA submission. Only run this workflow once all samples have run to successful completion in the previous step. Note that this workflow will not complete unless the file has been validated on EGA's end. If the file has been validated, the metadata in the workspace will be updated to indicate its valid status and the workflow will continue (you'll see `validated` show up under the `file_validation_status` column in the workspace metadata; see the programmatic status-check sketch after this list). If not, the workflow will exit. If a file has not been validated, you will have to wait and resubmit at a later date, since we have no control over how long it takes for EGA to validate a file.
  - All workflow inputs and outputs are pre-configured here. The only parameter that you might want to change is the `delete_files` parameter. This is set to false by default, but setting it to true will delete the bam/cram after the file has been validated and the experiment/run have been registered in the EGA.
  - Ensure that `Run workflow(s) with inputs defined by data table` is selected.
  - Ensure that the root entity type selected is `sample`.
  - Ensure that `Use call caching` is UNSELECTED.
  - Ensure that `Delete intermediate outputs` is selected.
  - Now click `SELECT DATA`, and from the little arrow at the top left box, select `All`, click `OK`, then `RUN ANALYSIS` to submit your run(s).
  - This workflow runs at the sample level (i.e. if you have 100 samples in your workspace, 100 analyses will be submitted).
- The third and last workflow to be run is the `registerEGADataset` workflow. Only run this workflow once all samples have run to successful completion in the previous step.
  - All workflow inputs and outputs are pre-configured here, and nothing has to be changed.
  - Ensure that `Run workflow(s) with inputs defined by data table` is selected.
  - Ensure that the root entity type selected is `sample_set`.
  - Ensure that `Use call caching` is UNSELECTED.
  - Ensure that `Delete intermediate outputs` is selected.
  - Now click `SELECT DATA` and choose the sample set that includes ALL of your samples (i.e. if you have 100 samples, select the sample set that indicates that it includes 100 entities). Select `OK`, then `RUN ANALYSIS` to submit your run (there should only be 1 in this case!).
  - This workflow runs at the sample-set level (i.e. even if you have 100 samples in your workspace, only 1 analysis should be submitted).
- Once all 3 workflows have run to completion, your submission is finalized! You'll be able to view it in the Submitter Portal in the EGA. All metadata can be changed via the UI if desired.
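If you'd like to check file validation status programmatically rather than in the Terra data table UI (for example, before launching the second workflow), a minimal sketch using the FISS Python client (the `firecloud` package) is shown below. The workspace namespace and name are placeholders, and the attribute name is the `file_validation_status` column described above:

```python
# Illustrative only: list samples whose files EGA has not yet marked "validated",
# using the FISS client for the Terra API. Workspace names are placeholders.
from firecloud import api as fapi

WORKSPACE_NAMESPACE = "my-terra-billing-project"  # placeholder
WORKSPACE_NAME = "my-ega-submission-workspace"    # placeholder

response = fapi.get_entities(WORKSPACE_NAMESPACE, WORKSPACE_NAME, "sample")
response.raise_for_status()

not_validated = [
    entity["name"]
    for entity in response.json()
    if entity["attributes"].get("file_validation_status") != "validated"
]
print(f"{len(not_validated)} sample(s) still awaiting EGA validation: {not_validated}")
```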
All EGA passwords are now stored in Google's Secret Manager in the `sc-ega-submissions` Google project. By default, our Python code looks in Secret Manager for a secret that's stored using a specific naming convention. The naming convention we follow for the password is `{ega_inbox}_password`. So for example, if your EGA inbox (or "username") is `ega-box-123`, our Python code looks for a secret stored with the name `ega-box-123_password`.
In order for our scripts to work, the corresponding password for the `ega-box-123` account must be stored in Secret Manager under the name `ega-box-123_password`.
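For reference, fetching a password that follows this naming convention looks roughly like the following with the google-cloud-secret-manager client. This is a sketch rather than the repository's actual code, and the inbox name is just the example above:

```python
# Sketch of reading an EGA password from Secret Manager following the
# {ega_inbox}_password naming convention. Not the repository's actual code.
from google.cloud import secretmanager

PROJECT_ID = "sc-ega-submissions"
ega_inbox = "ega-box-123"  # example inbox from the text above

client = secretmanager.SecretManagerServiceClient()
secret_name = f"projects/{PROJECT_ID}/secrets/{ega_inbox}_password/versions/latest"
password = client.access_secret_version(request={"name": secret_name}).payload.data.decode("UTF-8")
```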
If you have a new account that needs a new password stored, you'll first need permissions to add new Google secrets. Reach out to Sam Bryant to obtain these permissions.
Once you have the correct permissions, you can navigate to the `sc-ega-submissions` Secret Manager console page, click "Create Secret" at the top of the page, and follow the prompts for storing a new password. Once the new password is saved, you can then follow the directions outlined above for submitting your Terra workflows.
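If you'd rather create the secret programmatically than through the console, a sketch using the same client might look like this (again illustrative; the inbox and password are placeholders):

```python
# Sketch of creating a new {ega_inbox}_password secret and storing its first
# version. Illustrative only; the inbox and password here are placeholders.
from google.cloud import secretmanager

PROJECT_ID = "sc-ega-submissions"
ega_inbox = "ega-box-456"   # placeholder inbox
password = "change-me"      # placeholder password

client = secretmanager.SecretManagerServiceClient()
parent = f"projects/{PROJECT_ID}"

secret = client.create_secret(
    request={
        "parent": parent,
        "secret_id": f"{ega_inbox}_password",
        "secret": {"replication": {"automatic": {}}},
    }
)
client.add_secret_version(
    request={"parent": secret.name, "payload": {"data": password.encode("UTF-8")}}
)
```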
The metadata fields are derived as follows:
- aggregation_path: Pulled from on-prem servers for bams, and from Clio for crams.
- analysis_type: Uses the `analysis_type` field from the `picard.read_group_index` and `picard.rapid_qc_index` tables.
- avg_mean_insert_size: Takes the average of the `MEAN_INSERT_SIZE` field from the `metrics.aggregation_insert_size` table.
- avg_standard_deviation: Takes the average of the `STANDARD_DEVIATION` field from the `metrics.aggregation_insert_size` table.
- group_library_name: Takes each distinct `library_name` from the query and concatenates them with "_".
- illumina_instrument: The `illumina_instrument` field from the `slxasync.runs` table.
- library_layout: Uses the `PAIRED_RUN` field from the `picard.read_group_index` and `picard.rapid_qc_index` tables. Returns 'single' when == 0, else returns 'paired'.
- library_selection:
  - Case 1: 'RANDOM' if `index.analysis_type` = 'Resequencing' AND `index.library_type` = 'WholeGenomeShotgun'.
  - Case 2: 'cDNA' if `index.analysis_type` = 'cDNA'.
  - Case 3: 'Hybrid Selection' if `index.library_type` = 'HybridSelection'.
  - Case 4: 'RANDOM' if `index.library_type` = 'WholeGenomeShotgun'.
  - Case 5: 'OTHER' if the above conditions are not met.
- library_source:
  - Case 1: 'GENOMIC' if `sample_material_type` equals any of these values ('Genomic DNA', 'DNA:DNA Genomic', 'DNA:DNA Somatic', 'DNA:DNA WGA Qiagen').
  - Case 2: 'TRANSCRIPTOMIC' if `sample_material_type` equals any of these values ('RNA:Total RNA', 'RNA').
  - Case 3: 'VIRAL RNA' if `sample_material_type` equals 'RNA:ss Viral RNA'.
  - Case 4: 'OTHER' if the above conditions are not met.
- library_strategy:
  - Case 1: 'WGS' if `library_type` = 'WholeGenomeShotgun'.
  - Case 2: 'RNA-Seq' if `library_type` equals any of these values ('cDNAShotgunStrandAgnostic', 'cDNAShotgunReadTwoSense').
  - Case 3: 'WXS' if `library_type` = 'HybridSelection'.
  - Case 4: 'OTHER' if the above conditions are not met.
- library_type: Pulled from the `library_type` field of the index tables.
- md5: Pulled from Clio for crams and on-prem for bams, since the files live next to each other.
- paired_run: Pulled from the `paired_run` field of the index tables.
- processing_location: Defined by the user via the portal form.
- run_file_type: 'bam' if `processing_location` = 'OnPrem', else returns 'cram'.
- sample_alias: Passed into the TSV in the portal.
- sample_material_type: Uses the `sample_material_type` field from the `picard.read_group_index` and `picard.rapid_qc_index` tables.
The query used to pull these fields, where `{index_table}` is substituted with the relevant index table (`read_group_index` or `rapid_qc_index`), is:
select a.sample, a.processing_location, i.library_type, i.sample_material_type,
i.analysis_type, i.paired_run, runs.illumina_instrument,
AVG(ais.MEAN_INSERT_SIZE) AS avg_mean_insert_size,
AVG(ais.STANDARD_DEVIATION) AS avg_standard_deviation,
LISTAGG(DISTINCT i.LIBRARY_NAME, '_') AS group_library_name,
CASE
WHEN i.library_type = 'WholeGenomeShotgun' THEN 'WGS'
WHEN i.library_type IN ('cDNAShotgunStrandAgnostic', 'cDNAShotgunReadTwoSense') THEN 'RNA-Seq'
WHEN i.library_type = 'HybridSelection' THEN 'WXS'
ELSE 'OTHER'
END AS library_strategy,
CASE
WHEN i.sample_material_type IN ('Genomic DNA', 'DNA:DNA Genomic', 'DNA:DNA Somatic', 'DNA:DNA WGA Qiagen') THEN 'GENOMIC'
WHEN i.sample_material_type IN ('RNA:Total RNA', 'RNA') THEN 'TRANSCRIPTOMIC'
WHEN i.sample_material_type = 'RNA:ss Viral RNA' THEN 'VIRAL RNA'
ELSE 'OTHER'
END AS library_source,
CASE
WHEN i.analysis_type = 'Resequencing' AND i.library_type = 'WholeGenomeShotgun' THEN 'RANDOM'
WHEN i.analysis_type = 'cDNA' THEN 'cDNA'
WHEN i.library_type = 'HybridSelection' THEN 'Hybrid Selection'
WHEN i.library_type = 'WholeGenomeShotgun' THEN 'RANDOM'
ELSE 'OTHER'
END AS library_selection,
CASE
WHEN i.PAIRED_RUN = 0 THEN 'SINGLE' ELSE 'PAIRED'
END AS library_layout,
CASE
WHEN a.processing_location = 'OnPrem' THEN 'bam' ELSE 'cram'
END AS run_file_type
from metrics.aggregation a
join metrics.aggregation_read_group r on r.aggregation_id = a.id
join picard.{index_table} i on i.flowcell_barcode = r.flowcell_barcode
AND i.lane = r.lane
AND i.library_name = r.library_name
join metrics.aggregation_insert_size ais on ais.aggregation_id = a.id
join slxasync.runs runs ON i.RUN_NAME = runs.RUN_NAME
where a.PROJECT = 'value from the portal'
and a.SAMPLE = 'value from the portal'
and a.VERSION = 'pulled from index table'
and a.DATA_TYPE = 'value from the portal'
and a.library is null
and a.is_latest=1
GROUP BY
a.sample,
a.processing_location,
i.library_type,
i.analysis_type,
i.sample_material_type,
i.paired_run,
runs.illumina_instrument;