MoTrPAC Proteomics Data Analysis Pipeline
This mass spectrometry-based proteomics data analysis pipeline uses the Workflow Description Language (WDL) to describe its workflows. The pipeline is run using caper, a Python wrapper package for the Cromwell workflow management system.
Two software packages/pipelines are currently supported for peptide identification and quantification:
- MS-GF+ pipeline: uses MASIC to extract reporter ion peaks from MS2 spectra and create selected-ion chromatograms for each MS/MS parent ion, and MS-GF+ for peptide identification. Details of the pipeline can be found here. All MoTrPAC datasets are analyzed with this pipeline.
- MaxQuant: a quantitative proteomics software package designed for analyzing large mass-spectrometric data sets. Originally developed to run only on the Windows operating system, recent releases also support execution on Linux. Users must download MaxQuant and accept the license terms on their local computer to generate the required configuration file, which is then used to run MaxQuant on the cloud for better speed and performance.
The WDL/Cromwell framework is optimized to run pipelines in high-performance computing environments. The MoTrPAC Bioinformatics Center runs pipelines on Google Cloud Platform (GCP). We used a number of fantastic tools developed by our colleagues from the ENCODE project to run pipelines on GCP (and other HPC platforms).
A brief summary of the steps to set up a VM to run the MoTrPAC pipelines on GCP (for details, please check the caper repo):
- Create a GCP account.
- Enable cloud APIs.
- Install the Google Cloud SDK (Software Development Kit) on your local machine.
- Create a service account and download the key file to your local computer (e.g. “service-account-191919.json”).
- Create a bucket for pipeline inputs and outputs (e.g. gs://pipelines/); see the example commands after this list. Note: a GCP bucket is similar to a folder or storage unit on your computer, but it is stored on Google's servers in the cloud instead of on your local computer.
- Set up a VM on GCP: create a Virtual Machine (VM) instance from which the pipelines will be run. We recommend the create_instance.sh script available in the caper repo. To use it, clone the caper repo on your local machine and run the following command:
```bash
$ bash create_instance.sh [INSTANCE_NAME] [PROJECT_ID] [GCP_SERVICE_ACCOUNT_KEY_JSON_FILE] [GCP_OUT_DIR]

# Example for the pipeline:
./create_instance.sh pipeline-instance your-gcp-project-name service-account-191919.json gs://pipelines/results/
```
- Install gcsfuse to mount the bucket on the VM, then mount it by running: `gcsfuse --implicit-dirs pipelines pipelines`
- Finally, clone this repo.
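A minimal sketch of these last steps (creating the bucket, installing and using gcsfuse, and cloning the repo), assuming the example bucket name gs://pipelines/, a Debian/Ubuntu VM image, and a repository URL inferred from the directory name used later in this document:

```bash
# 1. Create the pipeline bucket (from your local machine or Cloud Shell)
gsutil mb -p your-gcp-project-name gs://pipelines/

# 2. On the VM: install gcsfuse (Debian/Ubuntu instructions; see the gcsfuse
#    docs for other images)
export GCSFUSE_REPO=gcsfuse-$(lsb_release -c -s)
echo "deb https://packages.cloud.google.com/apt $GCSFUSE_REPO main" | sudo tee /etc/apt/sources.list.d/gcsfuse.list
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
sudo apt-get update && sudo apt-get install -y gcsfuse

# 3. Mount the bucket (first argument is the bucket name, second the mount point)
mkdir -p pipelines
gcsfuse --implicit-dirs pipelines pipelines

# 4. Clone this repo (URL assumed from the repository name; adjust if needed)
git clone https://github.com/MoTrPAC/motrpac-proteomics-pipeline.git
```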
Several software packages are required to run the proteomics pipelines. All of them are pre-installed in Docker containers, which are publicly available in the Artifact Registry. To find out more about these containers, check the readme.
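As an illustration only, an image can be pulled from the Artifact Registry with Docker; the registry path below is a hypothetical placeholder (the actual image names and tags are listed in the containers readme):

```bash
# Configure Docker to authenticate against the Artifact Registry host (only
# needed once), then pull an image. The path below is a made-up example.
gcloud auth configure-docker us-docker.pkg.dev
docker pull us-docker.pkg.dev/example-project/proteomics/masic:v1.0
```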
Each step of the MS-GF+ pipeline has a parameter file with multiple options. The default options are recommended, but users can adjust them. The final parameter folder must be copied to the pipeline bucket (gs://pipeline/parameters).
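For example, the parameter folder can be copied to the bucket with gsutil (the paths below simply mirror the example locations in this document; adjust them to your setup):

```bash
# Copy the local parameter folder to the pipeline bucket; the folder ends up
# at gs://pipeline/parameters
gsutil -m cp -r parameters gs://pipeline/
```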
A configuration file (in JSON format) is required to analyze a particular dataset with the pipeline. This configuration file contains several key-value pairs that specify the inputs and outputs of the workflow, the location of the input files, pipeline parameters, the sequence database, the Docker containers, the execution environment, and other settings needed for execution.
The recommended way to generate the configuration files is to run the create_config_[software].py script. Check this link to find out more.
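Purely as an illustration of the file's shape, the sketch below writes a configuration file by hand; the key names are hypothetical placeholders, not the actual inputs of proteomics_msgfplus.wdl, which is why generating the file with the create_config script is preferred:

```bash
# Illustrative only: these keys are hypothetical placeholders. Use the
# create_config_[software].py script to generate a valid configuration file.
cat > config-file.json <<'EOF'
{
  "proteomics_msgfplus.raw_files_folder": "gs://pipelines/raw/experiment_01/",
  "proteomics_msgfplus.parameters_folder": "gs://pipeline/parameters",
  "proteomics_msgfplus.sequence_db": "gs://pipelines/sequence_db/database.fasta",
  "proteomics_msgfplus.masic_docker": "us-docker.pkg.dev/example-project/proteomics/masic:v1.0",
  "proteomics_msgfplus.output_folder": "gs://pipelines/results/experiment_01/"
}
EOF
```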
Connect to the VM and submit a job by running the command:

```bash
caper run motrpac-proteomics-pipeline/wdl/proteomics_msgfplus.wdl -i pipeline/config-file.json
```

and check the job status by running:

```bash
caper list
```
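Since caper run keeps the terminal attached until the workflow finishes, a common pattern is to launch it in the background and monitor the log; a sketch (the log file name is just an example):

```bash
# Launch the workflow in the background and keep a log file
nohup caper run motrpac-proteomics-pipeline/wdl/proteomics_msgfplus.wdl \
    -i pipeline/config-file.json > caper_run.log 2>&1 &

# Follow progress in the log...
tail -f caper_run.log

# ...or list the workflows caper knows about
caper list
```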
A number of scripts are available in this repo that provide additional functionality for interacting with GCP. Please check this file to find out more.