GSEA Pipeline

This pipeline provides comprehensive tools for Gene Set Enrichment Analysis (GSEA), including preprocessing, running analyses, and post-processing workflows. It integrates multiple pathway enrichment methods for robust gene set analysis.

Overview

The GSEA Pipeline consists of three layers:

Data Retrieval & Preparation: Download gene sets and prepare input data
Enrichment Analysis: Run GSEA using multiple algorithms (GSEA CLI, PADOG, CAMERA)
Results Processing: Merge, validate, and filter enrichment results

Scripts by Category

1. Data Preparation & Retrieval

`retrieveMSigDB.py`

Retrieves gene sets from the MSigDB (Molecular Signatures Database) for enrichment analysis.

Usage:

python retrieveMSigDB.py -o output_directory

Requirements:

Python 3.x
Internet connection for database access

`final_genes.sh`

Extracts gene symbols from .grp files and appends them to a final output file. Useful for tracking genes included in the analysis.

Usage:

./final_genes.sh

Requirements:

Bash shell
.grp files from GSEA output

2. GSEA Execution (Deprecated but maintained for reference)

These scripts execute GSEA analysis. Note: Newer workflows (PADOG, CAMERA) are now preferred for most analyses, but these remain available for specific use cases.

GSEA Installation & Configuration: Before running any GSEA scripts, ensure GSEA is properly installed. Refer to: GSEA User Guide - Running from Command Line

Key requirements:

Java installed and accessible
gsea-cli.sh script in your PATH or explicitly referenced
Gene annotation files (.chip) matching your organism
Gene set matrix files (.gmt) from MSigDB

`GSEA_script.sh`

Test version of GSEA execution with multiple comparisons built-in across multi collections. Useful for validating your GSEA setup before running large analyses.

Configuration: Edit the script to set:

working_dir: Base directory for all GSEA data
expression_data: Path to expression data file (.txt)
phenotype_info: Path to phenotype/class file (.cls)
matrix_dir: Directory containing gene set files (.gmt)
chip_annotations: Directory containing annotation files (.chip)
output_dir: Where results will be saved

Parameters:

nperm: Number of permutations (default: 10,000)
permute: Permutation type (gene_set or phenotype)
metric: Gene ranking metric (Signal2Noise or log2_Ratio_of_Classes)

Key features:

5 example comparisons (Ap_versus_An, Bp_versus_An, etc.)
C3 TF/Legacy gene sets (MSigDB v2023.2)
Proper handling of small sample sizes (permute: gene_set)
Parallel execution with background jobs
Timing information for monitoring

Usage:

./GSEA_script.sh

Customization: Edit the script to modify:

Gene set collections (change c3.tft.tft_legacy.v2023.2.Hs.symbols.gmt)
Comparisons array
Output labels
Analysis parameters

3. Results Processing

`GSEA_merge.sh`

Merges and renames GSEA result files from multiple collections and comparisons into consolidated TSV files. Handles proper header inclusion to avoid duplication.

Structure:

Input:
  main_dir/
  ├── Collection1/
  │   ├── Comparison1_[timestamp]/
  │   │   ├── gsea_report_for_POSITIVE.tsv
  │   │   └── gsea_report_for_NEGATIVE.tsv
  │   └── Comparison2_[timestamp]/
  └── Collection2/
      └── ...

Output:
  destination_dir/
  ├── Collection1_Comparison1_merged.tsv
  ├── Collection1_Comparison2_merged.tsv
  ├── Collection2_Comparison1_merged.tsv
  └── ...

Configuration: Edit the script to set:

main_dir: Base directory containing GSEA results by collection
destination_dir: Where merged files will be saved

Usage:

./GSEA_merge.sh

Features:

Automatic collection and comparison name extraction
Removes timestamps and GSEA-specific formatting
Preserves header row only once in merged files
Handles missing files gracefully

Output: A consolidated file suitable for downstream analysis (filtering, visualization, etc.)

4. Alternative Pathway Enrichment Methods (R Workflows)

For more flexible and statistically robust pathway analysis, we provide R workflows based on established bioinformatics packages.

`padog_analysis.Rmd`

Pathway analysis workflow using PADOG (Pathway Analysis by Differential Gene Expression Ranking).

Method: PADOG ranks genes by differential expression and tests whether pathway genes are significantly ranked higher than expected by chance.

Features:

Supports multiple gene set collections (GMT files)
Multiple comparisons in a single run
Parallel processing for speed
FDR adjustment for multiple testing
Significance filtering and output organization by collection

Input requirements:

Gene expression count matrix (TSV file with gene symbols as rownames)
Metadata file (Excel) with sample annotations and group assignments
Gene set files (GMT format)

Output:

CSV and RDS files per collection
Total consolidated results across all collections
Filtered significant gene sets (FDR < 0.25)

Dependencies:

library(PADOG)
library(readxl)
library(dplyr)
library(stringr)
library(parallel)
library(doParallel)
library(limma)  # for voom

Configuration: Edit the "Define paths" section to set:

main_dir: Base directory for all data
genesets_dir: Directory containing .gmt files
output_dir: Where results will be saved
expression_file: Path to count matrix
metadata_file: Path to sample metadata

Customization:

comparisons: Vector of comparison identifiers (e.g., c("Bp_Ap", "Bp_An", ...))
Nmin: Minimum number of genes in a pathway (default: 3)
NI: Number of permutations (default: 500)
ncr: Number of cores for parallel processing

Usage: Open the Rmd file in RStudio and run chunks sequentially, or knit to HTML:

rmarkdown::render("padog_analysis.Rmd", output_format = "html_document")

`camera_analysis.Rmd`

Gene set enrichment workflow using CAMERA (Correlation-adjusted MEan RAnk gene set test) from the limma package.

Method: CAMERA tests whether genes in a pathway have concordant direction of change using mean rank statistics adjusted for inter-gene correlation.

Features:

Efficient rank-based testing
Accounts for inter-gene correlation
Supports both up and down regulation detection
Multiple collections and comparisons
Trend-based variance modeling

Input requirements:

Gene expression count matrix (TSV file with gene symbols as rownames)
Metadata file (Excel) with sample annotations
Gene set files (GMT format)

Output:

CSV and RDS files per collection
Total consolidated results
Significance-filtered results (FDR < 0.25)

Dependencies:

library(limma)    # for camera() and voom()
library(readxl)
library(dplyr)
library(stringr)

Configuration: Edit the "Define paths" section to set:

main_dir: Base directory for all data
genesets_dir: Directory containing .gmt files
output_dir: Where results will be saved
expression_file: Path to count matrix
metadata_file: Path to sample metadata

Customization:

comparisons: Vector of comparison identifiers
inter.gene.cor: Inter-gene correlation estimate (default: 0.01)
use.ranks: Whether to use gene ranks (default: FALSE)
trend.var: Whether to allow trend in variances (default: TRUE)

Usage: Open the Rmd file in RStudio and run chunks sequentially, or knit:

rmarkdown::render("camera_analysis.Rmd", output_format = "html_document")

`md5compare.py`

Compares MD5 checksums between files to ensure data integrity during transfer or processing.

Usage:

python md5compare.py file1 file2

Output: Reports whether files are identical (matching MD5 checksums) or different.

Workflow Examples

Complete GSEA Analysis (CLI-based)

# 1. Retrieve gene sets
python retrieveMSigDB.py -o ./gene_sets

# 2. Prepare expression data and phenotype file

# 3. Run GSEA 
./GSEA_script.sh

# 4. Merge results
./GSEA_merge.sh

# 5. Extract final genes (optional)
./final_genes.sh

Pathway Analysis with R (Recommended for flexibility)

# 1. Retrieve and prepare gene sets
python retrieveMSigDB.py -o ./gene_sets

# 2. In R/RStudio:
#    - Open padog_analysis.Rmd or camera_analysis.Rmd
#    - Configure paths section
#    - Run analysis chunks
#    - Review results and filtered significant pathways

# 3. Compare results between methods (PADOG, CAMERA)
#    - Both produce CSV and RDS outputs
#    - Allows cross-validation of findings

System Requirements

Bash scripts: Bash shell (Unix/Linux/WSL)
Python scripts: Python 3.6+
R workflows: R 4.0+, RStudio (recommended)
GSEA CLI: Java, GSEA CLI tool installed and configured
R packages: limma, PADOG, readxl, dplyr, stringr, parallel, doParallel

Installation Notes

For WSL/Windows with Git Bash:

# Ensure scripts are executable
chmod +x *.sh

# For R scripts, use RStudio or:
Rscript -e "rmarkdown::render('script.Rmd')"

For Linux/Mac:

All bash scripts should work directly. Ensure gsea-cli.sh is in PATH:

export PATH="/path/to/GSEA:$PATH"

Contributing

When adding new scripts:

Follow the naming convention: METHOD_analysis_[context]_v[number].sh|Rmd
Include a clear header with author, date, and purpose
Document input/output files clearly
Provide example usage in comments
Test before submitting a pull request

References

License

See LICENSE file for details.

Last updated: February 2025 Maintainers: BigMindLab contributors

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

GSEA Pipeline

Overview

Scripts by Category

1. Data Preparation & Retrieval

`retrieveMSigDB.py`

`final_genes.sh`

2. GSEA Execution (Deprecated but maintained for reference)

`GSEA_script.sh`

3. Results Processing

`GSEA_merge.sh`

4. Alternative Pathway Enrichment Methods (R Workflows)

`padog_analysis.Rmd`

`camera_analysis.Rmd`

`md5compare.py`

Workflow Examples

Complete GSEA Analysis (CLI-based)

Pathway Analysis with R (Recommended for flexibility)

System Requirements

Installation Notes

For WSL/Windows with Git Bash:

For Linux/Mac:

Contributing

References

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
old		old
.gitignore		.gitignore
GSEA_merge.sh		GSEA_merge.sh
GSEA_script.sh		GSEA_script.sh
LICENSE		LICENSE
README.md		README.md
camera_analysis.Rmd		camera_analysis.Rmd
final_genes.sh		final_genes.sh
md5compare.py		md5compare.py
padog_analysis.Rmd		padog_analysis.Rmd
retrieveMSigDB.py		retrieveMSigDB.py

Folders and files

Latest commit

History

Repository files navigation

GSEA Pipeline

Overview

Scripts by Category

1. Data Preparation & Retrieval

retrieveMSigDB.py

final_genes.sh

2. GSEA Execution (Deprecated but maintained for reference)

GSEA_script.sh

3. Results Processing

GSEA_merge.sh

4. Alternative Pathway Enrichment Methods (R Workflows)

padog_analysis.Rmd

camera_analysis.Rmd

md5compare.py

Workflow Examples

Complete GSEA Analysis (CLI-based)

Pathway Analysis with R (Recommended for flexibility)

System Requirements

Installation Notes

For WSL/Windows with Git Bash:

For Linux/Mac:

Contributing

References

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`retrieveMSigDB.py`

`final_genes.sh`

`GSEA_script.sh`

`GSEA_merge.sh`

`padog_analysis.Rmd`

`camera_analysis.Rmd`

`md5compare.py`

Packages