Skip to content

samvictordr/reefgene-exp-sim

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

12 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Coral Transcriptome Differential Expression Analysis Suite

A comprehensive computational pipeline for simulating and analyzing coral reef transcriptomic responses to environmental stressors, implementing standardized RNA-seq differential expression workflows with interactive visualization capabilities.

Overview

A bioinformatics tool designed for marine genomics researchers studying cnidarian transcriptional responses to environmental perturbations. The application provides a complete analytical framework for coral gene expression profiling, incorporating established methodologies from the field of marine molecular ecology and transcriptomics.

Key Features

  • Transcriptome Simulation Engine: Generates realistic coral gene expression count matrices with configurable parameters mimicking RNA-seq data characteristics
  • Differential Expression Analysis: Implements Welch's t-test with Benjamini-Hochberg FDR correction for identifying stress-responsive genes
  • CPM Normalization Pipeline: Counts per million (CPM) transformation with log₂ scaling for variance stabilization
  • Principal Component Analysis: Dimensionality reduction for sample clustering and batch effect detection
  • Interactive Volcano Plots: Statistical significance visualization with log₂ fold-change thresholds
  • Expression Heatmaps: Hierarchical clustering of top differentially expressed genes (DEGs)

Methodology

Data Simulation Framework

The simulation engine generates synthetic coral transcriptome data using Poisson-distributed count models:

  • Baseline Expression: Control samples drawn from Poisson(λ=50) distribution
  • Stress Response Modeling:
    • Upregulated genes (n=10): Additional Poisson(λ=40) counts
    • Downregulated genes (n=10): Reduced by Poisson(λ=20) counts
    • Background genes: Maintain baseline expression levels

Statistical Analysis Pipeline

  1. Count Matrix Loading: Import expression data in standard CSV format with genes as rows and samples as columns

  2. CPM Normalization:

    CPM = (raw_counts / library_size) × 10⁶
    log₂CPM = log₂(CPM + 1)
    
  3. Differential Expression Testing:

    • Statistical method: Welch's two-sample t-test (unequal variances)
    • Multiple testing correction: Benjamini-Hochberg FDR (α = 0.05)
    • Effect size: log₂ fold-change calculation with pseudocount adjustment
  4. Dimensionality Reduction:

    • Principal Component Analysis on log₂CPM-transformed data
    • Variance explained reporting for PC1 and PC2

Visualization Components

  • Volcano Plot: -log₁₀(p-value) vs log₂(fold-change) with significance thresholding
  • PCA Biplot: Sample ordination colored by treatment group with variance contribution
  • Expression Heatmap: Z-score normalized expression of top 25 DEGs with hierarchical clustering

Installation & Dependencies

System Requirements

  • Python ≥ 3.8
  • 4GB RAM minimum (recommended: 8GB for large datasets)

Package Dependencies

pip install -r requirements.txt

Core Libraries:

  • streamlit: Web application framework
  • pandas: Data manipulation and analysis
  • numpy: Numerical computing
  • scipy: Statistical functions
  • statsmodels: Advanced statistical modeling
  • scikit-learn: Machine learning and PCA implementation
  • plotly: Interactive plotting
  • seaborn: Statistical data visualization
  • matplotlib: Publication-quality figures

Usage

Web Application Launch

streamlit run app.py

The application will launch in your default web browser at http://localhost:8501.

Data Input Options

  1. Simulated Data Generation (Default):

    • Toggle "Use existing data" checkbox OFF
    • Generates fresh mock expression matrix (200 genes × 6 samples)
    • Sample naming convention: Ctrl_1, Ctrl_2, Ctrl_3, Stress_1, Stress_2, Stress_3
  2. Custom Data Upload:

    • Toggle "Use existing data" checkbox ON
    • Requires CSV format with gene identifiers as row names
    • Column headers must follow Ctrl_* and Stress_* naming pattern

Data Format Specifications

Input CSV Structure:

gene_id,Ctrl_1,Ctrl_2,Ctrl_3,Stress_1,Stress_2,Stress_3
Gene_001,45,52,48,89,76,84
Gene_002,67,61,59,45,52,48
...

Requirements:

  • Gene identifiers in first column
  • Sample names containing "Ctrl" for control samples
  • Sample names containing "Stress" for treatment samples
  • Raw count data (non-negative integers)

Interpretation Guidelines

Volcano Plot Analysis

  • X-axis: log₂ fold-change (positive = upregulated in stress, negative = downregulated)
  • Y-axis: -log₁₀(p-value) (higher = more statistically significant)
  • Color coding:
    • Red/highlighted points: FDR-adjusted p-value < 0.05
    • Gray points: Non-significant genes

PCA Interpretation

  • Sample clustering: Groups of samples with similar expression profiles
  • Treatment separation: Distance between control and stress sample clusters
  • Variance explained: Percentage of total expression variance captured by each PC
  • Quality assessment: Tight within-group clustering indicates good biological replication

Heatmap Analysis

  • Rows: Top 25 most significantly differentially expressed genes
  • Columns: Individual sample replicates
  • Color scale: Z-score normalized expression (red = high, blue = low)
  • Clustering: Hierarchical clustering reveals gene co-expression patterns

Statistical Significance Thresholds

  • p-value: Raw statistical significance from t-test
  • FDR-adjusted p-value: Multiple testing corrected significance (α = 0.05)
  • Fold-change: Biological effect size (typical threshold: |log₂FC| > 1)

Output Data

The application generates several data products:

  1. Differential Expression Results Table:

    • Gene identifiers
    • log₂ fold-change values
    • Raw p-values
    • FDR-adjusted p-values
    • Ranked by statistical significance
  2. Interactive Visualizations:

    • Volcano plot (HTML/SVG export compatible)
    • PCA biplot with variance statistics
    • Expression heatmap with clustering dendrograms

License

This project is released under the Creative Commons CC0 1.0 Universal License - see the LICENSE file for details.

About

Coral Reef Gene Expression Simulator built on streamlit and seaborn.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages