
Overview

This is a Proteomics Data Analysis Platform built with Streamlit that provides tools for analyzing protein expression data. Researchers can upload proteomics datasets in CSV or Excel format, perform statistical analysis, and generate interactive visualizations. The platform supports common proteomics workflows, including data preprocessing, normalization, statistical comparisons between experimental conditions, and visualization techniques such as heatmaps, PCA plots, and hierarchical clustering.

User Preferences

Preferred communication style: Simple, everyday language.

System Architecture

Frontend Architecture

The application uses Streamlit as the web framework, providing a reactive, single-page application interface. The main application (app.py) follows a modular design: a sidebar handles data upload and parameter configuration, while the main content area displays analysis results and visualizations.
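A minimal sketch of this layout, assuming hypothetical widget labels (the actual app.py may organize these differently):

```python
import streamlit as st

# Sidebar: data upload and parameter configuration
with st.sidebar:
    uploaded = st.file_uploader("Upload dataset", type=["csv", "xlsx"])
    alpha = st.number_input("Significance threshold", value=0.05)

# Main content area: analysis results and visualizations
st.title("Proteomics Data Analysis Platform")
if uploaded is not None:
    st.write("Dataset loaded; results and plots would render here.")
```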

Backend Architecture

The system implements a modular, object-oriented architecture with three main utility classes:

  • ProteomicsDataProcessor: Handles all data preprocessing operations including log transformation, normalization, scaling, and condition parsing from column headers
  • StatisticalAnalyzer: Manages statistical computations including t-tests, Wilcoxon rank-sum tests, and multiple testing corrections
  • ProteomicsVisualizer: Creates interactive visualizations using Plotly including heatmaps, PCA plots, and clustering dendrograms

The architecture uses session state management to maintain class instances across user interactions, ensuring data persistence during the analysis session.
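A minimal sketch of this session-state pattern, assuming the three classes are importable from a hypothetical utils package:

```python
import streamlit as st
# Hypothetical import path; the real module layout may differ.
from utils import ProteomicsDataProcessor, StatisticalAnalyzer, ProteomicsVisualizer

# Streamlit re-runs the script on every interaction, so instantiate the
# utility classes once and keep them in session state for persistence.
if "processor" not in st.session_state:
    st.session_state.processor = ProteomicsDataProcessor()
    st.session_state.analyzer = StatisticalAnalyzer()
    st.session_state.visualizer = ProteomicsVisualizer()

processor = st.session_state.processor  # reused across re-runs
```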

Data Processing Pipeline

The system follows a scientific data analysis pipeline approach:

  1. Data Ingestion: Supports CSV and Excel formats with automatic protein identification (first column as index)
  2. Condition Parsing: Automatically extracts experimental conditions from column headers using the format condition_replicate (see the sketch after this list)
  3. Preprocessing: Implements standard proteomics preprocessing including log transformation, normalization, and scaling
  4. Statistical Analysis: Provides parametric (t-test) and non-parametric (Wilcoxon) testing options
  5. Visualization: Generates publication-ready interactive plots with clustering capabilities
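A minimal sketch of the condition-parsing step; the helper name and the exact splitting rule here are illustrative:

```python
def parse_conditions(columns):
    """Group sample columns by condition, assuming 'condition_replicate' headers."""
    conditions = {}
    for col in columns:
        # Split on the last underscore so condition names may contain underscores.
        condition, _, _replicate = col.rpartition("_")
        conditions.setdefault(condition, []).append(col)
    return conditions

# {'control': ['control_1', 'control_2'], 'treated': ['treated_1', 'treated_2']}
print(parse_conditions(["control_1", "control_2", "treated_1", "treated_2"]))
```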

Technology Stack

  • Frontend Framework: Streamlit for rapid web app development
  • Data Processing: Pandas and NumPy for data manipulation
  • Scientific Computing: SciPy for statistical tests, scikit-learn for preprocessing and dimensionality reduction
  • Visualization: Plotly for interactive charts and graphs
  • Machine Learning: UMAP and PCA for dimensionality reduction, hierarchical clustering for protein grouping
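A minimal sketch of the dimensionality-reduction step, assuming a samples-by-proteins matrix (the data and parameter choices are illustrative):

```python
import numpy as np
import umap
from sklearn.decomposition import PCA

# Illustrative expression matrix: 20 samples x 100 proteins
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 100))

# Linear projection with scikit-learn's PCA
pca_coords = PCA(n_components=2).fit_transform(X)

# Non-linear embedding with umap-learn
umap_coords = umap.UMAP(n_components=2, n_neighbors=5, random_state=42).fit_transform(X)
```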

External Dependencies

Python Libraries

  • streamlit: Web application framework for the user interface
  • pandas: Data manipulation and analysis library for handling proteomics datasets
  • numpy: Numerical computing library for array operations
  • plotly: Interactive visualization library for creating charts and graphs
  • scipy: Scientific computing library for statistical tests and clustering algorithms
  • scikit-learn: Machine learning library for preprocessing, PCA, and clustering
  • statsmodels: Statistical modeling library for multiple testing corrections
  • umap-learn: UMAP algorithm for non-linear dimensionality reduction
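A minimal sketch of how scipy and statsmodels typically combine for testing and multiple testing correction (the data here are simulated purely for illustration):

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

# Simulated expression values: 50 proteins x 3 replicates per condition
rng = np.random.default_rng(1)
control = rng.normal(0.0, 1.0, size=(50, 3))
treated = rng.normal(0.5, 1.0, size=(50, 3))

# Per-protein Welch t-test between conditions
pvals = stats.ttest_ind(control, treated, axis=1, equal_var=False).pvalue

# Benjamini-Hochberg FDR correction from statsmodels
reject, qvals, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
```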

Data Format Requirements

The platform expects proteomics data in CSV or Excel format with:

  • First column containing protein identifiers (used as row index)
  • Subsequent columns following condition_replicate naming convention for automatic condition parsing
  • Numerical expression values in the data matrix
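An illustrative example of the expected layout, loaded the way the platform describes (the protein IDs and values are made up):

```python
import io
import pandas as pd

# Hypothetical file contents following the expected layout
csv_text = """Protein,control_1,control_2,treated_1,treated_2
P12345,20.1,19.8,23.4,22.9
Q67890,18.2,18.5,18.1,18.3
"""

# The first column of protein identifiers becomes the row index
df = pd.read_csv(io.StringIO(csv_text), index_col=0)
```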

No External Services

The application is designed to run entirely locally without requiring external APIs, databases, or cloud services. All computations run in the local Streamlit process on the machine hosting the app, using only the uploaded data files.