This is a Proteomics Data Analysis Platform built with Streamlit that provides comprehensive tools for analyzing protein expression data. The application allows researchers to upload proteomics datasets (CSV/Excel format), perform statistical analysis, and generate interactive visualizations. The platform is designed to handle common proteomics workflows including data preprocessing, normalization, statistical comparisons between experimental conditions, and advanced visualization techniques like heatmaps, PCA plots, and hierarchical clustering.
Preferred communication style: Simple, everyday language.
The application uses Streamlit as the web framework, providing a reactive, single-page application interface. The main application (app.py) follows a modular design pattern where the UI components are organized with a sidebar for data upload and parameter configuration, and the main content area for analysis results and visualizations.
The system implements a modular, object-oriented architecture with three main utility classes:
- ProteomicsDataProcessor: Handles all data preprocessing operations including log transformation, normalization, scaling, and condition parsing from column headers
- StatisticalAnalyzer: Manages statistical computations including t-tests, Wilcoxon rank-sum tests, and multiple testing corrections
- ProteomicsVisualizer: Creates interactive visualizations using Plotly including heatmaps, PCA plots, and clustering dendrograms
The architecture uses session state management to maintain class instances across user interactions, ensuring data persistence during the analysis session.
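As a rough illustration of that persistence pattern, the sketch below creates an object once and reuses it on later reruns. The helper name `get_or_create` and the stand-in class body are hypothetical, not taken from app.py. The helper accepts any dict-like mapping, so the logic can be checked without a running Streamlit session.

```python
from typing import Any, Callable, MutableMapping


def get_or_create(state: MutableMapping[str, Any], key: str,
                  factory: Callable[[], Any]) -> Any:
    """Return state[key], creating it with factory() on first access.

    Hypothetical helper: `state` is meant to be st.session_state, which
    supports dict-style membership tests and item assignment.
    """
    if key not in state:
        state[key] = factory()
    return state[key]


class ProteomicsDataProcessor:
    """Stand-in for the real class; it only remembers the loaded data."""

    def __init__(self) -> None:
        self.data = None
```

Inside the app this would presumably be called as `get_or_create(st.session_state, "processor", ProteomicsDataProcessor)`, so every rerun gets the same instance instead of a fresh one.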
The system follows a scientific data analysis pipeline approach:
- Data Ingestion: Supports CSV and Excel formats with automatic protein identification (first column as index)
- Condition Parsing: Automatically extracts experimental conditions from column headers using the format `condition_replicate`
- Preprocessing: Implements standard proteomics preprocessing including log transformation, normalization, and scaling
- Statistical Analysis: Provides parametric (t-test) and non-parametric (Wilcoxon) testing options
- Visualization: Generates publication-ready interactive plots with clustering capabilities
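The first pipeline steps can be sketched as follows. This is a minimal example under stated assumptions, not the platform's actual implementation. The function names are hypothetical, and so are the pseudocount of 1 and the choice of median centering as the normalization method. The sample values are made up.

```python
import numpy as np
import pandas as pd


def parse_conditions(columns):
    """Map each column name to its condition, assuming `condition_replicate`.

    The split is taken at the last underscore, so a condition name may
    itself contain underscores (an assumption, not confirmed by the source).
    """
    return {col: col.rsplit("_", 1)[0] for col in columns}


def preprocess(df: pd.DataFrame, pseudocount: float = 1.0) -> pd.DataFrame:
    """Log2-transform, then median-center each sample (column)."""
    logged = np.log2(df + pseudocount)
    # Subtract each column's median so all samples share a median of 0.
    return logged - logged.median(axis=0)


# Tiny made-up matrix: proteins as rows, samples as columns.
raw = pd.DataFrame(
    {"control_1": [3.0, 15.0, 63.0], "control_2": [7.0, 31.0, 127.0],
     "treated_1": [1.0, 7.0, 255.0]},
    index=["P1", "P2", "P3"],
)
conditions = parse_conditions(raw.columns)
normalized = preprocess(raw)
```

Here `conditions` maps `control_1` and `control_2` to `control` and `treated_1` to `treated`. A column name without an underscore would be returned unchanged as its own condition.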
- Frontend Framework: Streamlit for rapid web app development
- Data Processing: Pandas and NumPy for data manipulation
- Scientific Computing: SciPy for statistical tests, scikit-learn for preprocessing and dimensionality reduction
- Visualization: Plotly for interactive charts and graphs
- Machine Learning: UMAP and PCA for dimensionality reduction, hierarchical clustering for protein grouping
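To illustrate the dimensionality-reduction step listed above, here is a small PCA sketch with scikit-learn. The matrix is random, made-up data, and scaling each protein before PCA is an assumption about the workflow rather than something the source specifies.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Made-up log-intensity matrix: 50 proteins (rows) x 6 samples (columns).
expression = rng.normal(loc=20.0, scale=2.0, size=(50, 6))

# PCA treats rows as observations, so transpose to samples x proteins,
# then scale each protein to zero mean and unit variance.
samples = StandardScaler().fit_transform(expression.T)

pca = PCA(n_components=2)
coords = pca.fit_transform(samples)  # shape: (6 samples, 2 components)
explained = pca.explained_variance_ratio_
```

The two columns of `coords` are what a PCA scatter plot would show, one point per sample, with `explained` supplying the percentage for each axis label.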
- streamlit: Web application framework for the user interface
- pandas: Data manipulation and analysis library for handling proteomics datasets
- numpy: Numerical computing library for array operations
- plotly: Interactive visualization library for creating charts and graphs
- scipy: Scientific computing library for statistical tests and clustering algorithms
- scikit-learn: Machine learning library for preprocessing, PCA, and clustering
- statsmodels: Statistical modeling library for multiple testing corrections
- umap-learn: UMAP algorithm for non-linear dimensionality reduction
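As a sketch of how scipy and statsmodels might combine for a two-condition comparison, the example below runs a per-protein t-test, a rank-sum test, and a Benjamini-Hochberg correction. The data are simulated. The Welch variant of the t-test and the choice of Benjamini-Hochberg are assumptions, since the source does not name the exact variants used.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(42)
n_proteins = 20
# Made-up log2 intensities: 4 replicates per condition, one row per protein.
control = rng.normal(20.0, 0.5, size=(n_proteins, 4))
treated = rng.normal(20.0, 0.5, size=(n_proteins, 4))
treated[:3] += 5.0  # the first three proteins get a large simulated shift

# Welch's t-test per protein (axis=1 compares across replicates).
_, p_ttest = stats.ttest_ind(control, treated, axis=1, equal_var=False)

# Non-parametric alternative: Mann-Whitney U, i.e. the Wilcoxon rank-sum test.
p_rank = np.array([
    stats.mannwhitneyu(c, t, alternative="two-sided").pvalue
    for c, t in zip(control, treated)
])

# Benjamini-Hochberg correction controls the false discovery rate.
reject, p_adj, _, _ = multipletests(p_ttest, alpha=0.05, method="fdr_bh")
```

One practical note: with four replicates per group, the exact two-sided rank-sum p-value cannot go below 2/70 (about 0.029), so the non-parametric option rarely survives multiple-testing correction on small designs.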
The platform expects proteomics data in CSV or Excel format with:
- First column containing protein identifiers (used as row index)
- Subsequent columns following the `condition_replicate` naming convention for automatic condition parsing
- Numerical expression values in the data matrix
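A file in this layout could be loaded and checked roughly as shown below. The protein identifiers and values are made up, and the two validation checks are hypothetical examples rather than the platform's own rules.

```python
import io

import pandas as pd

# Made-up file contents; a real upload would come from the Streamlit sidebar.
csv_text = """protein,control_1,control_2,treated_1,treated_2
P12345,10.5,11.0,14.2,13.9
Q67890,8.1,7.9,8.3,8.0
"""

# The first column holds protein identifiers, so use it as the row index.
data = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Basic checks on the expected layout (hypothetical validation rules).
non_numeric = [c for c in data.columns
               if not pd.api.types.is_numeric_dtype(data[c])]
bad_names = [c for c in data.columns if "_" not in c]
```

For Excel input, `pd.read_excel(file, index_col=0)` should behave the same way, provided an Excel engine such as openpyxl is installed.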
The application is designed to run entirely locally without requiring external APIs, databases, or cloud services. All computations run in the local Python process that serves the Streamlit app, using only the uploaded data files.