
Protein Structure Analysis Project

Project Overview

This repository contains various implementations and experiments for protein structure analysis using different deep learning approaches.

Reproducible Pipeline (Binding Site Prediction)

The run_pipeline.py script provides an end-to-end, reproducible pipeline extracted from GraphSAGE-improving.ipynb. It supports multiple GNN backbones: GraphSAGE, GCN, and GAT.

Quick Start

# Full run with default config (GraphSAGE)
python run_pipeline.py --config configs/graphsage_default.yaml

# Select backbone via CLI
python run_pipeline.py --config configs/graphsage_default.yaml --model gcn
python run_pipeline.py --config configs/graphsage_default.yaml --model gat

# Smoke test (minimal data for quick verification)
python run_pipeline.py --config configs/graphsage_default.yaml --smoke

CLI Options

  • --config: Path to YAML config (default: built-in defaults)
  • --model: GNN backbone (graphsage, gcn, or gat)
  • --device: cuda or cpu
  • --seed: Random seed for reproducibility
  • --save-dir: Output directory for checkpoints and metrics
  • --smoke: Smoke test with 4 train + 2 test samples
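For reproducibility, --seed should pin every random number generator the run touches. A minimal stdlib-only sketch of the idea (a real run would additionally seed NumPy via np.random.seed and PyTorch via torch.manual_seed; this is an illustration, not the pipeline's actual seeding code):

```python
import random

def set_seed(seed: int) -> None:
    # Seed Python's RNG; a full pipeline would also seed
    # numpy (np.random.seed) and torch (torch.manual_seed).
    random.seed(seed)

set_seed(42)
first = [random.random() for _ in range(3)]
set_seed(42)
second = [random.random() for _ in range(3)]
print(first == second)  # prints True: re-seeding reproduces the same draws
```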

Data Paths

Update configs/graphsage_default.yaml or pass a custom config:

  • train_csv: Training CSV with prot_id, sequence, labels (list format)
  • test_csv: Test CSV with same columns
  • pdb_dir: Directory containing PDB files (e.g. {prot_id}.pdb or {prot_id}_alphafold.pdb)
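A hypothetical config illustrating these three keys (the comments and file paths are examples, not the repo's actual defaults):

```yaml
# Paths -- adjust to your data layout
train_csv: data/train.csv        # columns: prot_id, sequence, labels
test_csv: data/test.csv          # same columns as train_csv
pdb_dir: data/esmFold_pdb_files  # contains {prot_id}.pdb files
```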

Artifacts

Outputs are saved under artifacts/ (or --save-dir):

  • {backbone}_best_model.pth: Best model checkpoint
  • run_metadata.json: Config, metrics, timestamp
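run_metadata.json can be inspected after a run. A small helper, assuming only that the file is plain JSON (the keys in the example below are illustrative):

```python
import json
import tempfile
from pathlib import Path

def load_run_metadata(save_dir: str) -> dict:
    """Load the pipeline's run_metadata.json from a save directory."""
    path = Path(save_dir) / "run_metadata.json"
    with path.open() as fh:
        return json.load(fh)

# Example: write and read back a metadata file in a scratch directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "run_metadata.json").write_text(
        json.dumps({"config": {"model": "graphsage"}, "timestamp": "2024-01-01"})
    )
    meta = load_run_metadata(tmp)
    print(meta["config"]["model"])  # prints graphsage
```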

Pipeline Layout

  • pipeline/config.py – Configuration dataclasses
  • pipeline/io.py – Data loading and path resolution
  • pipeline/embeddings.py – ESM-2 tokenization and embeddings
  • pipeline/graph_features.py – Structure features and graph construction
  • pipeline/models.py – GCN, GraphSAGE, GAT backbones
  • pipeline/losses.py – Loss functions and binding features
  • pipeline/train.py – Training loop
  • pipeline/evaluate.py – Evaluation metrics

Existing helper scripts (data_preparation.py, features_extraction.py, alphafold_data_ingestion.py, etc.) remain for creating multi-label and processed datasets.

Repository Structure

  • run_pipeline.py: Main CLI for binding site prediction
  • pipeline/: Modular pipeline package
  • configs/: YAML configuration files
  • GraphSAGE-improving.ipynb: Original notebook (reference)
  • data_preparation.py, features_extraction.py, etc.: Dataset preparation helpers

Important Notice

Before running any code:

  1. Data Paths: Update paths in configs/graphsage_default.yaml or your config
  2. PDB Files: Ensure data/esmFold_pdb_files (or your pdb_dir) contains PDB files for all protein IDs in train/test CSVs
  3. Execution: run_pipeline.py handles embeddings, graph construction, training, and evaluation in one command
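Item 2 can be verified up front. A small sanity check, assuming the CSV has a prot_id column as described under Data Paths (the function name is ours, not part of the repo):

```python
import csv
from pathlib import Path

def missing_pdbs(csv_path: str, pdb_dir: str) -> list[str]:
    """Return prot_ids from the CSV that have no matching PDB file."""
    pdb_root = Path(pdb_dir)
    missing = []
    with open(csv_path, newline="") as fh:
        for row in csv.DictReader(fh):
            pid = row["prot_id"]
            # Accept either naming scheme mentioned under Data Paths.
            if not ((pdb_root / f"{pid}.pdb").exists()
                    or (pdb_root / f"{pid}_alphafold.pdb").exists()):
                missing.append(pid)
    return missing
```

Run it on both train_csv and test_csv before launching the pipeline; an empty list means every protein ID has a structure file.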

Prerequisites

  • Python: 3.10 or 3.11 is recommended (broad wheel support for PyTorch and scientific stacks; some packages skip 3.9 or cap below 3.12).
  • Hardware: GPU optional; CUDA builds of PyTorch are separate from this repo.

Installation (reproducible setup)

Structure-related code (pipeline/graph_features.py, features_extraction.py, utils.py) imports MDTraj and may call DSSP (secondary structure). Follow the path that matches your OS.

Option A — Conda (recommended, especially on Windows)

Conda-forge provides prebuilt mdtraj and the mkdssp (DSSP) binary, avoiding MSVC compiler errors and missing PyPI packages.

conda create -n esm-orion python=3.11 -y
conda activate esm-orion
conda config --env --add channels conda-forge
conda config --env --set channel_priority strict
conda install -y mdtraj mkdssp
pip install -r requirements.txt
  • mdtraj: On Windows, pip install mdtraj often falls back to a source build and fails without Microsoft C++ Build Tools. Installing mdtraj from conda-forge avoids that.
  • mkdssp: DSSP is a standalone program, not a Python package named dssp on PyPI. Do not add dssp to requirements.txt. After conda install mkdssp, the executable should be on your PATH inside the env.
  • Biopython DSSP: features_extraction.py uses DSSP(..., dssp='/usr/bin/dssp') for the Biopython-based path. On Windows, point that argument to your conda env’s mkdssp.exe (for example under %CONDA_PREFIX%\Scripts\). The MDTraj-based helpers use md.compute_dssp and rely on mkdssp on PATH.

Option B — pip-only (Linux / macOS, or Windows with C++ Build Tools)

python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -U pip
pip install -r requirements.txt
  • Windows + pip: If mdtraj tries to compile from source, install Visual Studio Build Tools (Desktop development with C++) or prefer Option A.
  • DSSP binary: Still required for md.compute_dssp / Biopython DSSP. Install mkdssp from your OS package manager, conda-forge, or build from source, and ensure it is on PATH.

Optional checks

python -c "import mdtraj as md; print('mdtraj', md.__version__)"
mkdssp --version   # or: dssp --version

Contact

For questions or issues, please open a GitHub issue in this repository.