
Protein Structure Analysis Project

Project Overview

This repository contains various implementations and experiments for protein structure analysis using different deep learning approaches.

Reproducible Pipeline (Binding Site Prediction)

The run_pipeline.py script provides an end-to-end, reproducible pipeline extracted from GraphSAGE-improving.ipynb. It supports multiple GNN backbones: GraphSAGE, GCN, and GAT.

Quick Start

# Full run with default config (GraphSAGE)
python run_pipeline.py --config configs/graphsage_default.yaml

# Select backbone via CLI
python run_pipeline.py --config configs/graphsage_default.yaml --model gcn
python run_pipeline.py --config configs/graphsage_default.yaml --model gat

# Smoke test (minimal data for quick verification)
python run_pipeline.py --config configs/graphsage_default.yaml --smoke

CLI Options

  • --config: Path to YAML config (default: built-in defaults)
  • --model: GNN backbone (graphsage, gcn, or gat)
  • --device: cuda or cpu
  • --seed: Random seed for reproducibility
  • --save-dir: Output directory for checkpoints and metrics
  • --smoke: Smoke test with 4 train + 2 test samples
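For reproducibility, --seed should pin every random number generator the run touches. A minimal stdlib-only sketch of the idea (a real run would additionally seed NumPy via np.random.seed and PyTorch via torch.manual_seed; this is an illustration, not the pipeline's actual seeding code):

```python
import random

def set_seed(seed: int) -> None:
    # Seed Python's RNG; a full pipeline would also seed
    # numpy (np.random.seed) and torch (torch.manual_seed).
    random.seed(seed)

set_seed(42)
first = [random.random() for _ in range(3)]
set_seed(42)
second = [random.random() for _ in range(3)]
print(first == second)  # prints True: re-seeding reproduces the same draws
```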

Data Paths

Update configs/graphsage_default.yaml or pass a custom config:

  • train_csv: Training CSV with prot_id, sequence, labels (list format)
  • test_csv: Test CSV with same columns
  • pdb_dir: Directory containing PDB files (e.g. {prot_id}.pdb or {prot_id}_alphafold.pdb)
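A hypothetical config illustrating these three keys (the comments and file paths are examples, not the repo's actual defaults):

```yaml
# Paths -- adjust to your data layout
train_csv: data/train.csv        # columns: prot_id, sequence, labels
test_csv: data/test.csv          # same columns as train_csv
pdb_dir: data/esmFold_pdb_files  # contains {prot_id}.pdb files
```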

Artifacts

Outputs are saved under artifacts/ (or --save-dir):

  • {backbone}_best_model.pth: Best model checkpoint
  • run_metadata.json: Config, metrics, timestamp
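run_metadata.json can be inspected after a run. A small helper, assuming only that the file is plain JSON (the keys in the example below are illustrative):

```python
import json
import tempfile
from pathlib import Path

def load_run_metadata(save_dir: str) -> dict:
    """Load the pipeline's run_metadata.json from a save directory."""
    path = Path(save_dir) / "run_metadata.json"
    with path.open() as fh:
        return json.load(fh)

# Example: write and read back a metadata file in a scratch directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "run_metadata.json").write_text(
        json.dumps({"config": {"model": "graphsage"}, "timestamp": "2024-01-01"})
    )
    meta = load_run_metadata(tmp)
    print(meta["config"]["model"])  # prints graphsage
```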

Pipeline Layout

  • pipeline/config.py – Configuration dataclasses
  • pipeline/io.py – Data loading and path resolution
  • pipeline/embeddings.py – ESM-2 tokenization and embeddings
  • pipeline/graph_features.py – Structure features and graph construction
  • pipeline/models.py – GCN, GraphSAGE, GAT backbones
  • pipeline/losses.py – Loss functions and binding features
  • pipeline/train.py – Training loop
  • pipeline/evaluate.py – Evaluation metrics

Existing helper scripts (data_preparation.py, features_extraction.py, alphafold_data_ingestion.py, etc.) remain for creating multi-label and processed datasets.

Repository Structure

  • run_pipeline.py: Main CLI for binding site prediction
  • pipeline/: Modular pipeline package
  • configs/: YAML configuration files
  • GraphSAGE-improving.ipynb: Original notebook (reference)
  • data_preparation.py, features_extraction.py, etc.: Dataset preparation helpers

Important Notice

Before running any code:

  1. Data Paths: Update paths in configs/graphsage_default.yaml or your config
  2. PDB Files: Ensure data/esmFold_pdb_files (or your pdb_dir) contains PDB files for all protein IDs in train/test CSVs
  3. Execution: run_pipeline.py handles embeddings, graph construction, training, and evaluation in one command
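Item 2 can be verified up front. A small sanity check, assuming the CSV has a prot_id column as described under Data Paths (the function name is ours, not part of the repo):

```python
import csv
from pathlib import Path

def missing_pdbs(csv_path: str, pdb_dir: str) -> list[str]:
    """Return prot_ids from the CSV that have no matching PDB file."""
    pdb_root = Path(pdb_dir)
    missing = []
    with open(csv_path, newline="") as fh:
        for row in csv.DictReader(fh):
            pid = row["prot_id"]
            # Accept either naming scheme mentioned under Data Paths.
            if not ((pdb_root / f"{pid}.pdb").exists()
                    or (pdb_root / f"{pid}_alphafold.pdb").exists()):
                missing.append(pid)
    return missing
```

Run it on both train_csv and test_csv before launching the pipeline; an empty list means every protein ID has a structure file.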

Prerequisites

  • Python: 3.10 or 3.11 is recommended (broad wheel support for PyTorch and scientific stacks; some packages skip 3.9 or cap below 3.12).
  • Hardware: GPU optional; CUDA builds of PyTorch are separate from this repo.

Installation (reproducible setup)

Structure-related code (pipeline/graph_features.py, features_extraction.py, utils.py) imports MDTraj and may call DSSP (secondary structure). Follow the path that matches your OS.

Option A — Conda (recommended, especially on Windows)

Conda-forge provides prebuilt mdtraj and the mkdssp (DSSP) binary, avoiding MSVC compiler errors and missing PyPI packages.

conda create -n esm-orion python=3.11 -y
conda activate esm-orion
conda config --env --add channels conda-forge
conda config --env --set channel_priority strict
conda install -y mdtraj mkdssp
pip install -r requirements.txt
  • mdtraj: On Windows, pip install mdtraj often falls back to a source build and fails without Microsoft C++ Build Tools. Installing mdtraj from conda-forge avoids that.
  • mkdssp: DSSP is a standalone program, not a Python package named dssp on PyPI. Do not add dssp to requirements.txt. After conda install mkdssp, the executable should be on your PATH inside the env.
  • Biopython DSSP: features_extraction.py uses DSSP(..., dssp='/usr/bin/dssp') for the Biopython-based path. On Windows, point that argument to your conda env’s mkdssp.exe (for example under %CONDA_PREFIX%\Scripts\). The MDTraj-based helpers use md.compute_dssp and rely on mkdssp on PATH.

Option B — pip-only (Linux / macOS, or Windows with C++ Build Tools)

python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -U pip
pip install -r requirements.txt
  • Windows + pip: If mdtraj tries to compile from source, install Visual Studio Build Tools (Desktop development with C++) or prefer Option A.
  • DSSP binary: Still required for md.compute_dssp / Biopython DSSP. Install mkdssp from your OS package manager, conda-forge, or build from source, and ensure it is on PATH.

Optional checks

python -c "import mdtraj as md; print('mdtraj', md.__version__)"
mkdssp --version   # or: dssp --version

Contact

For questions or issues, please open a GitHub issue in this repository.