This repository contains various implementations and experiments for protein structure analysis using different deep learning approaches.
The run_pipeline.py script provides an end-to-end, reproducible pipeline extracted from GraphSAGE-improving.ipynb. It supports multiple GNN backbones: GraphSAGE, GCN, and GAT.
# Full run with default config (GraphSAGE)
python run_pipeline.py --config configs/graphsage_default.yaml
# Select backbone via CLI
python run_pipeline.py --config configs/graphsage_default.yaml --model gcn
python run_pipeline.py --config configs/graphsage_default.yaml --model gat
# Smoke test (minimal data for quick verification)
python run_pipeline.py --config configs/graphsage_default.yaml --smoke

| Option | Description |
|---|---|
| --config | Path to YAML config (default: built-in defaults) |
| --model | GNN backbone: graphsage, gcn, gat |
| --device | Device: cuda or cpu |
| --seed | Random seed for reproducibility |
| --save-dir | Output directory for checkpoints and metrics |
| --smoke | Smoke test with 4 train + 2 test samples |
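
The options above could be wired up with argparse roughly as follows. This is a minimal sketch, not the actual parser in run_pipeline.py: the function name build_parser and the choice to default every option to None (so that config values win unless a flag is passed) are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented CLI options."""
    parser = argparse.ArgumentParser(description="Binding site prediction pipeline")
    parser.add_argument("--config", default=None,
                        help="Path to YAML config (built-in defaults if omitted)")
    parser.add_argument("--model", choices=["graphsage", "gcn", "gat"], default=None,
                        help="GNN backbone")
    parser.add_argument("--device", choices=["cuda", "cpu"], default=None,
                        help="Device to run on")
    parser.add_argument("--seed", type=int, default=None,
                        help="Random seed for reproducibility")
    parser.add_argument("--save-dir", default=None,
                        help="Output directory for checkpoints and metrics")
    parser.add_argument("--smoke", action="store_true",
                        help="Smoke test on a minimal data subset")
    return parser


# Parse an explicit argument list instead of sys.argv so the example is reproducible.
args = build_parser().parse_args(
    ["--config", "configs/graphsage_default.yaml", "--model", "gcn", "--smoke"]
)
print(args.model, args.smoke, args.save_dir)
```

Restricting --model and --device with choices makes argparse reject typos such as --model sage before any data is loaded.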
Update configs/graphsage_default.yaml or pass a custom config:
- train_csv: Training CSV with prot_id, sequence, labels (list format)
- test_csv: Test CSV with the same columns
- pdb_dir: Directory containing PDB files (e.g. {prot_id}.pdb or {prot_id}_alphafold.pdb)
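
To illustrate the expected CSV layout, here is a small sketch that reads rows with the three documented columns. It assumes that "list format" means the labels cell holds a Python-style list literal such as "[0, 1, 0]"; the helper name read_samples and the sample row are hypothetical, and the pipeline's own loader in pipeline/io.py may work differently.

```python
import ast
import csv
import io


def read_samples(csv_file):
    """Yield (prot_id, sequence, labels) from an open CSV file.

    Assumption: the labels cell is a list literal like "[0, 1, 0]".
    """
    for row in csv.DictReader(csv_file):
        # literal_eval parses the list safely without executing arbitrary code.
        labels = ast.literal_eval(row["labels"])
        if not isinstance(labels, list):
            raise ValueError(f"labels for {row['prot_id']} is not a list")
        yield row["prot_id"], row["sequence"], labels


# Hypothetical example row; real files would be opened with open(path, newline="").
example = io.StringIO('prot_id,sequence,labels\nP12345,MKT,"[0, 1, 0]"\n')
samples = list(read_samples(example))
print(samples)
```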
Outputs are saved under artifacts/ (or --save-dir):
- {backbone}_best_model.pth: Best model checkpoint
- run_metadata.json: Config, metrics, timestamp
- pipeline/config.py – Configuration dataclasses
- pipeline/io.py – Data loading and path resolution
- pipeline/embeddings.py – ESM-2 tokenization and embeddings
- pipeline/graph_features.py – Structure features and graph construction
- pipeline/models.py – GCN, GraphSAGE, GAT backbones
- pipeline/losses.py – Loss functions and binding features
- pipeline/train.py – Training loop
- pipeline/evaluate.py – Evaluation metrics
Existing helper scripts (data_preparation.py, features_extraction.py, alphafold_data_ingestion.py, etc.) remain for creating multi-label and processed datasets.
- run_pipeline.py: Main CLI for binding site prediction
- pipeline/: Modular pipeline package
- configs/: YAML configuration files
- GraphSAGE-improving.ipynb: Original notebook (reference)
- data_preparation.py, features_extraction.py, etc.: Dataset preparation helpers
Before running any code:
- Data Paths: Update paths in configs/graphsage_default.yaml or your config
- PDB Files: Ensure data/esmFold_pdb_files (or your pdb_dir) contains PDB files for all protein IDs in train/test CSVs
- Execution: run_pipeline.py handles embeddings, graph construction, training, and evaluation in one command
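
The PDB check in the list above can be scripted before launching a run. The following is a minimal sketch, not part of the pipeline: the helper name find_missing_pdbs is hypothetical, and it assumes the two file-name patterns given as examples for pdb_dir ({prot_id}.pdb and {prot_id}_alphafold.pdb) are the only ones in use.

```python
import csv
import tempfile
from pathlib import Path


def find_missing_pdbs(csv_path, pdb_dir,
                      patterns=("{prot_id}.pdb", "{prot_id}_alphafold.pdb")):
    """Return the prot_id values in csv_path that have no PDB file in pdb_dir."""
    pdb_dir = Path(pdb_dir)
    missing = []
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            prot_id = row["prot_id"]
            candidates = [pdb_dir / p.format(prot_id=prot_id) for p in patterns]
            if not any(c.is_file() for c in candidates):
                missing.append(prot_id)
    return missing


# Self-contained demo with throwaway files; P2 deliberately has no structure.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "pdbs").mkdir()
    (tmp / "pdbs" / "P1.pdb").write_text("")
    (tmp / "pdbs" / "P3_alphafold.pdb").write_text("")
    csv_path = tmp / "train.csv"
    csv_path.write_text(
        'prot_id,sequence,labels\n'
        'P1,MK,"[0, 1]"\n'
        'P2,MK,"[0, 1]"\n'
        'P3,MK,"[1, 0]"\n'
    )
    missing = find_missing_pdbs(csv_path, tmp / "pdbs")

print(missing)  # ['P2']
```

Running the same function on both the train and test CSVs before training surfaces missing structures early, rather than partway through graph construction.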
- Python: 3.10 or 3.11 is recommended (broad wheel support for PyTorch and scientific stacks; some packages skip 3.9 or cap below 3.12).
- Hardware: GPU optional; CUDA builds of PyTorch are separate from this repo.
Structure-related code (pipeline/graph_features.py, features_extraction.py, utils.py) imports MDTraj and may call DSSP (secondary structure). Follow the path that matches your OS.
Option A (conda): Conda-forge provides prebuilt mdtraj and the mkdssp (DSSP) binary, which avoids MSVC compiler errors and the lack of a DSSP package on PyPI.
conda create -n esm-orion python=3.11 -y
conda activate esm-orion
conda config --env --add channels conda-forge
conda config --env --set channel_priority strict
conda install -y mdtraj mkdssp
pip install -r requirements.txt

- mdtraj: On Windows, pip install mdtraj often falls back to a source build and fails without Microsoft C++ Build Tools. Installing mdtraj from conda-forge avoids that.
- mkdssp: DSSP is a standalone program, not a Python package named dssp on PyPI. Do not add dssp to requirements.txt. After conda install mkdssp, the executable should be on your PATH inside the env.
- Biopython DSSP: features_extraction.py uses DSSP(..., dssp='/usr/bin/dssp') for the Biopython-based path. On Windows, point that argument to your conda env’s mkdssp.exe (for example under %CONDA_PREFIX%\Scripts\). The MDTraj-based helpers use md.compute_dssp and rely on mkdssp on PATH.
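
Instead of hardcoding /usr/bin/dssp, the DSSP binary can be looked up on PATH at runtime. This is a minimal sketch using only the standard library; the helper name find_dssp_executable is hypothetical and is not something features_extraction.py currently defines.

```python
import shutil


def find_dssp_executable(candidates=("mkdssp", "dssp")):
    """Return the full path of the first DSSP binary found on PATH, or None."""
    for name in candidates:
        # shutil.which also resolves .exe on Windows via PATHEXT.
        path = shutil.which(name)
        if path:
            return path
    return None


dssp_path = find_dssp_executable()
if dssp_path is None:
    print("No DSSP binary found on PATH; install mkdssp first.")
else:
    print("Using DSSP at", dssp_path)
```

The returned path could then be passed as the dssp argument of Biopython's DSSP class in place of the fixed Linux path, assuming the surrounding code is adapted accordingly.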
Option B (pip in a virtual environment):
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -U pip
pip install -r requirements.txt

- Windows + pip: If mdtraj tries to compile from source, install Visual Studio Build Tools (Desktop development with C++) or prefer Option A.
- DSSP binary: Still required for md.compute_dssp / Biopython DSSP. Install mkdssp from your OS package manager, conda-forge, or build from source, and ensure it is on PATH.
python -c "import mdtraj as md; print('mdtraj', md.__version__)"
mkdssp --version # or: dssp --version

For questions or issues, please open a GitHub issue in this repository.