This repository contains various implementations and experiments for protein structure analysis using different deep learning approaches.
The `run_pipeline.py` script provides an end-to-end, reproducible pipeline extracted from `GraphSAGE-improving.ipynb`. It supports multiple GNN backbones: GraphSAGE, GCN, and GAT.
```bash
# Full run with default config (GraphSAGE)
python run_pipeline.py --config configs/graphsage_default.yaml

# Select backbone via CLI
python run_pipeline.py --config configs/graphsage_default.yaml --model gcn
python run_pipeline.py --config configs/graphsage_default.yaml --model gat

# Smoke test (minimal data for quick verification)
python run_pipeline.py --config configs/graphsage_default.yaml --smoke
```

| Option | Description |
|---|---|
| `--config` | Path to YAML config (default: built-in defaults) |
| `--model` | GNN backbone: `graphsage`, `gcn`, or `gat` |
| `--device` | Device: `cuda` or `cpu` |
| `--seed` | Random seed for reproducibility |
| `--save-dir` | Output directory for checkpoints and metrics |
| `--smoke` | Smoke test with 4 train + 2 test samples |
Update `configs/graphsage_default.yaml` or pass a custom config:

- `train_csv`: Training CSV with `prot_id`, `sequence`, and `labels` (list format) columns
- `test_csv`: Test CSV with the same columns
- `pdb_dir`: Directory containing PDB files (e.g. `{prot_id}.pdb` or `{prot_id}_alphafold.pdb`)
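A minimal custom config might look like the sketch below. The file name and paths are illustrative; only the three keys documented above are assumed to exist, and any other settings should be taken from `configs/graphsage_default.yaml`.

```yaml
# configs/my_experiment.yaml — illustrative paths; only train_csv,
# test_csv, and pdb_dir are documented keys of this pipeline
train_csv: data/train_binding_sites.csv
test_csv: data/test_binding_sites.csv
pdb_dir: data/esmFold_pdb_files
```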
Outputs are saved under artifacts/ (or --save-dir):
- `{backbone}_best_model.pth`: Best model checkpoint
- `run_metadata.json`: Config, metrics, and timestamp
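Based on the description above, `run_metadata.json` might look roughly like this; the field names, nesting, and values are an assumption for illustration, not the pipeline's exact schema:

```json
{
  "config": {"model": "graphsage", "seed": 42},
  "metrics": {"test_auc": 0.0, "test_f1": 0.0},
  "timestamp": "1970-01-01T00:00:00"
}
```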
- `pipeline/config.py` – Configuration dataclasses
- `pipeline/io.py` – Data loading and path resolution
- `pipeline/embeddings.py` – ESM-2 tokenization and embeddings
- `pipeline/graph_features.py` – Structure features and graph construction
- `pipeline/models.py` – GCN, GraphSAGE, and GAT backbones
- `pipeline/losses.py` – Loss functions and binding features
- `pipeline/train.py` – Training loop
- `pipeline/evaluate.py` – Evaluation metrics
Existing helper scripts (`data_preparation.py`, `features_extraction.py`, `alphafold_data_ingestion.py`, etc.) remain for creating multi-label and processed datasets.
- `run_pipeline.py`: Main CLI for binding site prediction
- `pipeline/`: Modular pipeline package
- `configs/`: YAML configuration files
- `GraphSAGE-improving.ipynb`: Original notebook (reference)
- `data_preparation.py`, `features_extraction.py`, etc.: Dataset preparation helpers
Before running any code:
- **Data Paths**: Update paths in `configs/graphsage_default.yaml` or your custom config
- **PDB Files**: Ensure `data/esmFold_pdb_files` (or your `pdb_dir`) contains PDB files for all protein IDs in the train/test CSVs
- **Execution**: `run_pipeline.py` handles embeddings, graph construction, training, and evaluation in one command
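Before a full run, it can be worth verifying that every protein ID in the CSVs has a matching PDB file. The standalone sketch below is not part of the pipeline; it assumes the `prot_id` column and the `{prot_id}.pdb` / `{prot_id}_alphafold.pdb` naming conventions described in the configuration section.

```python
import csv
from pathlib import Path

def missing_pdbs(csv_path, pdb_dir, id_col="prot_id"):
    """Return protein IDs from csv_path with no PDB file in pdb_dir.

    Accepts either {prot_id}.pdb or {prot_id}_alphafold.pdb, matching
    the file-naming conventions mentioned in the config section.
    """
    pdb_dir = Path(pdb_dir)
    missing = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            pid = row[id_col]
            if not ((pdb_dir / f"{pid}.pdb").exists()
                    or (pdb_dir / f"{pid}_alphafold.pdb").exists()):
                missing.append(pid)
    return missing
```

Running it against both the train and test CSVs before launching `run_pipeline.py` catches missing structures early, instead of partway through embedding or graph construction.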
- Python 3.9+
- Required packages: `pip install -r requirements.txt` (PyTorch, torch-geometric, transformers, mdtraj, pykan)
For questions or issues, please open a GitHub issue in this repository.