A Multi-modal LLM-Knowledge Fusion Framework for Predicting Single-cell Genetic Perturbation Effects
scPert, a multi-modal transformer framework that integrates large language model embeddings with structured biological knowledge to predict single-cell transcriptomic responses to genetic perturbations. Through hierarchical fusion of knowledge graph representations, contextual embeddings from foundation models, and gene-specific encodings, scPert achieves significant performance improvements in both single-gene and combinatorial perturbations over existing methods. In cancer-relevant applications, scPert demonstrates the capability to reveal p53 pathway dynamics and immune checkpoint regulatory mechanisms. Systematic evaluation on 42 cancer dependency genes demonstrates scPert's ability to identify critical potential therapeutic targets. Our framework establishes a powerful computational foundation for virtual cell construction and accelerates drug target discovery.
Install PyG, and then do pip install scpert
- anndata==0.9.2
- scanpy==1.9.8
- torch==2.3.0
- torch-geometric==2.6.1
- scvi-tools==0.20.3
- pandas==2.0.3
- numpy==1.24.4
- scipy==1.10.1
- cell-gears==0.0.2
- nvidia-cublas-cu12==12.1.3.1
- nvidia-cudnn-cu12==8.9.2.26
- flash_attn==0.2.8
embedding_dir: Directory containing gene embedding files (.npy)
data_path: Base directory for perturbation datasets
model_path: Pretrained model directory containing model.pt
pert_file:CSV file specifying perturbation pairs with columns: gene1,gene2
You can train scPert on your perturbation dataset simply running the Python script:
python ./scripts/train.py
You can use scPert to predict single gene or gene pairs perturbations by running the scripts:
python ./scripts/infer.py
Using API Interface:
from scpert import predict,train,ProcePertdata,scpert
pertData = ProcePertdata(data_path)
pertData.load(DataName = 'norman')
# training
train(data_path, DataName, model_save_path, embedding_path)
# predict
predict(model_path, data_path, pert_file_path, model_name='norman', device='auto',embedding_path, output_dir='./results')
This work is currently under peer review.
