DeepGO-SE is a novel method that predicts GO functions from protein sequences using a pretrained large language model combined with a neuro-symbolic model that exploits GO axioms and performs protein function prediction as a form of approximate semantic entailment.
This repository contains the scripts used to build and train the DeepGO-SE model, together with scripts for evaluating its performance.
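To make the "approximate semantic entailment" idea concrete: DeepGO-SE trains several models and treats a prediction as entailed when it holds across all of them. The toy sketch below is our own illustration, not the repository's code; min-pooling is the strict reading of "true in every world model", and mean-pooling is shown as a softer alternative.

```python
def aggregate_entailment(model_scores, agg="min"):
    """Combine per-model truth estimates for one GO class.

    model_scores: scores in [0, 1], one per trained model.
    A statement is approximately entailed when it holds in every
    learned world model, hence "min"; "mean" is a softer variant.
    (Illustrative sketch only, not DeepGO-SE's actual implementation.)
    """
    if agg == "min":
        return min(model_scores)
    if agg == "mean":
        return sum(model_scores) / len(model_scores)
    raise ValueError(f"unknown aggregation: {agg}")
```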
- The code was developed and tested using python 3.10.
- Clone the repository:
git clone https://github.com/bio-ontology-research-group/deepgo2.git
- Create virtual environment with Conda or python3-venv module.
- Install PyTorch:
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2
- Install DGL:
pip install dgl==1.1.2+cu117 -f https://data.dgl.ai/wheels/cu117/repo.html
- Install other requirements:
pip install -r requirements.txt
Follow these instructions to obtain predictions for your proteins. You will need around 30 GB of storage and a GPU with more than 16 GB of memory (or you can run on a CPU).
- Download the data.tar.gz
- Extract it:
tar xvzf data.tar.gz
- Run the model
python predict.py -if data/example.fa
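predict.py takes a FASTA file via the -if flag. If you need to prepare such a file programmatically, a minimal stand-alone parser for standard FASTA (our sketch, not the repository's loader) looks like this:

```python
def read_fasta(text):
    """Parse FASTA-formatted text into (protein_id, sequence) pairs.

    The ID is taken as the first whitespace-separated token after '>'.
    """
    records, pid, seq = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if pid is not None:
                records.append((pid, "".join(seq)))
            pid, seq = line[1:].split()[0], []
        else:
            seq.append(line)
    if pid is not None:
        records.append((pid, "".join(seq)))
    return records
```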
We also provide a docker container with all dependencies installed:
docker pull coolmaksat/deepgose
The repository is installed in the /deepgo2 directory. To run the scripts you'll
need to mount the data directory. Example:
docker run --gpus all -v $(pwd)/data:/workspace/deepgo2/data coolmaksat/deepgose python predict.py -if data/example.fa
To train the models and reproduce our results:
- Download the training-data.tar.gz
- The training data includes both the UniProtKB/SwissProt dataset and the neXtProt evaluation dataset.
- go.obo, go.norm, go-plus.norm - Gene Ontology and normalized axiom files
- mf, bp and cc subfolders include:
- train_data.pkl - training proteins
- valid_data.pkl - validation proteins
- test_data.pkl - testing proteins
- nextprot_data.pkl - neXtProt dataset proteins (not present in the cc subfolder)
- terms.pkl - list of GO terms for each subontology
- interpros.pkl - list of InterPro IDs used as features
- ppi.bin, ppi_nextprot.bin - PPI graphs saved with DGL library
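The *.pkl files can be inspected with pandas. The snippet below builds and reloads a toy stand-in for one of them; the column names are illustrative guesses, not the repository's actual schema:

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for a file like mf/train_data.pkl; columns are hypothetical.
df = pd.DataFrame({
    "proteins": ["P12345", "Q67890"],
    "prop_annotations": [["GO:0003674"], ["GO:0008150", "GO:0003674"]],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train_data.pkl")
    df.to_pickle(path)
    loaded = pd.read_pickle(path)  # this is how the real files can be opened

print(loaded.shape)
```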
- train.py and train_gat.py are used to train different versions of the DeepGO-SE and DeepGOGAT-SE models, respectively.
- train_cnn.py, train_mlp.py and train_dgg.py are used to train the baseline models DeepGOCNN, MLP and DeepGraphGO.
- Examples:
- Train a single DeepGOZero MFO prediction model which uses InterPro annotation features
python train.py -m deepgozero -ont mf
- Train a single DeepGOZero CCO prediction model which uses ESM2 embeddings
python train.py -m deepgozero_esm -ont cc
- Train a single DeepGOGAT BPO prediction model which uses predicted MF features
python train_gat.py -m deepgogat_mfpreds_plus -ont bp
- Training 10 models for entailment:
- DeepGO-SE models:
./train_se.sh <model_name> <ontology>
- DeepGOGAT-SE models:
./train_gat_se.sh <model_name> <ontology>
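A hedged sketch of what a wrapper like train_se.sh might do: launch the same training command repeatedly so the entailment step can aggregate over ten trained models. Any flags beyond -m and -ont are deliberately omitted, since we only know the two shown above; the commands are echoed rather than executed.

```shell
# Illustrative sketch only, not the repository's train_se.sh.
train_se() {
  model="$1"
  ont="$2"
  for i in $(seq 1 10); do
    # In the real script this would run training; here we just print it.
    echo "python train.py -m ${model} -ont ${ont}  # model ${i} of 10"
  done
}

train_se deepgozero_esm mf
```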
The training scripts generate predictions for the test data that are used to compute evaluation metrics.
- To evaluate single predictions run evaluate.py script. Example:
python evaluate.py -m mlp -td test -ont mf
- To evaluate approximate entailment predictions use evaluate_entailment.py
script. Example:
python evaluate_entailment.py -m deepgozero_esm -td nextprot -ont cc
Note: this script requires multiple trained models with performance reports on the validation set.
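The evaluation scripts report protein-centric metrics; the CAFA-style Fmax used in this setting can be sketched as follows. This is a simplified illustration of the metric, not the repository's exact implementation:

```python
def fmax(labels, preds, thresholds=None):
    """Protein-centric Fmax (simplified CAFA-style sketch).

    labels: list of sets of true GO terms, one set per protein
    preds:  list of dicts {go_term: score}, one dict per protein
    """
    if thresholds is None:
        thresholds = [t / 100 for t in range(1, 100)]
    best = 0.0
    for t in thresholds:
        precisions, recalls, n_with_preds = [], [], 0
        for true_terms, scores in zip(labels, preds):
            pred_terms = {g for g, s in scores.items() if s >= t}
            if pred_terms:
                # Precision is averaged only over proteins with predictions.
                n_with_preds += 1
                precisions.append(len(pred_terms & true_terms) / len(pred_terms))
            if true_terms:
                recalls.append(len(pred_terms & true_terms) / len(true_terms))
        if n_with_preds == 0 or not recalls:
            continue
        p = sum(precisions) / n_with_preds
        r = sum(recalls) / len(recalls)
        if p + r > 0:
            best = max(best, 2 * p * r / (p + r))
    return best
```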
The data used to train our models is available for download. However, if you would like to generate a new dataset, follow these steps:
- Download the Gene Ontology.
You'll need the go.obo, go.owl and go-plus.owl files; save them into the data folder.
We use Groovy scripts to process the ontology files. Install Groovy by
following the instructions here
and execute the following commands:
- Normalize GO:
groovy groovy/Normalize.groovy -i data/go.owl -o data/go.norm
- Filter out GO-Plus non EL axioms:
groovy groovy/makeEL.groovy data/go-plus.owl data/go-plus-el.owl
- Normalize GO-Plus:
groovy groovy/Normalize.groovy -i data/go-plus-el.owl -o data/go-plus.norm
- Download UniProt-KB data and save it to the data folder.
- Download the StringDB v11.0 file protein.actions.v11.0.txt.gz
- Install Diamond
- Run data generation script:
sh generate_data.sh
If you use DeepGO-SE in your research, or incorporate our learning algorithms into your work, please cite: Maxat Kulmanov, Francisco J. Guzman-Vega, Paula Duek, Lydie Lane, Stefan T. Arold, Robert Hoehndorf. DeepGO-SE: Protein function prediction as approximate semantic entailment.