We developed a novel context-only dependent protein function prediction method by leveraging the transformer model on bacterial genomic context. This repository contains scripts which we used to train BERT model, along with scripts we used for function prediction and evaluation of contextual approach.
- The code was developed and tested using Python 3.9.13
- Clone the repository
git clone https://github.com/bio-ontology-research-group/Genomic_context.git
- Create conda environment
conda create --name genomic_context python=3.9.13
- Activate your environment
conda activate genomic_context
- Install dependencies
pip install -r requirements.txt
- The training data we used in this study is deposited in Zenodo database under accession code 10.5281/zenodo.13932747 (https://doi.org/10.5281/zenodo.13932747). Data includes NLP formatted genomes, cluster-representative protein sequences and mmseqs2 clustering results.
- BERT_word2vec_benchmark - contains scripts to run BERT and word2vec evaluations. The genome corpus for evaluation can be obtained via following link. Pre-trained BERT model exported to HF Hub
- Defense_InterPro's - contains tsv files with InterPro ID's annotating bacterial defense systems. Data obtained from InterPro website
- Secretion_InterPro's - contains tsv file with InterPro ID's annotating bacterial secretion systems. Data obtained from InterPro website
If you find this work useful in your work, please cite our paper:
@article{Toibazar2024,
title = {Context-based protein function prediction in bacterial genomes},
url = {http://dx.doi.org/10.1101/2024.10.14.618363},
DOI = {10.1101/2024.10.14.618363},
publisher = {Cold Spring Harbor Laboratory},
author = {Toibazar, Daulet and Kulmanov, Maxat and Hoehndorf, Robert},
year = {2024},
month = oct
}