Dense Enzyme Retrieval (DEER)

Supporting code for the paper
"Exploring Functional Insights into the Human Gut Microbiome via the Structural Proteome"
(Cell Host & Microbe, 2026)

About this repository

DEER (Dense Enzyme Retrieval) provides a method for finding functionally related human-bacteria isozymes using learned dense vector representations (embeddings). This repository contains the code, pre-trained models, and example data necessary to reproduce the results and apply DEER to new enzyme sequences, as presented in our paper.

Installation

First, clone the repository:

git clone https://github.com/WangJiuming/deer.git
cd deer

We recommend using Conda for managing dependencies. Choose one of the following options based on your hardware:

Option 1. GPU with Flash Attention (Recommended)

If your hardware supports Flash Attention (see the official Flash Attention repository [1] for compatibility), this option offers significant speed-ups.

conda env create --name deer --file env/env_gpu_fa.yml

Option 2. Standard GPU

If your GPU is not compatible with Flash Attention, use this standard GPU installation.

conda env create --name deer --file env/env_gpu.yml

Option 3. CPU Only

If no GPU is available, you can install the CPU-only version. Note that this will be significantly slower than GPU versions.

conda env create --name deer --file env/env_cpu.yml

After installation using any of the above options, activate the Conda environment:

conda activate deer
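Whichever option was installed, a short runtime probe can confirm which backend is actually usable. The sketch below only checks that the flash_attn and torch packages are importable and that CUDA is visible; it does not validate the full environment:

```python
import importlib.util

def detect_backend():
    """Return the fastest available backend, mirroring the three install options."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "gpu_flash_attention"  # Option 1: Flash Attention build is importable
    if importlib.util.find_spec("torch") is not None:
        import torch
        if torch.cuda.is_available():
            return "gpu"  # Option 2: standard CUDA-enabled PyTorch
    return "cpu"  # Option 3: CPU-only fallback

print(detect_backend())
```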

Getting started

Follow these steps to download the necessary resources and run the example enzyme retrieval task.

1. Download resources

1.1 Model checkpoints

By default, the model automatically downloads the required checkpoint files from the Hugging Face repository into the local folder ./ckpt. The following procedure is therefore optional and only needed if the automatic download fails.

To download the pre-trained model checkpoints manually for inference, run the following.

huggingface-cli download cuhkaih/deer --local-dir ./ckpt

Alternatively, in case the above link is unavailable, the checkpoint can also be downloaded manually using this link.

After the download, the ./ckpt/ directory should contain the following core files:

  • saprot_35m/: Files required for the underlying SaProt protein language model [2].
  • esm2_t12_35M_UR50D/: Files required for the underlying ESM2 language model [3].
  • deer_checkpoint.ckpt: The pre-trained DEER model checkpoint.

1.2 Dataset

Additionally, we provide a working example dataset to demonstrate the retrieval process. This dataset contains 5,849 enzyme structures and was used for benchmarking in our paper. To download the dataset, run the following.

huggingface-cli download cuhkaih/deer data.zip --local-dir ./ --repo-type dataset

In case the above link is unavailable, the example dataset can also be downloaded manually using this link.

Then decompress the file.

unzip data.zip

The data/ directory should now contain:

  • example/template_pdb/: PDB files for the 1,636 eukaryotic template enzymes.
  • example/database_pdb/: PDB files for the 4,213 bacterial enzymes in the search database.
  • example/metadata.csv: Metadata for all of the PDB files.

2. Running the retrieval example

To run retrieval with a group of template structures against a database using the default options:

python do_retrieval.py --template_pdb_dir ./data/example/template_pdb/ \
                       --database_pdb_dir ./data/example/database_pdb/

The full list of options is available via the --help argument.

python do_retrieval.py --help

Note that if Flash Attention is installed, the --use_fa flag can be set to accelerate the process. Currently, the model supports single-GPU or CPU retrieval, which can be selected by setting the environment variable CUDA_VISIBLE_DEVICES when running the script.

  • To use a specific GPU device:
CUDA_VISIBLE_DEVICES="0" python do_retrieval.py ...
  • To run CPU-only inference:
CUDA_VISIBLE_DEVICES="" python do_retrieval.py ...
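The same device selection can be made from Python, e.g. in a wrapper script, by setting the variable before torch is imported (a sketch; CUDA_VISIBLE_DEVICES only takes effect if set before CUDA is initialized):

```python
import os

# Hide all GPUs so inference falls back to CPU; set this before importing torch.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

# To pin a single GPU instead, use its index:
# os.environ["CUDA_VISIBLE_DEVICES"] = "0"
```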

Results are saved to ./results/similarity.csv by default, a CSV file (readable as a Pandas DataFrame) with the columns:

  • eukaryota_id: Identifier for the template enzyme.
  • bacteria_id: Identifier for the bacterial enzyme.
  • distance: Euclidean distance between embeddings. Lower distance indicates higher similarity.

Note that if multiple templates are used, the retrieval results for all templates are saved and sorted together in a single file; users can separate them per template during further processing.
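Since all templates land in one sorted file, a common post-processing step is to split the results per template and keep only the nearest hit for each. A sketch using pandas with hypothetical identifiers (the column names match the list above):

```python
import io
import pandas as pd

# Hypothetical rows mimicking ./results/similarity.csv.
csv_text = """eukaryota_id,bacteria_id,distance
hs_enzyme_a,bact_x,0.12
hs_enzyme_b,bact_z,0.08
hs_enzyme_a,bact_y,0.45
"""

df = pd.read_csv(io.StringIO(csv_text))

# Nearest bacterial hit per template: sort by distance, then take the
# first row of each eukaryota_id group (row order within groups is preserved).
top1 = (df.sort_values("distance")
          .groupby("eukaryota_id", as_index=False)
          .first())
print(top1)
```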

Citation

If you use DEER or this codebase in your research, please cite our paper:

@article{liu2026exploring,
  title={Exploring functional insights into the human gut microbiome via the structural proteome},
  author={Liu, Hongbin and Shen, Juntao and Zhang, Zhiwei and Wang, Jiuming and Zhang, Chengxin and Zheng, Linggang and Ni, Haoran and Hong, Liang and Zhang, Jieqiong and Xue, Dongfang and others},
  journal={Cell Host \& Microbe},
  volume={34},
  number={1},
  pages={167--185},
  year={2026},
  publisher={Elsevier}
}

References

[1] Dao, Tri, et al. "FlashAttention: Fast and memory-efficient exact attention with IO-awareness." Advances in Neural Information Processing Systems 35 (2022): 16344-16359.

[2] Su, Jin, et al. "SaProt: Protein language modeling with structure-aware vocabulary." bioRxiv (2023).

[3] Lin, Zeming, et al. "Language models of protein sequences at the scale of evolution enable accurate structure prediction." bioRxiv (2022).
