Supporting code for the paper
"Exploring Functional Insights into the Human Gut Microbiome via the Structural Proteome"
(Cell Host & Microbe, 2026)
DEER (Dense Enzyme Retrieval) provides a method for finding functionally related human-bacteria isozymes using learned dense vector representations (embeddings). This repository contains the code, pre-trained models, and example data necessary to reproduce the results and apply DEER to new enzyme sequences, as presented in our paper.
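At its core, DEER's retrieval step is a nearest-neighbor search in embedding space: each enzyme is mapped to a dense vector, and candidates are ranked by Euclidean distance to the template. The minimal NumPy sketch below illustrates that idea only — the random vectors are stand-ins, not actual DEER embeddings, and `retrieve` is a hypothetical helper, not part of this codebase.

```python
import numpy as np

def retrieve(template_emb, database_embs, k=3):
    """Return the indices and distances of the k database embeddings
    closest (in Euclidean distance) to the template embedding."""
    dists = np.linalg.norm(database_embs - template_emb, axis=1)
    order = np.argsort(dists)[:k]
    return order, dists[order]

rng = np.random.default_rng(0)
database = rng.normal(size=(100, 8))                  # stand-in for bacterial enzyme embeddings
template = database[42] + 0.01 * rng.normal(size=8)   # near-duplicate of database entry 42
idx, d = retrieve(template, database, k=3)
print(idx[0])  # entry 42 ranks first, since lower distance means higher similarity
```

The actual model produces the embeddings; `do_retrieval.py` performs this search over all template/database pairs.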
First, clone the repository:
```
git clone https://github.com/WangJiuming/deer.git
cd deer
```
We recommend using Conda for managing dependencies. Choose one of the following options based on your hardware:
Option 1. GPU with Flash Attention (Recommended)
If your hardware supports Flash Attention (see the official Flash Attention repository [1] for compatibility), this option offers significant speed-ups.
```
conda env create --name deer --file env/env_gpu_fa.yml
```
Option 2. Standard GPU
If your GPU is not compatible with Flash Attention, use this standard GPU installation.
```
conda env create --name deer --file env/env_gpu.yml
```
Option 3. CPU Only
If no GPU is available, you can install the CPU-only version. Note that this will be significantly slower than GPU versions.
```
conda env create --name deer --file env/env_cpu.yml
```
After installation using any of the above options, activate the Conda environment:
```
conda activate deer
```
Follow these steps to download the necessary resources and run the example enzyme retrieval task.
The model will automatically download the necessary checkpoint files from the Hugging Face repository to the local folder ./ckpt by default.
The following download procedure is therefore optional and should only be performed if the automatic process fails.
The pre-trained model checkpoints are hosted on Hugging Face. To download them for inference, run the following command.
```
huggingface-cli download cuhkaih/deer --local-dir ./ckpt
```
Alternatively, in case the above link is unavailable, the checkpoint can also be downloaded manually using this link.
After download, the ./ckpt/ directory should now contain the following core files:
- saprot_35m/: Files required for the underlying SaProt protein language model [2].
- esm2_t12_35M_UR50D/: Files required for the underlying ESM2 language model [3].
- deer_checkpoint.ckpt: The pre-trained DEER model checkpoint.
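As a quick sanity check before running inference, a short Python sketch can confirm the checkpoint directory is complete. The expected entries are taken from the listing above; the helper function name is our own, not part of this codebase.

```python
from pathlib import Path

# Expected entries in ./ckpt/ after download (from the listing above).
EXPECTED = ["saprot_35m", "esm2_t12_35M_UR50D", "deer_checkpoint.ckpt"]

def missing_checkpoint_files(ckpt_dir="./ckpt"):
    """Return the expected entries that are absent from ckpt_dir."""
    root = Path(ckpt_dir)
    return [name for name in EXPECTED if not (root / name).exists()]

if __name__ == "__main__":
    missing = missing_checkpoint_files()
    if missing:
        print("Missing from ./ckpt:", ", ".join(missing))
    else:
        print("Checkpoint directory looks complete.")
```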
Additionally, we provide a working example dataset to demonstrate the retrieval process. This dataset contains 5,849 enzyme structures and was used for benchmarking in our paper. To download the dataset, run the following command.
```
huggingface-cli download cuhkaih/deer data.zip --local-dir ./ --repo-type dataset
```
In case the above link is unavailable, the example dataset can also be downloaded manually using this link.
Then decompress the file.
```
unzip data.zip
```
The data/ directory should now contain:
- example/template_pdb/: PDB files for 1,636 eukaryotic template enzymes.
- example/database_pdb/: PDB files for 4,213 bacterial enzymes.
- example/metadata.csv: Metadata for all the PDB files.
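To verify the decompressed dataset matches these counts, a minimal sketch (directory names taken from the listing above; the counting helper is our own) can tally the PDB files in each subdirectory:

```python
from pathlib import Path

def count_pdb_files(directory):
    """Count files with a .pdb suffix directly inside `directory`."""
    return sum(1 for _ in Path(directory).glob("*.pdb"))

if __name__ == "__main__":
    # Expected: 1,636 template files and 4,213 database files.
    for sub in ("template_pdb", "database_pdb"):
        d = Path("./data/example") / sub
        print(f"{sub}: {count_pdb_files(d)} PDB files")
```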
To perform retrieval using a group of template structures against a database using the default options:
```
python do_retrieval.py --template_pdb_dir ./data/example/template_pdb/ \
    --database_pdb_dir ./data/example/database_pdb/
```
More options can be viewed with the --help argument.
```
python do_retrieval.py --help
```
Note that if Flash Attention is installed, the --use_fa flag can be set to accelerate the process.
Currently, the model supports single-GPU or CPU retrieval, which can be selected by setting the environment variable CUDA_VISIBLE_DEVICES when running the script.
- For using a specific GPU device:
```
CUDA_VISIBLE_DEVICES="0" python do_retrieval.py ...
```
- For CPU-only inference:
```
CUDA_VISIBLE_DEVICES="" python do_retrieval.py ...
```
Results are saved to ./results/similarity.csv by default, a CSV file (loadable as a Pandas DataFrame) with the columns:
- eukaryota_id: Identifier for the template enzyme.
- bacteria_id: Identifier for the bacteria enzyme.
- distance: Euclidean distance between embeddings; lower distance indicates higher similarity.
Note that if multiple templates are used, the retrieval results for all templates will be saved and sorted together in one file. Users may separate them during further processing.
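For example, the combined results can be split per template with a pandas groupby. The column names below come from the output description above; the sample rows and identifiers are made up for illustration.

```python
import pandas as pd

# Made-up rows in the same shape as ./results/similarity.csv.
results = pd.DataFrame({
    "eukaryota_id": ["hE1", "hE1", "hE2", "hE2"],
    "bacteria_id":  ["bA", "bB", "bA", "bC"],
    "distance":     [0.12, 0.80, 0.55, 0.07],
})

# Lowest-distance (most similar) bacterial match per template enzyme:
# sort globally by distance, then keep the first row of each group.
best = (results.sort_values("distance")
               .groupby("eukaryota_id", as_index=False)
               .first())
print(best)
```

The same `groupby("eukaryota_id")` can be iterated to write one result file per template.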
If you use DEER or this codebase in your research, please cite our paper:
@article{liu2026exploring,
title={Exploring functional insights into the human gut microbiome via the structural proteome},
author={Liu, Hongbin and Shen, Juntao and Zhang, Zhiwei and Wang, Jiuming and Zhang, Chengxin and Zheng, Linggang and Ni, Haoran and Hong, Liang and Zhang, Jieqiong and Xue, Dongfang and others},
journal={Cell Host \& Microbe},
volume={34},
number={1},
pages={167--185},
year={2026},
publisher={Elsevier}
}
[1] Dao, Tri, et al. "Flashattention: Fast and memory-efficient exact attention with io-awareness." Advances in neural information processing systems 35 (2022): 16344-16359.
[2] Su, Jin, et al. "Saprot: Protein language modeling with structure-aware vocabulary." bioRxiv (2023): 2023-10.
[3] Lin, Zeming, et al. "Language models of protein sequences at the scale of evolution enable accurate structure prediction." bioRxiv (2022): 500902.