An implementation of DNABERT using PyTorch and the deepbio-toolkit library.
- Install the dbtk-dnabert package

```bash
pip install dbtk-dnabert
```

- Pull the pre-trained DNABERT model

```python
from dnabert import DnaBert

# Load the pre-trained model
model = DnaBert.from_pretrained("SirDavidLudwig/dnabert", revision="64d-silva16s-250bp")
```

- Embed DNA sequences

```python
import torch

# Sequences to embed
sequences = [
"ACTGAATGAGAC",
"TTGAGTAGCCAA"
]
# Tokenize sequences
sequence_tokens = torch.tensor([model.tokenizer(sequence) for sequence in sequences])
# Embed sequences
output = model(sequence_tokens)
# Sequence-level embeddings from class token
embeddings = output["class"]
# Sequence-level embeddings from averaged tokens
embeddings = output["tokens"].mean(dim=1)
```
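
The sequence-level embeddings are ordinary tensors and can be used directly for downstream comparisons. A minimal sketch, assuming `embeddings` is a `(num_sequences, embedding_dim)` float tensor as produced above:

```python
import torch.nn.functional as F

# Compare the two example sequences via their embeddings
# (assumes `embeddings` is a (num_sequences, embedding_dim) float tensor)
similarity = F.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(f"Cosine similarity: {similarity.item():.3f}")
```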
Available pre-trained models:

| Model Name | Embedding Dim. | Maximum Length | Pre-training Dataset |
|---|---|---|---|
| 64d-silva16s-250bp | 64 | 250bp | Silva 16S |
| 768d-silva16s-250bp | 768 | 250bp | Silva 16S |
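
The model names in the table correspond to the `revision` argument shown earlier, so a different variant can be pulled by changing that string; for example, the 768-dimensional model:

```python
from dnabert import DnaBert

# Pull the 768-dimensional variant by its revision tag
model = DnaBert.from_pretrained("SirDavidLudwig/dnabert", revision="768d-silva16s-250bp")
```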
Template model configurations can be generated using the `dbtk model config` command.
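
A hypothetical invocation, assuming the command prints the template YAML to stdout (the actual CLI arguments may differ):

```bash
# Capture a template model config for later editing
# (assumes the command writes YAML to stdout; the real interface may differ)
dbtk model config > ./configs/models/my_dnabert.yaml
```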
The model can be pre-trained using the supplied configurations with the command:
```bash
dbtk model fit \
    -c ./configs/datamodules/pretrain_silva_16s_250bp.yaml \
    -c ./configs/models/pretrain_dnabert_768d_250bp.yaml \
    -c ./configs/trainers/pretrainer.yaml \
    ./logs/dnabert_768d_250bp
```

The trained model can be exported to a Hugging Face model with the following command:

```bash
dbtk model export ./logs/dnabert_768d_250bp/last.ckpt ./exports/dnabert_768d_250bp
```
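
Assuming the export directory follows the standard Hugging Face layout, the exported model can then be loaded locally the same way as the hosted checkpoint (a sketch; `from_pretrained` accepting a local path is an assumption):

```python
from dnabert import DnaBert

# Load the exported model from the local export directory
# (assumes from_pretrained accepts a local path, as Hugging Face-style APIs typically do)
model = DnaBert.from_pretrained("./exports/dnabert_768d_250bp")
```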