DLii-Research/dbtk-dnabert

An implementation of DNABERT using PyTorch and the deepbio-toolkit library.

Getting Started

  1. Install the dbtk-dnabert package
pip install dbtk-dnabert
  2. Pull the pre-trained DNABERT model
from dnabert import DnaBert

# Load the pre-trained model
model = DnaBert.from_pretrained("SirDavidLudwig/dnabert", revision="64d-silva16s-250bp")

Examples

Embed DNA sequences

import torch

# Sequences to embed
sequences = [
    "ACTGAATGAGAC",
    "TTGAGTAGCCAA"
]

# Tokenize sequences
sequence_tokens = torch.tensor([model.tokenizer(sequence) for sequence in sequences])

# Embed sequences
output = model(sequence_tokens)

# Sequence-level embeddings from the class token
class_embeddings = output["class"]

# Sequence-level embeddings from mean-pooled token embeddings
mean_embeddings = output["tokens"].mean(dim=1)
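With sequence-level embeddings in hand, a common next step is comparing sequences by cosine similarity. A minimal standard-library sketch, using toy vectors in place of real model output (the embedding values below are made up for illustration):

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-d embeddings standing in for rows of the model's output
emb_a = [0.1, 0.3, -0.2, 0.5]
emb_b = [0.1, 0.3, -0.2, 0.5]
emb_c = [-0.4, 0.0, 0.2, -0.1]

print(cosine_similarity(emb_a, emb_b))  # identical vectors give similarity ~1.0
print(cosine_similarity(emb_a, emb_c))
```

In practice you would call `cosine_similarity` on rows of the `class` (or mean-pooled `tokens`) output, e.g. after `embedding.tolist()`.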

Pre-trained Models

| Model Name | Embedding Dim. | Maximum Length | Pre-training Dataset |
|---|---|---|---|
| 64d-silva16s-250bp | 64 | 250bp | Silva 16S |
| 768d-silva16s-250bp | 768 | 250bp | Silva 16S |
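The model names above double as the revision strings passed to `DnaBert.from_pretrained`, as in the Getting Started example. A small illustrative lookup built from the table; the `PRETRAINED_REVISIONS` dictionary and `revision_for` helper here are hypothetical, not part of the package:

```python
# Hypothetical helper mapping the table above to from_pretrained revision names.
PRETRAINED_REVISIONS = {
    "64d-silva16s-250bp": {"embedding_dim": 64, "max_length_bp": 250},
    "768d-silva16s-250bp": {"embedding_dim": 768, "max_length_bp": 250},
}

def revision_for(embedding_dim):
    """Return the revision name for a given embedding dimension (hypothetical helper)."""
    for name, info in PRETRAINED_REVISIONS.items():
        if info["embedding_dim"] == embedding_dim:
            return name
    raise ValueError(f"no pre-trained model with embedding dim {embedding_dim}")

print(revision_for(768))  # -> 768d-silva16s-250bp
```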

Development

1. Model Configuration

Template model configurations can be generated using the dbtk model config command.

2. Pre-training

The model can be pre-trained using the supplied configurations with the command:

dbtk model fit \
    -c ./configs/datamodules/pretrain_silva_16s_250bp.yaml \
    -c ./configs/models/pretrain_dnabert_768d_250bp.yaml \
    -c ./configs/trainers/pretrainer.yaml \
    ./logs/dnabert_768d_250bp

3. Exporting

The trained model can be exported to a Hugging Face model with the following command:

dbtk model export ./logs/dnabert_768d_250bp/last.ckpt ./exports/dnabert_768d_250bp
