Sergey Vilov and Matthias Heinig
Foundation models, such as DNABERT and the Nucleotide Transformer, have recently shaped a new direction in DNA research. Trained in an unsupervised manner on vast quantities of genomic data, they can be used for a variety of downstream tasks, such as promoter prediction, DNA methylation prediction, gene network prediction, and functional variant prioritization. However, these models are often trained and evaluated on entire genomes, neglecting the partitioning of the genome into distinct functional regions. In our study, we investigate the efficacy of various unsupervised approaches, including genome-wide and 3’UTR-specific foundation models, on human 3’UTR regions. To this end, we train a set of popular transformer architectures on a 3’UTR-specific dataset comprising 3,783,714 3’UTR sequences (6.6B bp) from 241 Zoonomia species. Our evaluation includes downstream tasks specific to RNA biology: recognition of binding motifs of RNA-binding proteins, detection of functional genetic variants, prediction of expression levels in massively parallel reporter assays, and estimation of mRNA half-life. Remarkably, models trained specifically on 3’UTR sequences outperform established genome-wide foundation models in three out of four downstream tasks. Our results underscore the importance of considering genome partitioning into distinct functional regions when training and evaluating foundation models. In addition, the proposed set of 3’UTR-specific tasks can be used for benchmarking future models.
- `rbp_motifs` : evaluate the models on RBP binding motif prediction (TASK 1)
- `variant_effect` : evaluate the models on variants from ClinVar, gnomAD, eQTL, and CADD (TASK 2)
- `mpra` : predict MPRA activity using data from Griesemer et al. (2021) and Siegel et al. (2022) (TASK 3)
- `half_life` : predict mRNA half-life using data from Agarwal and Kelley (2022) (TASK 4; see the probe sketch below)
- `dataset_prep` : build the multispecies dataset from the Zoonomia whole-genome alignment
- `embeddings` : generate embeddings for the DNABERT, DNABERT-2, and NT models, as well as embeddings and per-base zero-shot scores for the StateSpace models
- `zero-shot-probs` : derive per-base zero-shot scores for DNABERT, NT, PhyloP, and CADD (see the sketch right after this list)
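For illustration, here is a minimal sketch of per-base zero-shot scoring with a masked language model: each position is masked in turn and the model's probability of the observed token is recorded. The checkpoint name is an assumption (any HuggingFace masked-LM checkpoint can be substituted), and k-mer tokenizers such as DNABERT's span several bases per token, so mapping token probabilities back to single bases requires extra bookkeeping not shown here.

```python
# Sketch of zero-shot scoring with a masked LM -- not the repo's exact code.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

checkpoint = "InstaDeepAI/nucleotide-transformer-500m-human-ref"  # assumption
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint).eval()

sequence = "ACGTACGTAGCTAGCTAGGCTAAGCT"  # toy 3'UTR fragment
enc = tokenizer(sequence, return_tensors="pt", return_special_tokens_mask=True)
ids = enc["input_ids"][0]
special = enc["special_tokens_mask"][0].bool()

scores = []
with torch.no_grad():
    for pos in torch.where(~special)[0]:
        masked = ids.clone()
        masked[pos] = tokenizer.mask_token_id  # mask one token at a time
        logits = model(input_ids=masked.unsqueeze(0)).logits[0, pos]
        probs = torch.softmax(logits, dim=-1)
        scores.append(probs[ids[pos]].item())  # probability of the observed token
```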
The analysis data, scores for all models, and model weights can be found in our Zenodo repository.
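The MPRA and half-life tasks (TASK 3 and TASK 4) can be approached by probing sequence embeddings with a lightweight regressor. Below is a minimal sketch of that pattern, assuming mean pooling over the last hidden layer and a Ridge probe; the checkpoint name and the choice of regressor are illustrative assumptions, not necessarily the exact protocol used in the paper.

```python
# Sketch: mean-pooled sequence embeddings feeding a linear probe (illustrative).
import numpy as np
import torch
from sklearn.linear_model import Ridge
from transformers import AutoModel, AutoTokenizer

checkpoint = "InstaDeepAI/nucleotide-transformer-500m-human-ref"  # assumption
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint).eval()

def embed(seq: str) -> np.ndarray:
    """Mean-pool the last hidden layer over all tokens of one sequence."""
    enc = tokenizer(seq, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (tokens, dim)
    return hidden.mean(dim=0).numpy()

# toy data: two sequences with hypothetical measured activities
seqs = ["ACGTACGTAGCTAGCT", "TTGCAAGCTTGCAAGC"]
y = np.array([0.3, 1.2])
X = np.stack([embed(s) for s in seqs])
probe = Ridge(alpha=1.0).fit(X, y)
```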
Fig. 1: ROC AUC scores for RBP binding motif predictions
Fig. S1: Distribution of 3’UTR length for 18,134 transcripts of the human genome
- Create a new conda environment:

```
conda create -n lm-3utr-models python=3.10
conda activate lm-3utr-models
```

- Install PyTorch v2.0.1.

- Install the other requirements using pip:

```
pip install -r requirements.txt
```

- To train DNABERT-2 models, also install:

```
pip install triton==2.0.0.dev20221202 --force --no-dependencies
```

Note that training DNABERT-2 is currently only possible on NVIDIA A100 GPUs due to the flash attention implementation it employs.
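A quick sanity check of the installation (a minimal sketch; adjust to your setup):

```python
# Verify the PyTorch install and the available GPU.
import torch

print(torch.__version__)          # expect 2.0.1 for this repo
print(torch.cuda.is_available())  # True if a CUDA GPU is visible
if torch.cuda.is_available():
    # DNABERT-2 training needs an A100 with the pinned flash attention build
    print(torch.cuda.get_device_name(0))
```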