Investigating the performance of foundation models on human 3’UTR sequences

Sergey Vilov and Matthias Heinig

Foundation models, such as DNABERT and Nucleotide Transformer, have recently shaped a new direction in DNA research. Trained in an unsupervised manner on a vast quantity of genomic data, they can be used for a variety of downstream tasks, such as promoter prediction, DNA methylation prediction, gene network prediction, or functional variant prioritization. However, these models are often trained and evaluated on entire genomes, neglecting the partitioning of the genome into different functional regions. In our study, we investigate the efficacy of various unsupervised approaches, including genome-wide and 3’UTR-specific foundation models, on human 3’UTR regions. To this end, we train a set of popular transformer architectures on a 3’UTR-specific dataset comprising 3,783,714 3’UTR sequences (6.6B bp) of 241 Zoonomia species. Our evaluation includes downstream tasks specific to RNA biology, such as recognition of binding motifs of RNA binding proteins, detection of functional genetic variants, prediction of expression levels in massively parallel reporter assays, and estimation of mRNA half-life. Remarkably, models specifically trained on 3’UTR sequences demonstrate superior performance when compared to established genome-wide foundation models in three out of four downstream tasks. Our results underscore the importance of considering genome partitioning into distinct functional regions when training and evaluating foundation models. In addition, the proposed set of 3’UTR-specific tasks can be used to benchmark future models.

Code for data preprocessing and analysis

rbp_motifs : evaluate the models on RBP binding motif prediction (TASK 1)
variant_effect : evaluate the models on variants from ClinVar, gnomAD, eQTL, and CADD (TASK 2)
mpra : prediction of MPRA activity from (Griesemer et al., 2021) and (Siegel et al., 2022) (TASK 3)
half_life : prediction of mRNA half-life from (Agarwal and Kelley, 2022) (TASK 4)
dataset_prep : build the multispecies dataset from Zoonomia whole genome alignment
embeddings : generate embeddings for the DNABERT, DNABERT-2, and NT models, as well as embeddings and per-base zero-shot scores for StateSpace models
zero-shot-probs : derive per-base zero-shot scores for DNABERT, NT, PhyloP, and CADD models
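
As a rough illustration of what a per-base zero-shot score can look like, the sketch below extracts the probability a model assigns to the reference base at each position (the reference allele probability, pref, mentioned in the Fig. S3 caption). The function name `reference_allele_probs`, the A/C/G/T column order, and the input layout are assumptions made for this example; they are not taken from the scripts in zero-shot-probs.

```python
# Hypothetical sketch: per-base reference allele probability (pref).
# Assumes probs[i] holds the model's probabilities for A, C, G, T
# (in that order) at position i of the sequence.

BASES = "ACGT"


def reference_allele_probs(ref_seq, probs):
    """Return the probability assigned to the reference base at each position."""
    if len(ref_seq) != len(probs):
        raise ValueError("sequence and probability table differ in length")
    scores = []
    for base, position_probs in zip(ref_seq.upper(), probs):
        if base not in BASES:
            # Ambiguous bases such as N have no reference probability here.
            scores.append(None)
            continue
        scores.append(position_probs[BASES.index(base)])
    return scores


if __name__ == "__main__":
    example_probs = [
        [0.7, 0.1, 0.1, 0.1],
        [0.2, 0.5, 0.2, 0.1],
    ]
    print(reference_allele_probs("AC", example_probs))
```

A low reference allele probability at a position means the model finds the observed base unlikely given its context.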

The analysis data, scores for all models, and model weights can be found in our Zenodo repository

Links to the scripts used to generate paper figures and tables:

Fig. 1: ROC AUC scores for RBP binding motif predictions

Fig. 2: ROC curves for prediction of proxy-functional variants on ClinVar, gnomAD, eQTL, and CADD data using the best predictor for each model

Fig. 3: Pearson r correlation coefficient between mRNA half-life prediction and ground truth data from (Agarwal and Kelley, 2022)

Fig. S1: Distribution of 3’UTR length for 18,134 transcripts of the human genome

Fig. S2: Pearson r correlation between per-nucleotide probabilities predicted by each model and the ground truth probability for the Zoonomia dataset (Zoo-AL)

Fig. S3: Difference between ROC AUC scores based on the variant influence score (VIS) and the reference allele probability (pref), as a function of the maximum window W around the variant used to compute VIS

Table 1: Pearson r correlation coefficient between Ridge-based predictions from sequence embeddings and ground truth MPRA expression from (Griesemer et al., 2021)

Table S2: ROC AUC scores for RBP binding motif predictions, for all motifs, proxy-functional motifs within the top 10% conservation, proxy-functional motifs within the bottom 10% conservation, as predicted by PhyloP-241way

Table S3: ROC AUC scores for ClinVar, gnomAD, eQTL, and CADD data computed based on zero-shot functionality scores for all models

Table S4: ROC AUC scores from MLP-based prediction of proxy-functional variants on ClinVar, gnomAD, eQTL, and CADD data using language model embeddings

Table S5: ROC AUC scores from prediction of proxy-functional variants on ClinVar, gnomAD, eQTL, and CADD data using alignment-based models

Table S6: Pearson r correlation coefficient between SVR-based predictions from sequence embeddings and ground truth MPRA activity from (Griesemer et al., 2021)

Table S7: Pearson r correlation coefficient between Ridge-based predictions from sequence embeddings and ground truth MPRA data from (Siegel et al., 2022)

Table S8: Pearson r correlation coefficient between SVR-based predictions from sequence embeddings and ground truth MPRA data from (Siegel et al., 2022)

Table S9: Pearson r correlation coefficient between mRNA half-life prediction and ground truth data from (Agarwal and Kelley, 2022), using different 3’UTR embeddings
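
The figures and tables above report two metrics: ROC AUC for the classification-style tasks and the Pearson r correlation coefficient for the regression-style tasks. The following dependency-free sketch shows how both can be computed from labels, scores, and predictions. It is meant only to make the metrics concrete; the function names are illustrative, and the repository's own scripts may rely on library implementations instead.

```python
# Illustrative, dependency-free versions of the two reported metrics.
import math


def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
    print(pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

The pairwise AUC loop is quadratic in the number of examples, which keeps the definition readable; for large variant sets a rank-based implementation would be the more practical choice.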

Installation

Create a new conda environment:

conda create -n lm-3utr-models python=3.10
conda activate lm-3utr-models

Install PyTorch v2.0.1.
Install the other requirements using pip:

pip install -r requirements.txt

To train DNABERT-2 models, also install Triton:

pip install triton==2.0.0.dev20221202 --force --no-dependencies

Training of DNABERT-2 is currently only possible on NVIDIA A100 GPUs due to the employed flash attention implementation.
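
After installation, a quick check along the following lines can confirm that the environment matches the requirements stated above. This is a minimal sketch, not part of the repository: the helper names are hypothetical, the version check assumes an exact match with 2.0.1, and the GPU check simply looks for "A100" in the device name reported by PyTorch.

```python
# Hypothetical post-installation check; helper names are not from the repository.
import importlib.util

REQUIRED_TORCH = "2.0.1"  # version named in the installation steps


def torch_version_ok(version, required=REQUIRED_TORCH):
    """Compare versions after dropping local build tags such as '+cu118'."""
    return version.split("+")[0] == required


def gpu_supports_dnabert2_training(gpu_name):
    """The README states that DNABERT-2 training currently needs an NVIDIA A100."""
    return "A100" in gpu_name


def check_environment():
    """Collect a small report; works even when PyTorch is not installed."""
    report = {"torch_installed": importlib.util.find_spec("torch") is not None}
    if report["torch_installed"]:
        import torch

        report["torch_version_ok"] = torch_version_ok(torch.__version__)
        if torch.cuda.is_available():
            name = torch.cuda.get_device_name(0)
            report["gpu"] = name
            report["dnabert2_training_ok"] = gpu_supports_dnabert2_training(name)
    return report


if __name__ == "__main__":
    print(check_environment())
```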
