
Protein-Language-Models

At the intersection of the rapidly growing biological data landscape and advancements in Natural Language Processing (NLP), protein language models (PLMs) have emerged as a transformative force in modern research. These models have achieved remarkable progress, highlighting the need for timely and comprehensive overviews. However, much of the existing literature focuses narrowly on specific domains, often missing a broader analysis of PLMs. This study provides a systematic review of PLMs from a macro perspective, covering key historical milestones and current mainstream trends. We focus on the models themselves and their evaluation metrics, exploring aspects such as model architectures, positional encoding, scaling laws, and datasets. In the evaluation section, we discuss benchmarks and downstream applications. To further support ongoing research, we introduce relevant mainstream tools. Lastly, we critically examine the key challenges and limitations in this rapidly evolving field.

News

Overview

This is the overview of our article.

Protein-Language-Models-Overview

Contents

Models

We categorize protein language models into two groups: non-transformer-based models and Transformer-based models. The Transformer-based models are further divided into three parts: encoder-only, decoder-only, and encoder-decoder models. The tables below provide related information on each model, including the paper link, release time, parameter count, base model, pretraining dataset, and whether the model is open source, along with a link to the code where available (models are sorted alphabetically by name).

Non-transformer-based models

| Model | Time | Params | Base model | Pretraining dataset | Code |
|---|---|---|---|---|---|
| CARP | 2024.02 | 600K-640M | CNN | UniRef50 | |
| MIF-ST | 2023.03 | 3.4M | GNN | CATH | |
| ProSE | 2021.06 | - | LSTM | UniRef90, SCOP | |
| ProtVec | 2015.11 | - | Skip-gram | UniProtKB/Swiss-Prot | × |
| ProtVecX | 2019.03 | - | ProtVec | UniRef50, UniProtKB/Swiss-Prot | × |
| SeqVec | 2019.12 | - | ELMo | UniRef50 | |
| Seq2vec | 2020.09 | - | CNN-LSTM | - | × |
| UDSMProt | 2020.01 | - | AWD-LSTM | UniProtKB/Swiss-Prot | × |
| UniRep | 2019.10 | - | mLSTM | UniRef50 | |
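The earliest model in this table, ProtVec, treats a protein as a sentence of 3-mer "words" and learns skip-gram embeddings over them. A minimal sketch of that decomposition (this variant emits all overlapping 3-mers; the original paper instead uses three shifted non-overlapping readings of the sequence):

```python
def three_mers(seq: str) -> list[str]:
    """Split a protein sequence into overlapping 3-mer 'words' (ProtVec-style)."""
    return [seq[i:i + 3] for i in range(len(seq) - 2)]

print(three_mers("MKTAYIA"))
# ['MKT', 'KTA', 'TAY', 'AYI', 'YIA']
```

The resulting 3-mer lists can be fed to any word2vec-style trainer in place of natural-language tokens.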

Transformer-based models

Encoder-only models

| Model | Time | Params | Pretraining dataset | Code |
|---|---|---|---|---|
| AbLang | 2022.06 | - | OAS | |
| AbLang2 | 2024.02 | - | OAS | |
| AMPLIFY | 2024.09 | 120M/350M | UniRef50, UniRef100, OAS, SCOP | |
| AminoBert | 2022.10 | - | UniRef90, PDB, MGnify | × |
| AntiBERTa | 2022.07 | 86M | OAS | |
| AntiBERTy | 2021.12 | 26M | OAS | |
| BALM | 2024.05 | - | OAS | |
| DistilProtBert | 2022.09 | 230M | UniRef50 | |
| ESM-1b | 2020.02 | 650M | UniRef50 | |
| ESM-1v | 2021.02 | 650M | UniRef90 | |
| ESM-2 | 2023.03 | 8M-15B | UniRef50 | |
| ESM-3 | 2024.07 | 98B | UniRef, MGnify, AlphaFoldDB, ESMAtlas | |
| ESM All-Atom | 2024.05 | 35M | AlphaFoldDB | |
| ESM-C | 2024.12 | 300M, 600M, 6B | UniRef, MGnify, JGI | × |
| ESM-GearNet | 2023.10 | - | AlphaFoldDB | |
| ESM-MSA-1b | 2021.02 | 100M | UniRef50 | |
| IgBert | 2024.12 | 420M | OAS | × |
| LM-GVP | 2022.04 | - | - | |
| OntoProtein | 2022.06 | - | ProteinKG25 | |
| PeTriBERT | 2022.08 | 40M | AlphaFoldDB | × |
| PMLM | 2021.10 | 87M-731M | UniRef50, Pfam | × |
| PRoBERTa | 2020.09 | 44M | UniProtKB/Swiss-Prot | |
| PromptProtein | 2023.02 | 650M | UniRef50, PDB | |
| ProteinBERT | 2022.03 | 16M | UniRef90 | |
| ProteinLM | 2021.12 | 200M/3B | Pfam | |
| ProtFlash | 2023.10 | 79M/174M | UniRef50 | |
| ProtTrans | 2021.07 | - | UniRef, BFD | |
| SaProt | 2023.10 | 650M | AlphaFoldDB, PDB | |
| TCR-BERT | 2021.11 | 100M | PIRD, VDJdb, TCRdb, murine LCMV GP33 | |
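Encoder-only models such as the ESM family are generally pretrained with masked-language modeling: a fraction of residues is hidden and the model reconstructs them from bidirectional context. A minimal sketch of the corruption step (the 15% rate and `<mask>` token follow BERT/ESM conventions; real implementations also mix in random-token and keep-unchanged replacements):

```python
import random

def mask_sequence(seq, mask_token="<mask>", rate=0.15, seed=0):
    """BERT/ESM-style corruption: hide a fraction of residues and record
    their original identities as prediction targets."""
    rng = random.Random(seed)
    tokens, targets = list(seq), {}
    for i, aa in enumerate(tokens):
        if rng.random() < rate:
            targets[i] = aa          # the model must recover this residue
            tokens[i] = mask_token
    return tokens, targets

tokens, targets = mask_sequence("MKTAYIAKQRQISFVK")
```

During pretraining, a cross-entropy loss is applied only at the masked positions recorded in `targets`.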

Decoder-only models

| Model | Time | Params | Pretraining dataset | Code |
|---|---|---|---|---|
| DARK | 2022.01 | 128M | - | × |
| IgLM | 2022.12 | 13M | - | |
| PoET | 2023.11 | 57M-604M | - | × |
| ProGen | 2020.03 | 1.2B | UniParc, UniProtKB/Swiss-Prot | |
| ProGen2 | 2023.10 | 151M-6.4B | UniRef90, BFD30, PDB | |
| ProLLaMA | 2024.02 | - | UniRef50 | |
| ProtGPT2 | 2021.01 | 738M | UniRef50 | |
| RITA | 2022.05 | 1.2B | UniRef100 | × |
| ZymCTRL | 2022.01 | 738M | BRENDA | |
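Decoder-only models such as ProtGPT2 and ProGen are trained autoregressively: at each position the model predicts the next residue given only the prefix, which is what makes them natural sequence generators. A toy illustration of the (prefix, target) training pairs (the `<bos>` start token name is illustrative):

```python
def causal_pairs(seq, bos="<bos>"):
    """Decoder-only (GPT-style) training pairs: at each step the model
    sees the prefix and must predict the next residue."""
    tokens = [bos] + list(seq)
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

print(causal_pairs("MKT"))
# [(['<bos>'], 'M'), (['<bos>', 'M'], 'K'), (['<bos>', 'M', 'K'], 'T')]
```

In practice all pairs for one sequence are scored in a single forward pass with a causal attention mask rather than materialized like this.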

Encoder-decoder models

| Model | Time | Params | Pretraining dataset | Code |
|---|---|---|---|---|
| Ankh | 2023.01 | 450M/1.15B | UniRef50 | Link |
| IgT5 | 2024.12 | 3B | OAS | × |
| LM-Design | 2023.02 | 664M | - | × |
| MSA-Augmenter | 2023.06 | 260M | UniRef50 | Link |
| ProSST | 2024.05 | 110M | AlphaFoldDB, CATH | Link |
| ProstT5 | 2023.07 | 3B | AlphaFoldDB, PDB | Link |
| ProtT5 | 2022.06 | 3B/11B | UniRef50, BFD | Link |
| pAbT5 | 2023.10 | - | OAS | × |
| Sapiens | 2022.02 | 0.6M | OAS | Link |
| SS-pLM | 2023.08 | 14.8M | UniRef50 | × |
| xTrimoPGLM | 2023.07 | 100B | UniRef90, ColabFoldDB | × |
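Encoder-decoder models such as ProtT5 and Ankh are typically pretrained with span denoising: contiguous spans are replaced by sentinel tokens in the encoder input, and the decoder reconstructs the dropped spans. A sketch assuming the corrupted spans are already chosen (the `<extra_id_k>` sentinel naming follows T5; random span sampling is omitted):

```python
def span_corrupt(seq, spans):
    """T5-style span denoising. `spans` is a list of sorted, non-overlapping
    (start, end) half-open intervals. Returns (encoder input, decoder target):
    each span is replaced by a sentinel in the input, and the target lists
    each sentinel followed by the residues it replaced."""
    src, tgt, prev = [], [], 0
    for k, (s, e) in enumerate(spans):
        src.extend(seq[prev:s])
        src.append(f"<extra_id_{k}>")
        tgt.append(f"<extra_id_{k}>")
        tgt.extend(seq[s:e])
        prev = e
    src.extend(seq[prev:])
    return src, tgt

src, tgt = span_corrupt("MKTAYIAKQR", [(2, 4), (7, 9)])
```

Here `src` is `['M', 'K', '<extra_id_0>', 'Y', 'I', 'A', '<extra_id_1>', 'R']` and the decoder learns to emit the missing residues after each sentinel.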

Datasets

Protein datasets fall into two categories depending on whether they include annotations: pre-training datasets and benchmarks. Pre-training datasets lack labels and are typically used for self-supervised pre-training, whereas benchmarks contain labeled data and are used for supervised fine-tuning or model evaluation. The tables below provide the relevant papers and links for the pre-training datasets and benchmarks of popular protein language models (both are sorted alphabetically by name). The pre-training datasets are divided into sequence datasets and structural datasets, and the benchmarks into structural, functional, and other benchmarks.

Pre-training datasets

Sequence datasets

| Dataset | Time | Scale | Link |
|---|---|---|---|
| BFD [1,2,3] | 2021.07 | 2.5B | |
| BRENDA | 2002.01 | - | |
| MGnify | 2022.12 | - | |
| Pfam | 2023.09 | 47M | |
| UniClust30 | 2016.11 | - | |
| UniParc | 2023.11 | 632M | |
| UniProtKB/Swiss-Prot | 2023.11 | 570K | |
| UniProtKB/TrEMBL | 2023.11 | 251M | |
| UniRef50 [1,2] | 2023.11 | 53M | |
| UniRef90 [1,2] | 2023.11 | 150M | |
| UniRef100 [1,2] | 2023.11 | 314M | |
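Sequence datasets such as UniRef and UniProtKB/Swiss-Prot are distributed as FASTA files: a `>` header line followed by wrapped sequence lines. A minimal reader, shown on the first residues of the real UniProt entry P69905 (human hemoglobin alpha):

```python
def parse_fasta(text):
    """Minimal FASTA reader: returns {header: sequence} for records in the
    format used by UniRef / UniProtKB downloads."""
    records, header, chunks = {}, None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:        # flush the previous record
                records[header] = "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)           # sequence may span many lines
    if header is not None:
        records[header] = "".join(chunks)
    return records

example = ">sp|P69905|HBA_HUMAN\nMVLSPADKTN\nVKAAWGKVGA\n"
print(parse_fasta(example))
# {'sp|P69905|HBA_HUMAN': 'MVLSPADKTNVKAAWGKVGA'}
```

For datasets at UniRef scale, a streaming parser (yielding one record at a time from a file handle) is preferable to building the whole dictionary in memory.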

Structural datasets

| Dataset | Time | Scale | Link |
|---|---|---|---|
| AlphaFoldDB [1,2] | 2021.11 | 200M | |
| PDB | 2023.12 | 214K | |

Benchmarks

Structural benchmarks

| Dataset | Time | Scale | Link |
|---|---|---|---|
| CAMEO | - | - | |
| CASP | - | - | |
| CATH | 2023.02 | 151M | |
| SCOP | 2023.01 | 914K | |

Functional benchmarks

| Dataset | Time | Scale | Link |
|---|---|---|---|
| CAFA | - | - | |
| EC | 2023.11 | 2.6M | |
| FLIP | 2022.01 | 320K | |
| GO | 2023.11 | 1.5M | |

Other benchmarks

| Dataset | Time | Scale | Link |
|---|---|---|---|
| PEER | 2022.11 | 390K | |
| ProteinGym | 2022.12 | 300K | |
| TAPE | 2021.09 | 120K | |

Tools

We provide links to commonly used protein tools in the tables below for readers to use (tools are sorted alphabetically by name). The tools are divided into sequence tools, structural tools, and other tools.

Sequence tools

| Tool | Link |
|---|---|
| BLAST | |
| HHblits & HHfilter | |
| MMseqs2 | |

Structural tools

| Tool | Link |
|---|---|
| Foldseek | |
| PyMOL | |
| TM-align | |

Other tools

| Tool | Link |
|---|---|
| t-SNE | |
| UMAP | |