# Protein-Language-Models

Protein language models (PLMs) are rapidly reshaping computational biology. As biological sequence and structure data scale up, NLP-style modeling techniques are becoming central to protein representation learning, prediction, and design.

This repository accompanies our systematic review of PLMs and provides a curated, model-centric knowledge base. It summarizes historical milestones, mainstream architectures, pretraining corpora, evaluation benchmarks, and practical toolchains. In addition to cataloging resources, we discuss key open challenges in this fast-moving area.

## News

## Quick Start

- Read the paper preprint: [arXiv:2502.06881](https://arxiv.org/abs/2502.06881)
- Browse the curated PLM lists in the Models, Datasets, and Tools sections below.
- Use this repository as a reference index when selecting models, corpora, or evaluation suites for new projects.
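As a starting point, the snippet below sketches how one of the cataloged encoder models can be used to embed a protein sequence. It is a minimal sketch, assuming the `torch` and `transformers` packages are installed; the `facebook/esm2_t33_650M_UR50D` checkpoint name is one public ESM-2 release used here purely as an example, and any encoder from the Models tables with a released checkpoint could be substituted.

```python
# Minimal sketch: embedding a protein sequence with a pretrained PLM.
# Assumes `pip install torch transformers` and access to the Hugging
# Face Hub. Swap in e.g. facebook/esm2_t6_8M_UR50D for a lighter download.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "facebook/esm2_t33_650M_UR50D"  # 650M-parameter ESM-2
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # toy amino-acid sequence
inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool per-residue representations into one fixed-size vector
# (special tokens are included here for simplicity).
embedding = outputs.last_hidden_state.mean(dim=1)
print(embedding.shape)  # torch.Size([1, 1280])
```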

## Overview

The figure below summarizes the scope of the review.

*(Figure: Protein-Language-Models overview.)*

## Contents

### Models

We categorize PLMs into non-transformer-based and transformer-based families. Transformer models are further split into encoder-only, decoder-only, and encoder-decoder paradigms.

The tables below record each model's release date, parameter scale, backbone type, pretraining dataset(s), and open-source code availability. Models are listed alphabetically within each category.
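To make the paradigm split concrete, the sketch below shows the masked-language-modeling objective that encoder-only models such as ESM-2 are pretrained with: hide a residue and predict it from bidirectional context. Decoder-only models such as ProtGPT2 instead predict the next residue left to right. This is a minimal illustration assuming `transformers` and the public `facebook/esm2_t6_8M_UR50D` checkpoint, not a claim about any one model's exact training code.

```python
# Minimal sketch of masked-residue prediction, the pretraining task of
# encoder-only PLMs. Assumes `pip install torch transformers`; the
# checkpoint below is the smallest public ESM-2 release.
import torch
from transformers import AutoTokenizer, EsmForMaskedLM

model_name = "facebook/esm2_t6_8M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = EsmForMaskedLM.from_pretrained(model_name)
model.eval()

# Mask one residue in a toy sequence and let the model fill it in.
sequence = "MKTAYIAK<mask>RQISFVKSHFSRQ"
inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
top5 = logits[0, mask_pos].topk(5).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top5))  # most likely residues at the mask
```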

#### Non-transformer-based models

| Model | Time | Params | Base model | Pretraining Dataset | Code |
|---|---|---|---|---|---|
| CARP | 2024.02 | 600K-640M | CNN | UniRef50 | |
| MIF-ST | 2023.03 | 3.4M | GNN | CATH | |
| ProSE | 2021.06 | - | LSTM | UniRef90, SCOP | |
| ProtVec | 2015.11 | - | Skip-gram | UniProtKB/Swiss-Prot | × |
| ProtVecX | 2019.03 | - | ProtVec | UniRef50, UniProtKB/Swiss-Prot | × |
| SeqVec | 2019.12 | - | ELMo | UniRef50 | |
| Seq2vec | 2020.09 | - | CNN-LSTM | - | × |
| UDSMProt | 2020.01 | - | AWD-LSTM | UniProtKB/Swiss-Prot | × |
| UniRep | 2019.10 | - | mLSTM | UniRef50 | |

#### Transformer-based models

##### Encoder-only models

| Model | Time | Params | Pretraining Dataset | Code |
|---|---|---|---|---|
| AbLang | 2022.06 | - | OAS | |
| AbLang2 | 2024.02 | - | OAS | |
| AMPLIFY | 2024.09 | 120M/350M | UniRef50, UniRef100, OAS, SCOP | |
| AminoBERT | 2022.10 | - | UniRef90, PDB, MGnify | × |
| AntiBERTa | 2022.07 | 86M | OAS | |
| AntiBERTy | 2021.12 | 26M | OAS | |
| BALM | 2024.05 | - | OAS | |
| DistilProtBert | 2022.09 | 230M | UniRef50 | |
| ESM-1b | 2020.02 | 650M | UniRef50 | |
| ESM-1v | 2021.02 | 650M | UniRef90 | |
| ESM-2 | 2023.03 | 8M-15B | UniRef50 | |
| ESM-3 | 2024.07 | 98B | UniRef, MGnify, AlphaFoldDB, ESMAtlas | |
| ESM All-Atom | 2024.05 | 35M | AlphaFoldDB | |
| ESM-C | 2024.12 | 300M/600M/6B | UniRef, MGnify, JGI | × |
| ESM-GearNet | 2023.10 | - | AlphaFoldDB | |
| ESM-MSA-1b | 2021.02 | 100M | UniRef50 | |
| IgBert | 2024.12 | 420M | OAS | × |
| LM-GVP | 2022.04 | - | - | |
| OntoProtein | 2022.06 | - | ProteinKG25 | |
| PeTriBERT | 2022.08 | 40M | AlphaFoldDB | × |
| PMLM | 2021.10 | 87M-731M | UniRef50, Pfam | × |
| PRoBERTa | 2020.09 | 44M | UniProtKB/Swiss-Prot | |
| PromptProtein | 2023.02 | 650M | UniRef50, PDB | |
| ProteinBERT | 2022.03 | 16M | UniRef90 | |
| ProteinLM | 2021.12 | 200M/3B | Pfam | |
| ProtFlash | 2023.10 | 79M/174M | UniRef50 | |
| ProtTrans | 2021.07 | - | UniRef, BFD | |
| SaProt | 2023.10 | 650M | AlphaFoldDB, PDB | |
| TCR-BERT | 2021.11 | 100M | PIRD, VDJdb, TCRdb, murine LCMV GP33 | |

##### Decoder-only models

| Model | Time | Params | Pretraining Dataset | Code |
|---|---|---|---|---|
| DARK | 2022.01 | 128M | - | × |
| IgLM | 2022.12 | 13M | - | |
| PoET | 2023.11 | 57M-604M | - | × |
| ProGen | 2020.03 | 1.2B | UniParc, UniProtKB/Swiss-Prot | |
| ProGen2 | 2023.10 | 151M-6.4B | UniRef90, BFD30, PDB | |
| ProLLaMA | 2024.02 | - | UniRef50 | |
| ProtGPT2 | 2021.01 | 738M | UniRef50 | |
| RITA | 2022.05 | 1.2B | UniRef100 | × |
| ZymCTRL | 2022.01 | 738M | BRENDA | |

##### Encoder-decoder models

| Model | Time | Params | Pretraining Dataset | Code |
|---|---|---|---|---|
| Ankh | 2023.01 | 450M/1.15B | UniRef50 | Link |
| IgT5 | 2024.12 | 3B | OAS | × |
| LM-Design | 2023.02 | 664M | - | × |
| MSA-Augmenter | 2023.06 | 260M | UniRef50 | Link |
| ProSST | 2024.05 | 110M | AlphaFoldDB, CATH | Link |
| ProstT5 | 2023.07 | 3B | AlphaFoldDB, PDB | Link |
| ProtT5 | 2022.06 | 3B/11B | UniRef50, BFD | Link |
| pAbT5 | 2023.10 | - | OAS | × |
| Sapiens | 2022.02 | 0.6M | OAS | Link |
| SS-pLM | 2023.08 | 14.8M | UniRef50 | × |
| xTrimoPGLM | 2023.07 | 100B | UniRef90, ColabFoldDB | × |

### Datasets

Depending on whether annotations are available, datasets are divided into pretraining datasets and benchmarks:

- Pretraining datasets are typically unlabeled and used for self-supervised learning.
- Benchmarks are labeled and used for fine-tuning and evaluation.

The following tables summarize widely used resources in PLM research. Entries are listed alphabetically and grouped into sequence/structural pretraining data and structural/functional/other benchmarks.

#### Pre-training datasets

##### Sequence datasets

| Dataset | Time | Scale | Link |
|---|---|---|---|
| BFD[1,2,3] | 2021.07 | 2.5B | |
| BRENDA | 2002.01 | - | |
| MGnify | 2022.12 | - | |
| Pfam | 2023.09 | 47M | |
| UniClust30 | 2016.11 | - | |
| UniParc | 2023.11 | 632M | |
| UniProtKB/Swiss-Prot | 2023.11 | 570K | |
| UniProtKB/TrEMBL | 2023.11 | 251M | |
| UniRef50[1,2] | 2023.11 | 53M | |
| UniRef90[1,2] | 2023.11 | 150M | |
| UniRef100[1,2] | 2023.11 | 314M | |
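Most of the sequence corpora above are distributed through UniProt, which also exposes a REST API for individual entries. The hedged sketch below fetches a single reviewed UniProtKB/Swiss-Prot record; the accession `P69905` (human hemoglobin subunit alpha) is just an illustrative example, and bulk pretraining corpora such as UniRef are instead downloaded as release files from the same site.

```python
# Hedged sketch: fetching one reviewed UniProtKB/Swiss-Prot entry in
# FASTA format over the UniProt REST API. Requires network access only;
# uses the standard library.
import urllib.request

accession = "P69905"  # example accession: human hemoglobin subunit alpha
url = f"https://rest.uniprot.org/uniprotkb/{accession}.fasta"
with urllib.request.urlopen(url) as response:
    fasta = response.read().decode("utf-8")

print(fasta.splitlines()[0])                            # FASTA header line
print(len("".join(fasta.splitlines()[1:])), "residues")  # sequence length
```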

##### Structural datasets

| Dataset | Time | Scale | Link |
|---|---|---|---|
| AlphaFoldDB[1,2] | 2021.11 | 200M | |
| PDB | 2023.12 | 214K | |

#### Benchmarks

##### Structural benchmarks

| Dataset | Time | Scale | Link |
|---|---|---|---|
| CAMEO | - | - | |
| CASP | - | - | |
| CATH | 2023.02 | 151M | |
| SCOP | 2023.01 | 914K | |

##### Functional benchmarks

| Dataset | Time | Scale | Link |
|---|---|---|---|
| CAFA | - | - | |
| EC | 2023.11 | 2.6M | |
| FLIP | 2022.01 | 320K | |
| GO | 2023.11 | 1.5M | |

##### Other benchmarks

| Dataset | Time | Scale | Link |
|---|---|---|---|
| PEER | 2022.11 | 390K | |
| ProteinGym | 2022.12 | 300K | |
| TAPE | 2021.09 | 120K | |

### Tools

We also list commonly used protein research tools, grouped into sequence, structural, and other utilities. Entries are sorted alphabetically.

#### Sequence tools

| Tool | Link |
|---|---|
| BLAST | |
| HHblits & HHfilter | |
| MMseqs2 | |

#### Structural tools

| Tool | Link |
|---|---|
| Foldseek | |
| PyMOL | |
| TM-align | |

#### Other tools

| Tool | Link |
|---|---|
| t-SNE | |
| UMAP | |
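As an example of the "other" utilities, t-SNE and UMAP are typically used to project high-dimensional PLM embeddings to 2-D for visualization. Below is a minimal sketch assuming `scikit-learn` (and optionally `umap-learn`) is installed; the random `embeddings` array is a placeholder for real model output, such as the mean-pooled vectors from the Quick Start sketch.

```python
# Minimal sketch: projecting per-sequence PLM embeddings to 2-D.
# `embeddings` stands in for an (n_sequences, hidden_size) array.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 1280)).astype(np.float32)  # placeholder

coords = TSNE(n_components=2, perplexity=30.0, random_state=0).fit_transform(embeddings)
print(coords.shape)  # (100, 2), ready for a scatter plot

# UMAP alternative (requires `pip install umap-learn`):
# import umap
# coords = umap.UMAP(n_components=2, random_state=0).fit_transform(embeddings)
```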
