# Protein-Language-Models

Protein language models (PLMs) are rapidly reshaping computational biology. As biological sequence and structure data scale up, NLP-style modeling techniques are becoming central to protein representation learning, prediction, and design.

This repository accompanies our systematic review of PLMs and provides a curated, model-centric knowledge base. It summarizes historical milestones, mainstream architectures, pretraining corpora, evaluation benchmarks, and practical toolchains. In addition to cataloging resources, we discuss key open challenges in this fast-moving area.

## News

## Quick Start

- Read the paper preprint: [arXiv:2502.06881](https://arxiv.org/abs/2502.06881)
- Browse the curated PLM lists in the Models, Datasets, and Tools sections below.
- Use this repository as a reference index when selecting models, corpora, or evaluation suites for new projects.
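As a starting point, the snippet below sketches how one of the cataloged encoder models can be used to embed a protein sequence. It is a minimal sketch, assuming the `torch` and `transformers` packages are installed; the `facebook/esm2_t33_650M_UR50D` checkpoint name is one public ESM-2 release used here purely as an example, and any encoder from the Models tables with a released checkpoint could be substituted.

```python
# Minimal sketch: embedding a protein sequence with a pretrained PLM.
# Assumes `pip install torch transformers` and access to the Hugging
# Face Hub. Swap in e.g. facebook/esm2_t6_8M_UR50D for a lighter download.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "facebook/esm2_t33_650M_UR50D"  # 650M-parameter ESM-2
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # toy amino-acid sequence
inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool per-residue representations into one fixed-size vector
# (special tokens are included here for simplicity).
embedding = outputs.last_hidden_state.mean(dim=1)
print(embedding.shape)  # torch.Size([1, 1280])
```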

## Overview

The figure below summarizes the scope of the review.

*(Figure: Protein-Language-Models overview.)*

## Contents

### Models

We categorize PLMs into non-transformer-based and transformer-based families. Transformer models are further split into encoder-only, decoder-only, and encoder-decoder paradigms.

The tables below record each model's release date, parameter scale, backbone type, pretraining dataset(s), and open-source code availability. Models are listed alphabetically within each category.
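To make the paradigm split concrete, the sketch below shows the masked-language-modeling objective that encoder-only models such as ESM-2 are pretrained with: hide a residue and predict it from bidirectional context. Decoder-only models such as ProtGPT2 instead predict the next residue left to right. This is a minimal illustration assuming `transformers` and the public `facebook/esm2_t6_8M_UR50D` checkpoint, not a claim about any one model's exact training code.

```python
# Minimal sketch of masked-residue prediction, the pretraining task of
# encoder-only PLMs. Assumes `pip install torch transformers`; the
# checkpoint below is the smallest public ESM-2 release.
import torch
from transformers import AutoTokenizer, EsmForMaskedLM

model_name = "facebook/esm2_t6_8M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = EsmForMaskedLM.from_pretrained(model_name)
model.eval()

# Mask one residue in a toy sequence and let the model fill it in.
sequence = "MKTAYIAK<mask>RQISFVKSHFSRQ"
inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
top5 = logits[0, mask_pos].topk(5).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top5))  # most likely residues at the mask
```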

#### Non-transformer-based models

| Model | Time | Params | Base model | Pretraining Dataset | Code |
|---|---|---|---|---|---|
| CARP | 2024.02 | 600K-640M | CNN | UniRef50 | |
| MIF-ST | 2023.03 | 3.4M | GNN | CATH | |
| ProSE | 2021.06 | - | LSTM | UniRef90, SCOP | |
| ProtVec | 2015.11 | - | Skip-gram | UniProtKB/Swiss-Prot | × |
| ProtVecX | 2019.03 | - | ProtVec | UniRef50, UniProtKB/Swiss-Prot | × |
| SeqVec | 2019.12 | - | ELMo | UniRef50 | |
| Seq2vec | 2020.09 | - | CNN-LSTM | - | × |
| UDSMProt | 2020.01 | - | AWD-LSTM | UniProtKB/Swiss-Prot | × |
| UniRep | 2019.10 | - | mLSTM | UniRef50 | |

#### Transformer-based models

##### Encoder-only models

| Model | Time | Params | Pretraining Dataset | Code |
|---|---|---|---|---|
| AbLang | 2022.06 | - | OAS | |
| AbLang2 | 2024.02 | - | OAS | |
| AMPLIFY | 2024.09 | 120M/350M | UniRef50, UniRef100, OAS, SCOP | |
| AminoBERT | 2022.10 | - | UniRef90, PDB, MGnify | × |
| AntiBERTa | 2022.07 | 86M | OAS | |
| AntiBERTy | 2021.12 | 26M | OAS | |
| BALM | 2024.05 | - | OAS | |
| DistilProtBert | 2022.09 | 230M | UniRef50 | |
| ESM-1b | 2020.02 | 650M | UniRef50 | |
| ESM-1v | 2021.02 | 650M | UniRef90 | |
| ESM-2 | 2023.03 | 8M-15B | UniRef50 | |
| ESM-3 | 2024.07 | 98B | UniRef, MGnify, AlphaFoldDB, ESMAtlas | |
| ESM All-Atom | 2024.05 | 35M | AlphaFoldDB | |
| ESM-C | 2024.12 | 300M/600M/6B | UniRef, MGnify, JGI | × |
| ESM-GearNet | 2023.10 | - | AlphaFoldDB | |
| ESM-MSA-1b | 2021.02 | 100M | UniRef50 | |
| IgBert | 2024.12 | 420M | OAS | × |
| LM-GVP | 2022.04 | - | - | |
| OntoProtein | 2022.06 | - | ProteinKG25 | |
| PeTriBERT | 2022.08 | 40M | AlphaFoldDB | × |
| PMLM | 2021.10 | 87M-731M | UniRef50, Pfam | × |
| PRoBERTa | 2020.09 | 44M | UniProtKB/Swiss-Prot | |
| PromptProtein | 2023.02 | 650M | UniRef50, PDB | |
| ProteinBERT | 2022.03 | 16M | UniRef90 | |
| ProteinLM | 2021.12 | 200M/3B | Pfam | |
| ProtFlash | 2023.10 | 79M/174M | UniRef50 | |
| ProtTrans | 2021.07 | - | UniRef, BFD | |
| SaProt | 2023.10 | 650M | AlphaFoldDB, PDB | |
| TCR-BERT | 2021.11 | 100M | PIRD, VDJdb, TCRdb, murine LCMV GP33 | |

##### Decoder-only models

| Model | Time | Params | Pretraining Dataset | Code |
|---|---|---|---|---|
| DARK | 2022.01 | 128M | - | × |
| IgLM | 2022.12 | 13M | - | |
| PoET | 2023.11 | 57M-604M | - | × |
| ProGen | 2020.03 | 1.2B | UniParc, UniProtKB/Swiss-Prot | |
| ProGen2 | 2023.10 | 151M-6.4B | UniRef90, BFD30, PDB | |
| ProLLaMA | 2024.02 | - | UniRef50 | |
| ProtGPT2 | 2021.01 | 738M | UniRef50 | |
| RITA | 2022.05 | 1.2B | UniRef100 | × |
| ZymCTRL | 2022.01 | 738M | BRENDA | |

##### Encoder-decoder models

| Model | Time | Params | Pretraining Dataset | Code |
|---|---|---|---|---|
| Ankh | 2023.01 | 450M/1.15B | UniRef50 | Link |
| IgT5 | 2024.12 | 3B | OAS | × |
| LM-Design | 2023.02 | 664M | - | × |
| MSA-Augmenter | 2023.06 | 260M | UniRef50 | Link |
| ProSST | 2024.05 | 110M | AlphaFoldDB, CATH | Link |
| ProstT5 | 2023.07 | 3B | AlphaFoldDB, PDB | Link |
| ProtT5 | 2022.06 | 3B/11B | UniRef50, BFD | Link |
| pAbT5 | 2023.10 | - | OAS | × |
| Sapiens | 2022.02 | 0.6M | OAS | Link |
| SS-pLM | 2023.08 | 14.8M | UniRef50 | × |
| xTrimoPGLM | 2023.07 | 100B | UniRef90, ColabFoldDB | × |

### Datasets

Depending on whether annotations are available, datasets are divided into pretraining datasets and benchmarks:

- Pretraining datasets are typically unlabeled and used for self-supervised learning.
- Benchmarks are labeled and used for fine-tuning and evaluation.

The following tables summarize widely used resources in PLM research. Entries are listed alphabetically and grouped into sequence/structural pretraining data and structural/functional/other benchmarks.

#### Pre-training datasets

##### Sequence datasets

| Dataset | Time | Scale | Link |
|---|---|---|---|
| BFD[1,2,3] | 2021.07 | 2.5B | |
| BRENDA | 2002.01 | - | |
| MGnify | 2022.12 | - | |
| Pfam | 2023.09 | 47M | |
| UniClust30 | 2016.11 | - | |
| UniParc | 2023.11 | 632M | |
| UniProtKB/Swiss-Prot | 2023.11 | 570K | |
| UniProtKB/TrEMBL | 2023.11 | 251M | |
| UniRef50[1,2] | 2023.11 | 53M | |
| UniRef90[1,2] | 2023.11 | 150M | |
| UniRef100[1,2] | 2023.11 | 314M | |
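Most of the sequence corpora above are distributed through UniProt, which also exposes a REST API for individual entries. The hedged sketch below fetches a single reviewed UniProtKB/Swiss-Prot record; the accession `P69905` (human hemoglobin subunit alpha) is just an illustrative example, and bulk pretraining corpora such as UniRef are instead downloaded as release files from the same site.

```python
# Hedged sketch: fetching one reviewed UniProtKB/Swiss-Prot entry in
# FASTA format over the UniProt REST API. Requires network access only;
# uses the standard library.
import urllib.request

accession = "P69905"  # example accession: human hemoglobin subunit alpha
url = f"https://rest.uniprot.org/uniprotkb/{accession}.fasta"
with urllib.request.urlopen(url) as response:
    fasta = response.read().decode("utf-8")

print(fasta.splitlines()[0])                            # FASTA header line
print(len("".join(fasta.splitlines()[1:])), "residues")  # sequence length
```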

##### Structural datasets

| Dataset | Time | Scale | Link |
|---|---|---|---|
| AlphaFoldDB[1,2] | 2021.11 | 200M | |
| PDB | 2023.12 | 214K | |

#### Benchmarks

##### Structural benchmarks

| Dataset | Time | Scale | Link |
|---|---|---|---|
| CAMEO | - | - | |
| CASP | - | - | |
| CATH | 2023.02 | 151M | |
| SCOP | 2023.01 | 914K | |

##### Functional benchmarks

| Dataset | Time | Scale | Link |
|---|---|---|---|
| CAFA | - | - | |
| EC | 2023.11 | 2.6M | |
| FLIP | 2022.01 | 320K | |
| GO | 2023.11 | 1.5M | |

##### Other benchmarks

| Dataset | Time | Scale | Link |
|---|---|---|---|
| PEER | 2022.11 | 390K | |
| ProteinGym | 2022.12 | 300K | |
| TAPE | 2021.09 | 120K | |

### Tools

We also list commonly used protein research tools, grouped into sequence, structural, and other utilities. Entries are sorted alphabetically.

#### Sequence tools

| Tool | Link |
|---|---|
| BLAST | |
| HHblits & HHfilter | |
| MMseqs2 | |

#### Structural tools

| Tool | Link |
|---|---|
| Foldseek | |
| PyMOL | |
| TM-align | |

#### Other tools

| Tool | Link |
|---|---|
| t-SNE | |
| UMAP | |
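As an example of the "other" utilities, t-SNE and UMAP are typically used to project high-dimensional PLM embeddings to 2-D for visualization. Below is a minimal sketch assuming `scikit-learn` (and optionally `umap-learn`) is installed; the random `embeddings` array is a placeholder for real model output, such as the mean-pooled vectors from the Quick Start sketch.

```python
# Minimal sketch: projecting per-sequence PLM embeddings to 2-D.
# `embeddings` stands in for an (n_sequences, hidden_size) array.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 1280)).astype(np.float32)  # placeholder

coords = TSNE(n_components=2, perplexity=30.0, random_state=0).fit_transform(embeddings)
print(coords.shape)  # (100, 2), ready for a scatter plot

# UMAP alternative (requires `pip install umap-learn`):
# import umap
# coords = umap.UMAP(n_components=2, random_state=0).fit_transform(embeddings)
```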
