Protein language models (PLMs) are rapidly reshaping computational biology. As biological sequence and structure data scale up, NLP-style modeling techniques are becoming central to protein representation learning, prediction, and design.
This repository accompanies our systematic review of PLMs and provides a curated, model-centric knowledge base. It summarizes historical milestones, mainstream architectures, pretraining corpora, evaluation benchmarks, and practical toolchains. In addition to cataloging resources, we discuss key open challenges in this fast-moving area.
News
🌟 [2025/02] Our paper has been submitted to a preprint server.
We categorize PLMs into non-transformer-based and transformer-based families. Transformer models are further split into encoder-only, decoder-only, and encoder-decoder paradigms.
The tables below include publication links, release dates, parameter scales, backbone types, pretraining datasets, and open-source availability. Models are listed alphabetically within each category.
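To make the three transformer paradigms concrete, the minimal sketch below shows how encoder-only and decoder-only PLMs are typically loaded. It assumes the Hugging Face transformers library and uses two illustrative open checkpoints (ESM-2 and ProtGPT2); nothing in the tables requires this particular toolchain.

```python
from transformers import AutoModelForCausalLM, AutoModelForMaskedLM, AutoTokenizer

# Encoder-only: bidirectional attention, pretrained with masked language
# modeling; well suited to per-residue representations (e.g., the ESM family).
esm_tok = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
esm = AutoModelForMaskedLM.from_pretrained("facebook/esm2_t6_8M_UR50D")

# Decoder-only: causal attention, pretrained autoregressively; well suited to
# de novo sequence generation (e.g., ProtGPT2).
gpt_tok = AutoTokenizer.from_pretrained("nferruz/ProtGPT2")
gpt = AutoModelForCausalLM.from_pretrained("nferruz/ProtGPT2")

# Encoder-decoder models (T5-style PLMs such as ProstT5 or Ankh) load
# analogously via AutoModelForSeq2SeqLM.
```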
Depending on whether annotations are available, we divide datasets into two groups: pretraining datasets, which are typically unlabeled and used for self-supervised learning, and benchmarks, which are labeled and used for fine-tuning and evaluation.
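The split matters because the training signal differs: in self-supervised pretraining, labels are derived from the sequences themselves, most commonly by masking residues and training the model to recover them. The toy snippet below illustrates that masked-residue objective; the 15% rate and the `<mask>` token are conventional BERT-style choices, not specifics of any one PLM.

```python
import random

# Toy masked language modeling (MLM) setup: hide a fraction of residues and
# train the model to predict them from context. No annotations are required;
# the "labels" are the original residues themselves.
MASK_RATE = 0.15  # conventional BERT-style rate; individual PLMs vary

def mask_sequence(seq: str, mask_token: str = "<mask>") -> tuple[list[str], dict[int, str]]:
    """Randomly mask residues; return the corrupted sequence and the targets."""
    tokens = list(seq)
    targets = {}
    for i, residue in enumerate(tokens):
        if random.random() < MASK_RATE:
            targets[i] = residue    # ground truth comes from the data itself
            tokens[i] = mask_token  # the model must reconstruct this position
    return tokens, targets

corrupted, targets = mask_sequence("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
print(corrupted)  # e.g. ['M', 'K', '<mask>', 'A', ...]
print(targets)    # e.g. {2: 'T', ...}: positions the loss is computed on
```

Benchmarks, by contrast, supply external labels against which a pretrained model is fine-tuned or evaluated.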
The following tables summarize widely used resources in PLM research. Entries are listed alphabetically, with pretraining data grouped into sequence-based and structure-based resources, and benchmarks grouped into structural, functional, and other tasks.