Efficient, scalable protein sequence search and embedding using Databricks, Spark, and ESM2 models.
Protein embeddings and search at scale. We use ESM2 models to embed sequence data from two sources, PDB and UniRef50. These datasets are quite large; UniRef50 contains in excess of 60M sequences. We show how to use Spark to efficiently ingest large single-file text datasets into Delta tables for downstream processing, how to embed sequences with Protein Language Models (PLMs) at scale using multiple GPUs on Databricks, and finally how to build and search large-scale vector indices. For the vector search capability, we use Storage Optimized Vector Search on Databricks. We additionally provide code to build a small frontend UI hosted on Databricks Apps.
- Scalable protein embedding with ESM2 models and Spark
- Efficient ingestion of large datasets (PDB, UniRef50) into Delta tables
- Distributed embedding using Databricks GPU clusters
- Large-scale vector search with Databricks Vector Search (standard & storage-optimized)
- Optional web UI for interactive search and alignment (Gradio, Databricks Apps)
- Configurable pipeline via `config.yaml`
Before running the notebooks, edit `config.yaml` to customize:

```yaml
unity_catalog:
  catalog: "your_catalog"              # Your Unity Catalog name
  schema: "protein_search"             # Schema to store tables/models

steps_included:
  ingest_pdb100: true                  # Download PDB100 dataset (~1M sequences)
  ingest_ur50: false                   # Download UniRef50 (~70M sequences)
  sampling_percent_ur50: 5             # Sample % of UniRef50 for testing

embed_defaults:
  models:                              # Choose ESM model(s)
    - "facebook/esm2_t6_8M_UR50D"      # Fast, less accurate
    - "facebook/esm2_t30_150M_UR50D"   # Balanced size/performance for search, slower inference
```

💡 Tip: Start with PDB100 and the 8M model to validate your setup, then scale up.
- Clone this repository to your Databricks workspace
- Edit `config.yaml` to set your catalog/schema and choose datasets (PDB100 and/or UniRef50)
- Run the numbered notebooks in order (see detailed steps below)
- Access your search app via Databricks Apps
Estimated total time: 1-3 hours depending on dataset size and model choice
| Step | Script | Purpose | Compute Type | Time Estimate |
|---|---|---|---|---|
| 0 | `00_download_datasets.py` | Download PDB100/UniRef50 datasets | Single-node (16GB RAM) | 15-30 min |
| 1 | `01_register_esm_models.py` | Register ESM2 models to Unity Catalog | Single-node (16GB RAM) | 5-10 min |
| 2 | `02_process_raw_protein_datasets.py` | Parse FASTA files into Delta tables | Multi-node CPU (4x 4-core) | 10-15 min |
| 3 | `03_embed_protein_datasets_aiquery.py` OR `03a_embed_protein_datasets_pandasudf.py` OR `SGC_embedding/` | Generate embeddings | See below | See below |
| 4 | `04_build_vectorstores.py` | Create vector search indices | Serverless or single-node | 10-20 min |
| 5 | `05_search.py` | Test vector search queries | Serverless or single-node | < 5 min |
| 6 | `06_create_app_with_sdk.py` | Deploy Gradio app | Serverless | 5-10 min |
You have three options for generating embeddings:
Option A: `03_embed_protein_datasets_aiquery.ipynb` (Recommended for small datasets and for beginners)
- ✅ Simplest to use and maintain
- ✅ Auto-scaling and managed infrastructure
- ✅ Best for getting started quickly
- Use when: You want the easiest path to completion
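For orientation, a hedged sketch of what an `ai_query`-based embedding pass can look like; the endpoint and table names below are placeholders, not the repo's actual ones:

```python
# Sketch only: assumes an ESM2 embedding model has been deployed to a Model
# Serving endpoint named "esm2_embedder" (hypothetical), and that step 2
# produced a Delta table with a `sequence` column.
embedded = spark.sql("""
    SELECT
        id,
        sequence,
        ai_query('esm2_embedder', sequence) AS embedding
    FROM your_catalog.protein_search.pdb_sequences
""")
embedded.write.mode("overwrite").saveAsTable(
    "your_catalog.protein_search.pdb_embeddings"
)
```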
Option B: `03a_embed_protein_datasets_pandasudf.py`
- ✅ More control over GPU allocation (while still balancing Spark JVM resource needs)
- ✅ Can be wrapped in a streaming Spark workload if desired
- ⚠️ Requires manual GPU cluster setup (multi-GPU T4 or better)
- ⚠️ Needs tuning of batch sizes and partitions
- Use when: You need fine-grained control or are processing very large datasets
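A simplified sketch of the pandas UDF pattern; the model name, batch size, and table names are illustrative, and the notebook's actual implementation handles GPU assignment and tuning:

```python
# Sketch only: distributed ESM2 embedding via a Spark pandas UDF.
import pandas as pd
import torch
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import ArrayType, FloatType
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "facebook/esm2_t6_8M_UR50D"  # illustrative; see config.yaml
BATCH = 32                                # illustrative; tune for your GPU

@pandas_udf(ArrayType(FloatType()))
def embed_udf(sequences: pd.Series) -> pd.Series:
    # Loaded per Arrow batch here for simplicity; a real implementation
    # would cache the model per worker.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModel.from_pretrained(MODEL_NAME).to(device).eval()

    out = []
    for start in range(0, len(sequences), BATCH):
        batch = sequences.iloc[start:start + BATCH].tolist()
        inputs = tokenizer(batch, padding=True, truncation=True,
                           max_length=1024, return_tensors="pt").to(device)
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state
        # Mean-pool residue embeddings into one vector per sequence.
        mask = inputs["attention_mask"].unsqueeze(-1)
        pooled = (hidden * mask).sum(1) / mask.sum(1)
        out.extend(pooled.float().cpu().numpy().tolist())
    return pd.Series(out)

df = spark.table("your_catalog.protein_search.pdb_sequences")  # hypothetical
(df.withColumn("embedding", embed_udf("sequence"))
   .write.mode("overwrite")
   .saveAsTable("your_catalog.protein_search.pdb_embeddings"))
```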
Option C: `SGC_embedding/` notebooks (Advanced)
- ✅ Highest throughput for large-scale processing
- ✅ Uses MosaicML Streaming for optimal data loading
- ⚠️ Most complex; requires serverless GPU compute (beta)
- ⚠️ Additional setup with MDS datasets
- Use when: You are processing 10M+ sequences or need maximum performance
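For context, MosaicML Streaming expects data in MDS format. A minimal sketch of converting a Delta table of sequences to MDS, with placeholder paths; the `SGC_embedding/` notebooks contain the actual conversion:

```python
# Sketch only: assumes the mosaicml-streaming package is installed
# (`pip install mosaicml-streaming`). Paths and table names are hypothetical.
from streaming import MDSWriter

columns = {"id": "str", "sequence": "str"}  # MDS column name -> encoding
out_path = "/Volumes/your_catalog/protein_search/mds/pdb"

with MDSWriter(out=out_path, columns=columns) as writer:
    # toLocalIterator streams rows to the driver one partition at a time.
    for row in spark.table("your_catalog.protein_search.pdb_sequences").toLocalIterator():
        writer.write({"id": row["id"], "sequence": row["sequence"]})
```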
So that this example runs end-to-end fairly quickly, we use the ESM2 8M model by default.
- This model is not highly accurate; in our testing we've found the 150M model to be a good balance of speed and performance.
- Other models with higher throughput exist and may be integrated here (please open an issue if you have any requests).
Databricks Vector Search offers two endpoint types. Choose based on your dataset size:
Standard Mode
- Best for: < 10M embeddings
- Latency: ~20-50ms
- Cost: Lower for smaller datasets
- Use case: PDB100, small-scale projects
Storage Optimized Mode (Used in this repo)
- Best for: > 10M embeddings
- Latency: ~300-500ms
- Cost: More cost-effective at scale
- Use case: Full UniRef50, large protein databases
Learn more: See Vector Search best practices for detailed performance benchmarks and scaling guidance.
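A hedged sketch of provisioning a storage-optimized endpoint and index with the `databricks-vectorsearch` client; names are placeholders, the 320-dim size matches ESM2-8M, and `04_build_vectorstores.py` is the authoritative version:

```python
# Sketch only: create a storage-optimized endpoint, sync an index from the
# embeddings Delta table, then query it. Names/paths are hypothetical.
from databricks.vector_search.client import VectorSearchClient

client = VectorSearchClient()

client.create_endpoint(
    name="protein-search-endpoint",
    endpoint_type="STORAGE_OPTIMIZED",  # or "STANDARD" for smaller datasets
)

index = client.create_delta_sync_index(
    endpoint_name="protein-search-endpoint",
    index_name="your_catalog.protein_search.pdb_embeddings_index",
    source_table_name="your_catalog.protein_search.pdb_embeddings",
    pipeline_type="TRIGGERED",
    primary_key="id",
    embedding_dimension=320,            # hidden size of esm2_t6_8M_UR50D
    embedding_vector_column="embedding",
)

# After the initial sync completes, embed the query sequence with the same
# ESM2 model and search.
results = index.similarity_search(
    query_vector=[0.0] * 320,           # placeholder; use a real embedding
    columns=["id", "sequence"],
    num_results=10,
)
```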
Current implementation: Searches full-length protein sequences
Alternative approach: Split proteins into overlapping subsequences (chunks) before embedding
Tradeoffs:
| Approach | Pros | Cons |
|---|---|---|
| Full-length (current) | ✅ Simpler implementation<br>✅ Captures global structure | |
| Chunked sequences | ✅ Better local similarity<br>✅ Finds conserved domains | |
💬 Interested in chunking? We're considering adding this feature. Vote or comment on the issue to help us prioritize!
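For illustration only (this is not implemented in the repo), chunking could look something like the sketch below; window and stride sizes are arbitrary:

```python
# Hypothetical helper: split a protein into overlapping subsequences so each
# chunk can be embedded separately for better local similarity.
def chunk_sequence(sequence: str, window: int = 256, stride: int = 128):
    """Yield (start_offset, subsequence) pairs covering the whole protein."""
    if len(sequence) <= window:
        yield 0, sequence
        return
    for start in range(0, len(sequence) - window + 1, stride):
        yield start, sequence[start:start + window]
    # Cover the tail if the last full window did not reach the end.
    if (len(sequence) - window) % stride:
        yield len(sequence) - window, sequence[-window:]

chunks = list(chunk_sequence("M" * 600))  # toy sequence; yields 4 chunks
```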
See the LICENSE file.
In this repo we use several open source packages, models, and datasets, and we are thankful to those who developed them. We list the packages and datasets below, noting that datasets and models are optionally downloaded by the user when running the notebooks and are not packaged with this repo directly.
- PDB100: We optionally download PDB100 (seqres) from wwPDB.org, which releases data under its policy with data available under the CC0 1.0 Universal (CC0 1.0) Public Domain Dedication.
- UniRef50: We optionally download UniRef50 from UniProt under their licensing terms; this data is under the Creative Commons Attribution 4.0 International (CC BY 4.0) License. We thank the developers of UniProt. For more details, see: The UniProt Consortium, UniProt: the Universal Protein Knowledgebase in 2025, Nucleic Acids Research, 53 (2025).
We currently use the ESM models, downloaded from Hugging Face and available under the MIT license. The models used in this repo can be swapped for any model in the ESM2 family, but we focused on the ones below to keep the example code quick to build and run, since runtime grows with model size.
| Checkpoint name | Num layers | Num parameters | Source |
|---|---|---|---|
| esm2_t30_150M_UR50D | 30 | 150M | HuggingFace |
| esm2_t12_35M_UR50D | 12 | 35M | HuggingFace |
| esm2_t6_8M_UR50D | 6 | 8M | HuggingFace |
A list of packages used that are not part of the Python standard library is included below. In addition, in `app/src/msa.py` we package some modified code from AlphaFold (2.3.2), which is under the Apache 2.0 license, with changes stated in the file. The app also uses HMMER's jackhmmer binary for multiple sequence alignment and the Calm Seafoam Gradio theme, which is under the Apache 2.0 license.
| Tool | License | Purpose | Source |
|---|---|---|---|
| HMMER (jackhmmer) | BSD-3-Clause | Multiple sequence alignment in app | http://hmmer.org/ |
Note: HMMER is installed via apt-get during notebook execution and is not included in this repository. The jackhmmer binary is copied to Volumes for use by the Databricks app.
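A rough sketch of that setup step, assuming a notebook with access to a Unity Catalog Volume; the Volume path is hypothetical and the actual notebook cells may differ:

```python
# Sketch only: install HMMER on the driver and stage jackhmmer in a Volume
# so the Databricks app can call it.
import shutil
import subprocess

subprocess.run(["sudo", "apt-get", "install", "-y", "hmmer"], check=True)

src = shutil.which("jackhmmer")  # typically /usr/bin/jackhmmer after install
dst = "/Volumes/your_catalog/protein_search/binaries/jackhmmer"  # hypothetical
shutil.copy(src, dst)
```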
- Open an issue
