Efficient, scalable protein sequence search and embedding using Databricks, Spark, and ESM2 models.
Protein embeddings and search at scale. We use ESM2 models to embed sequence data from two sources, PDB and UniRef50. These datasets are quite large; UniRef50 contains in excess of 60M sequences. We show how to use Spark to efficiently ingest large single-file text datasets into Delta tables for downstream processing, how to embed sequences with Protein Language Models (PLMs) at scale using multiple GPUs on Databricks, and finally how to build and search large-scale vector indices. For the vector search capability, we use Storage Optimized Vector Search on Databricks. We additionally provide code to build a small frontend UI hosted on Databricks Apps.
- Scalable protein embedding with ESM2 models and Spark
- Efficient ingestion of large datasets (PDB, UniRef50) into Delta tables
- Distributed embedding using Databricks GPU clusters
- Large-scale vector search with Databricks Vector Search (standard & storage-optimized)
- Optional web UI for interactive search and alignment (Gradio, Databricks Apps)
- Configurable pipeline via `config.yaml`
Before running the notebooks, edit `config.yaml` to customize:

```yaml
unity_catalog:
  catalog: "your_catalog"              # Your Unity Catalog name
  schema: "protein_search"             # Schema to store tables/models

steps_included:
  ingest_pdb100: true                  # Download PDB100 dataset (~1M sequences)
  ingest_ur50: false                   # Download UniRef50 (~70M sequences)
  sampling_percent_ur50: 5             # Sample % of UniRef50 for testing

embed_defaults:
  models:                              # Choose ESM model(s)
    - "facebook/esm2_t6_8M_UR50D"      # Fast, less accurate
    - "facebook/esm2_t30_150M_UR50D"   # Balanced size/performance for search, slower inference
```

💡 Tip: Start with PDB100 and the 8M model to validate your setup, then scale up.
- Clone this repository to your Databricks workspace
- Edit `config.yaml` to set your catalog/schema and choose datasets (PDB100 and/or UniRef50)
- Run the numbered notebooks in order (see detailed steps below)
- Access your search app via Databricks Apps
Estimated total time: 1-3 hours depending on dataset size and model choice
| Step | Script | Purpose | Compute Type | Time Estimate |
|---|---|---|---|---|
| 0 | `00_download_datasets.py` | Download PDB100/UniRef50 datasets | Single-node (16GB RAM) | 15-30 min |
| 1 | `01_register_esm_models.py` | Register ESM2 models to Unity Catalog | Single-node (16GB RAM) | 5-10 min |
| 2 | `02_process_raw_protein_datasets.py` | Parse FASTA files into Delta tables | Multi-node CPU (4x 4-core) | 10-15 min |
| 3 | `03_embed_protein_datasets_aiquery.py` OR `03a_embed_protein_datasets_pandasudf.py` OR `SGC_embedding/` | Generate embeddings | See below | See below |
| 4 | `04_build_vectorstores.py` | Create vector search indices | Serverless or single-node | 10-20 min |
| 5 | `05_search.py` | Test vector search queries | Serverless or single-node | < 5 min |
| 6 | `06_create_app_with_sdk.py` | Deploy Gradio app | Serverless | 5-10 min |
You have three options for generating embeddings:
Option A: `03_embed_protein_datasets_aiquery.ipynb` (Recommended for small datasets and for beginners)
- ✅ Simplest to use and maintain
- ✅ Auto-scaling and managed infrastructure
- ✅ Best for getting started quickly
- Use when: You want the easiest path to completion
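For orientation, a hedged sketch of what an `ai_query`-based embedding pass can look like; the endpoint and table names below are placeholders, not the repo's actual ones:

```python
# Sketch only: assumes an ESM2 embedding model has been deployed to a Model
# Serving endpoint named "esm2_embedder" (hypothetical), and that step 2
# produced a Delta table with a `sequence` column.
embedded = spark.sql("""
    SELECT
        id,
        sequence,
        ai_query('esm2_embedder', sequence) AS embedding
    FROM your_catalog.protein_search.pdb_sequences
""")
embedded.write.mode("overwrite").saveAsTable(
    "your_catalog.protein_search.pdb_embeddings"
)
```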
Option B: `03a_embed_protein_datasets_pandasudf.py`
- ✅ More control over GPU allocation (while still balancing Spark JVM resource needs)
- ✅ Can be wrapped in a streaming Spark workload if desired
- ⚠️ Requires manual GPU cluster setup (multi-GPU T4 or better)
- ⚠️ Needs tuning of batch sizes and partitions
- Use when: You need fine-grained control or are processing very large datasets
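A simplified sketch of the pandas UDF pattern; the model name, batch size, and table names are illustrative, and the notebook's actual implementation handles GPU assignment and tuning:

```python
# Sketch only: distributed ESM2 embedding via a Spark pandas UDF.
import pandas as pd
import torch
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import ArrayType, FloatType
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "facebook/esm2_t6_8M_UR50D"  # illustrative; see config.yaml
BATCH = 32                                # illustrative; tune for your GPU

@pandas_udf(ArrayType(FloatType()))
def embed_udf(sequences: pd.Series) -> pd.Series:
    # Loaded per Arrow batch here for simplicity; a real implementation
    # would cache the model per worker.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModel.from_pretrained(MODEL_NAME).to(device).eval()

    out = []
    for start in range(0, len(sequences), BATCH):
        batch = sequences.iloc[start:start + BATCH].tolist()
        inputs = tokenizer(batch, padding=True, truncation=True,
                           max_length=1024, return_tensors="pt").to(device)
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state
        # Mean-pool residue embeddings into one vector per sequence.
        mask = inputs["attention_mask"].unsqueeze(-1)
        pooled = (hidden * mask).sum(1) / mask.sum(1)
        out.extend(pooled.float().cpu().numpy().tolist())
    return pd.Series(out)

df = spark.table("your_catalog.protein_search.pdb_sequences")  # hypothetical
(df.withColumn("embedding", embed_udf("sequence"))
   .write.mode("overwrite")
   .saveAsTable("your_catalog.protein_search.pdb_embeddings"))
```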
Option C: `SGC_embedding/` notebooks (Advanced)
- ✅ Highest throughput for large-scale processing
- ✅ Uses MosaicML Streaming for optimal data loading
- ⚠️ Most complex; requires serverless GPU compute (beta)
- ⚠️ Additional setup with MDS datasets
- Use when: You are processing 10M+ sequences or need maximum performance
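For context, MosaicML Streaming expects data in MDS format. A minimal sketch of converting a Delta table of sequences to MDS, with placeholder paths; the `SGC_embedding/` notebooks contain the actual conversion:

```python
# Sketch only: assumes the mosaicml-streaming package is installed
# (`pip install mosaicml-streaming`). Paths and table names are hypothetical.
from streaming import MDSWriter

columns = {"id": "str", "sequence": "str"}  # MDS column name -> encoding
out_path = "/Volumes/your_catalog/protein_search/mds/pdb"

with MDSWriter(out=out_path, columns=columns) as writer:
    # toLocalIterator streams rows to the driver one partition at a time.
    for row in spark.table("your_catalog.protein_search.pdb_sequences").toLocalIterator():
        writer.write({"id": row["id"], "sequence": row["sequence"]})
```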
So that this example runs end-to-end fairly quickly, we use the ESM2 8M model by default.
- This model is not highly accurate; in our testing we've found the 150M model to be a good balance of speed and performance.
- Other models with higher throughput exist and may be integrated here (please open an issue if you have any requests).
Databricks Vector Search offers two endpoint types. Choose based on your dataset size:
Standard Mode
- Best for: < 10M embeddings
- Latency: ~20-50ms
- Cost: Lower for smaller datasets
- Use case: PDB100, small-scale projects
Storage Optimized Mode (Used in this repo)
- Best for: > 10M embeddings
- Latency: ~300-500ms
- Cost: More cost-effective at scale
- Use case: Full UniRef50, large protein databases
Learn more: See Vector Search best practices for detailed performance benchmarks and scaling guidance.
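A hedged sketch of provisioning a storage-optimized endpoint and index with the `databricks-vectorsearch` client; names are placeholders, the 320-dim size matches ESM2-8M, and `04_build_vectorstores.py` is the authoritative version:

```python
# Sketch only: create a storage-optimized endpoint, sync an index from the
# embeddings Delta table, then query it. Names/paths are hypothetical.
from databricks.vector_search.client import VectorSearchClient

client = VectorSearchClient()

client.create_endpoint(
    name="protein-search-endpoint",
    endpoint_type="STORAGE_OPTIMIZED",  # or "STANDARD" for smaller datasets
)

index = client.create_delta_sync_index(
    endpoint_name="protein-search-endpoint",
    index_name="your_catalog.protein_search.pdb_embeddings_index",
    source_table_name="your_catalog.protein_search.pdb_embeddings",
    pipeline_type="TRIGGERED",
    primary_key="id",
    embedding_dimension=320,            # hidden size of esm2_t6_8M_UR50D
    embedding_vector_column="embedding",
)

# After the initial sync completes, embed the query sequence with the same
# ESM2 model and search.
results = index.similarity_search(
    query_vector=[0.0] * 320,           # placeholder; use a real embedding
    columns=["id", "sequence"],
    num_results=10,
)
```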
Current implementation: Searches full-length protein sequences
Alternative approach: Split proteins into overlapping subsequences (chunks) before embedding
Tradeoffs:
| Approach | Pros | Cons |
|---|---|---|
| Full-length (current) | ✅ Simpler implementation<br>✅ Captures global structure | |
| Chunked sequences | ✅ Better local similarity<br>✅ Finds conserved domains | |
💬 Interested in chunking? We're considering adding this feature. Vote or comment on the issue to help us prioritize!
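For illustration only (this is not implemented in the repo), chunking could look something like the sketch below; window and stride sizes are arbitrary:

```python
# Hypothetical helper: split a protein into overlapping subsequences so each
# chunk can be embedded separately for better local similarity.
def chunk_sequence(sequence: str, window: int = 256, stride: int = 128):
    """Yield (start_offset, subsequence) pairs covering the whole protein."""
    if len(sequence) <= window:
        yield 0, sequence
        return
    for start in range(0, len(sequence) - window + 1, stride):
        yield start, sequence[start:start + window]
    # Cover the tail if the last full window did not reach the end.
    if (len(sequence) - window) % stride:
        yield len(sequence) - window, sequence[-window:]

chunks = list(chunk_sequence("M" * 600))  # toy sequence; yields 4 chunks
```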
See the LICENSE file.
In this repo we use several open source packages, models, and datasets, and we are thankful to those who developed them. We list the packages and datasets below, noting that datasets and models are optionally downloaded by the user when running the notebooks and are not packaged with this repo directly.
- PDB100: We optionally download PDB100 (seqres) from wwPDB.org, which releases data under its policy with data available under the CC0 1.0 Universal (CC0 1.0) Public Domain Dedication.
- UniRef50: We optionally download UniRef50 from UniProt under their licensing terms; this data is under the Creative Commons Attribution 4.0 International (CC BY 4.0) License. We thank the developers of UniProt. For more details, see: The UniProt Consortium, UniProt: the Universal Protein Knowledgebase in 2025, Nucleic Acids Research, 53 (2025).
We currently use the ESM models, downloaded from Hugging Face and available under the MIT license. The models used in this repo can be swapped for any model in the ESM2 family, but we focused on the ones below to keep the example code quick to build and run, since runtime grows with model size.
| Checkpoint name | Num layers | Num parameters | Source |
|---|---|---|---|
| esm2_t30_150M_UR50D | 30 | 150M | HuggingFace |
| esm2_t12_35M_UR50D | 12 | 35M | HuggingFace |
| esm2_t6_8M_UR50D | 6 | 8M | HuggingFace |
A list of packages used that are not part of the Python standard library is included below. In addition, in `app/src/msa.py` we package some modified code from AlphaFold (2.3.2), which is under the Apache 2.0 license, with changes stated in the file. The app also uses HMMER's jackhmmer binary for multiple sequence alignment and the Calm Seafoam Gradio theme, which is under the Apache 2.0 license.
| Tool | License | Purpose | Source |
|---|---|---|---|
| HMMER (jackhmmer) | BSD-3-Clause | Multiple sequence alignment in app | http://hmmer.org/ |
Note: HMMER is installed via apt-get during notebook execution and is not included in this repository. The jackhmmer binary is copied to Volumes for use by the Databricks app.
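A rough sketch of that setup step, assuming a notebook with access to a Unity Catalog Volume; the Volume path is hypothetical and the actual notebook cells may differ:

```python
# Sketch only: install HMMER on the driver and stage jackhmmer in a Volume
# so the Databricks app can call it.
import shutil
import subprocess

subprocess.run(["sudo", "apt-get", "install", "-y", "hmmer"], check=True)

src = shutil.which("jackhmmer")  # typically /usr/bin/jackhmmer after install
dst = "/Volumes/your_catalog/protein_search/binaries/jackhmmer"  # hypothetical
shutil.copy(src, dst)
```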
- Open an issue
