
Protein Search with Databricks Vector Search

Efficient, scalable protein sequence search and embedding using Databricks, Spark, and ESM2 models.

Protein embeddings and search at scale. We use ESM2 models to embed sequence data from two sources, PDB and UniRef50. These datasets are quite large; UniRef50 contains more than 60M sequences. We show how to use Spark to efficiently ingest large datasets stored as single text files into Delta tables for downstream processing, how to embed sequences with Protein Language Models (PLMs) at scale across multiple GPUs on Databricks, and finally how to build and search large-scale vector indices. For the vector search capability, we use Storage Optimized Vector Search on Databricks. We also provide code for a small frontend UI hosted on Databricks Apps.

App UI screenshot


🚀 Features

  • Scalable protein embedding with ESM2 models and Spark
  • Efficient ingestion of large datasets (PDB, UniRef50) to Delta tables
  • Distributed embedding using Databricks GPU clusters
  • Large-scale vector search with Databricks Vector Search (standard & storage-optimized)
  • Optional web UI for interactive search and alignment (Gradio, Databricks Apps)
  • Configurable pipeline via config.yaml

📖 Getting Started

Configuration

Before running the notebooks, edit config.yaml to customize:

```yaml
unity_catalog:
  catalog: "your_catalog"      # Your Unity Catalog name
  schema: "protein_search"     # Schema to store tables/models

steps_included:
  ingest_pdb100: true           # Download PDB100 dataset (~1M sequences)
  ingest_ur50: false            # Download UniRef50 (~70M sequences)
  sampling_percent_ur50: 5      # Sample % of UniRef50 for testing

embed_defaults:
  models:                       # Choose ESM model(s)
    - "facebook/esm2_t6_8M_UR50D"     # Fast, less accurate
    - "facebook/esm2_t30_150M_UR50D"  # Balanced size/performance for search, slower inference
```

💡 Tip: Start with PDB100 and the 8M model to validate your setup, then scale up.

Quick Start

  1. Clone this repository to your Databricks workspace
  2. Edit config.yaml to set your catalog/schema and choose datasets (PDB100 and/or UniRef50)
  3. Run the numbered notebooks in order (see detailed steps below)
  4. Access your search app via Databricks Apps

Estimated total time: 1-3 hours depending on dataset size and model choice

Pipeline Steps

| Step | Script | Purpose | Compute Type | Time Estimate |
|------|--------|---------|--------------|---------------|
| 0 | 00_download_datasets.py | Download PDB100/UniRef50 datasets | Single-node (16GB RAM) | 15-30 min |
| 1 | 01_register_esm_models.py | Register ESM2 models to Unity Catalog | Single-node (16GB RAM) | 5-10 min |
| 2 | 02_process_raw_protein_datasets.py | Parse FASTA files into Delta tables | Multi-node CPU (4x 4-core) | 10-15 min |
| 3 | 03_embed_protein_datasets_aiquery.py OR 03a_embed_protein_datasets_pandasudf.py OR SGC_embedding/ | Generate embeddings | see below | see below |
| 4 | 04_build_vectorstores.py | Create vector search indices | Serverless or single-node | 10-20 min |
| 5 | 05_search.py | Test vector search queries | Serverless or single-node | < 5 min |
| 6 | 06_create_app_with_sdk.py | Deploy Gradio app | Serverless | 5-10 min |

Choosing Your Embedding Method

You have three options for generating embeddings:

Option A: 03_embed_protein_datasets_aiquery.ipynb (Recommended for small datasets and beginners)

  • ✅ Simplest to use and maintain
  • ✅ Auto-scaling and managed infrastructure
  • ✅ Best for getting started quickly
  • Use when: You want the easiest path to completion (see the sketch below)
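
To illustrate Option A, here is a hedged sketch using the ai_query SQL function against a model serving endpoint. The endpoint name (esm2_embedder) and table names are placeholders, not the repo's actual identifiers:

```python
# Call a serving endpoint per row with ai_query and persist the embeddings.
df = spark.sql("""
    SELECT
        protein_id,
        sequence,
        ai_query('esm2_embedder', sequence) AS embedding
    FROM my_catalog.protein_search.pdb100_sequences
""")
df.write.mode("overwrite").saveAsTable("my_catalog.protein_search.pdb100_embedded")
```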

Option B: 03a_embed_protein_datasets_pandasudf.py

  • ✅ More control over GPU allocation (while still balancing Spark JVM needs)
  • ✅ Can be wrapped in a streaming Spark workload if desired
  • ⚠️ Requires manual GPU cluster setup (multi-GPU T4 or better)
  • ⚠️ Needs tuning of batch sizes and partitions
  • Use when: You need fine-grained control or are processing very large datasets (see the sketch below)
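
The following is a sketch of the Option B pattern, assuming a Spark DataFrame df with a sequence column, the 8M checkpoint, mean pooling over residues, and an illustrative batch size; the actual notebook may differ in batching, pooling, and GPU assignment:

```python
import pandas as pd
import torch
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import ArrayType, FloatType
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "facebook/esm2_t6_8M_UR50D"

@pandas_udf(ArrayType(FloatType()))
def embed_sequences(sequences: pd.Series) -> pd.Series:
    # Loaded per batch of partitions here for simplicity; production code
    # would cache the model on each executor.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModel.from_pretrained(MODEL_NAME).to(device).eval()
    out = []
    for i in range(0, len(sequences), 32):  # small batches bound GPU memory
        batch = sequences.iloc[i : i + 32].tolist()
        tokens = tokenizer(batch, padding=True, truncation=True,
                           max_length=1024, return_tensors="pt").to(device)
        with torch.no_grad():
            hidden = model(**tokens).last_hidden_state
        # Mean-pool over non-padding tokens to get one vector per sequence.
        mask = tokens["attention_mask"].unsqueeze(-1)
        pooled = (hidden * mask).sum(1) / mask.sum(1)
        out.extend(pooled.cpu().numpy().tolist())
    return pd.Series(out)

embedded = df.withColumn("embedding", embed_sequences("sequence"))
```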

Option C: SGC_embedding/ notebooks (Advanced)

  • ✅ Highest throughput for large-scale processing
  • ✅ Uses MosaicML streaming for optimal data loading
  • ⚠️ Most complex; requires serverless GPU compute (beta)
  • ⚠️ Additional setup with MDS datasets (see the sketch below)
  • Use when: Processing 10M+ sequences or you need maximum performance
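
As a rough picture of the MDS setup step, here is a minimal sketch using mosaicml-streaming; the Volumes paths, column names, and sample rows are illustrative only:

```python
from streaming import MDSWriter, StreamingDataset

# Stand-in for rows read from the Delta table of sequences.
rows = [{"protein_id": "P1", "sequence": "MKTAYIAKQR"}]

# Write the sequences out as MDS shards.
columns = {"protein_id": "str", "sequence": "str"}
with MDSWriter(out="/Volumes/my_catalog/protein_search/mds/ur50",
               columns=columns, compression="zstd") as writer:
    for row in rows:
        writer.write(row)

# Downstream, a StreamingDataset streams shards to each GPU worker.
dataset = StreamingDataset(local="/tmp/mds_cache",
                           remote="/Volumes/my_catalog/protein_search/mds/ur50",
                           shuffle=False)
```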

Notes on speed and accuracy

To let this example run end-to-end fairly quickly, we use the ESM2 8M model by default.

  • This model is not highly accurate; in our testing we've found the 150M model to be a good balance of speed and performance.
  • Other models with higher throughput exist and may be integrated here (please open an issue if you have requests).

Databricks Vector Search offers two endpoint types. Choose based on your dataset size (a code sketch follows the two lists below):

Standard Mode

  • Best for: < 10M embeddings
  • Latency: ~20-50ms
  • Cost: Lower for smaller datasets
  • Use case: PDB100, small-scale projects

Storage Optimized Mode (Used in this repo)

  • Best for: > 10M embeddings
  • Latency: ~300-500ms
  • Cost: More cost-effective at scale
  • Use case: Full UniRef50, large protein databases
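
For orientation, a hedged sketch of creating a storage-optimized endpoint and a Delta Sync index with databricks-vectorsearch, then querying it. The endpoint, table, and index names are placeholders, and the embedding dimension of 320 assumes the default 8M model:

```python
from databricks.vector_search.client import VectorSearchClient

client = VectorSearchClient()

# Storage-optimized endpoint, as used in this repo.
client.create_endpoint(name="protein-search", endpoint_type="STORAGE_OPTIMIZED")

index = client.create_delta_sync_index(
    endpoint_name="protein-search",
    index_name="my_catalog.protein_search.pdb100_index",
    source_table_name="my_catalog.protein_search.pdb100_embedded",
    pipeline_type="TRIGGERED",
    primary_key="protein_id",
    embedding_dimension=320,
    embedding_vector_column="embedding",
)

# query_embedding: a 320-dim vector for the query sequence, computed with the
# same ESM2 model used to build the index.
results = index.similarity_search(
    query_vector=query_embedding,
    columns=["protein_id", "sequence"],
    num_results=10,
)
```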

Learn more: See Vector Search best practices for detailed performance benchmarks and scaling guidance.

Advanced: Protein Chunking

Current implementation: Searches full-length protein sequences

Alternative approach: Split proteins into overlapping subsequences (chunks) before embedding

Tradeoffs:

| Approach | Pros | Cons |
|----------|------|------|
| Full-length (current) | ✅ Simpler implementation<br>✅ Captures global structure | ⚠️ May miss local motif similarity |
| Chunked sequences | ✅ Better local similarity<br>✅ Finds conserved domains | ⚠️ More complex pipeline<br>⚠️ Larger index size |
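
To make the alternative concrete, here is an illustrative chunking helper (not part of the current pipeline) that splits a sequence into overlapping windows before embedding; the window and stride sizes are arbitrary examples:

```python
def chunk_sequence(sequence: str, window: int = 256, stride: int = 128) -> list[str]:
    """Return overlapping subsequences covering the full protein."""
    if len(sequence) <= window:
        return [sequence]
    chunks = [sequence[i : i + window]
              for i in range(0, len(sequence) - window + 1, stride)]
    # Make sure the tail of the sequence is covered.
    if (len(sequence) - window) % stride != 0:
        chunks.append(sequence[-window:])
    return chunks
```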

🔬 Interested in chunking? We're considering adding this feature. Vote or comment on the issue to help us prioritize!


📄 License

See the LICENSE file.


📦 Open Source Packages, Models, and Datasets

In this repo we use several open source packages, models, and datasets, and we are thankful to those who developed them. We list them below. Note that datasets and models are downloaded by the user when running the notebooks in this repo; they are not packaged with the repo directly.

Datasets:

  • PDB100
  • UniRef50

Models:

We currently use the ESM models, downloaded from Hugging Face and available under the MIT license. The models used in this repo can be swapped for any model in the ESM2 family, but we focused on the checkpoints below to keep the example quick to build and run, since runtime grows with model size.

| Checkpoint name | Num layers | Num parameters | Source |
|-----------------|------------|----------------|--------|
| esm2_t30_150M_UR50D | 30 | 150M | HuggingFace |
| esm2_t12_35M_UR50D | 12 | 35M | HuggingFace |
| esm2_t6_8M_UR50D | 6 | 8M | HuggingFace |
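
A quick way to check a checkpoint's embedding dimension before building an index (the 8M model's hidden size is 320, the 150M model's is 640):

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("facebook/esm2_t6_8M_UR50D")
print(cfg.hidden_size)  # 320
```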

Packages:

Packages used that are not part of the Python standard library are listed below. In addition, app/src/msa.py packages some modified code from AlphaFold (2.3.2), which is under the Apache 2.0 license, with changes stated in the file. The app also uses HMMER's jackhmmer binary for multiple sequence alignment and the Calm Seafoam Gradio theme, which is under the Apache 2.0 license.

| Package | License | Source |
|---------|---------|--------|
| PySpark | Apache 2.0 | https://github.com/apache/spark |
| sentence_transformers | Apache 2.0 | https://github.com/UKPLab/sentence-transformers |
| mlflow | Apache 2.0 | https://github.com/mlflow/mlflow |
| databricks-vectorsearch | DBLicense | https://pypi.org/project/databricks-vectorsearch/ |
| databricks-sdk | Apache 2.0 | https://github.com/databricks/databricks-sdk-py |
| pandas | BSD 3-Clause | https://github.com/pandas-dev/pandas |
| delta | Apache 2.0 | https://github.com/delta-io/delta |
| numpy | BSD 3-Clause | https://github.com/numpy/numpy |
| biopython | Biopython License | https://github.com/biopython/biopython |
| gradio | Apache 2.0 | https://github.com/gradio-app/gradio |
| torch | BSD 3-Clause | https://github.com/pytorch/pytorch |
| torchvision | BSD 3-Clause | https://github.com/pytorch/vision |
| transformers | Apache 2.0 | https://github.com/huggingface/transformers |
| datasets | Apache 2.0 | https://github.com/huggingface/datasets |
| mosaicml-streaming | Apache 2.0 | https://github.com/mosaicml/streaming |
| pyarrow | Apache 2.0 | https://github.com/apache/arrow |
| requests | Apache 2.0 | https://github.com/psf/requests |
| cloudpickle | BSD 3-Clause | https://github.com/cloudpipe/cloudpickle |
| PyYAML | MIT | https://github.com/yaml/pyyaml |

System Tools & Binaries:

| Tool | License | Purpose | Source |
|------|---------|---------|--------|
| HMMER (jackhmmer) | BSD-3-Clause | Multiple sequence alignment in app | http://hmmer.org/ |

Note: HMMER is installed via apt-get during notebook execution and is not included in this repository. The jackhmmer binary is copied to Volumes for use by the Databricks app.
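
For context, an illustrative call to jackhmmer from Python; the binary path under Volumes, the database path, and the option values are placeholders for whatever the app configures:

```python
import subprocess

subprocess.run(
    [
        "/Volumes/my_catalog/protein_search/bin/jackhmmer",
        "-N", "1",                # a single search iteration
        "-A", "/tmp/query.sto",   # save the alignment in Stockholm format
        "--cpu", "4",
        "/tmp/query.fasta",       # query sequence(s)
        "/tmp/uniref50.fasta",    # target sequence database
    ],
    check=True,
)
```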


πŸ™‹β€β™€οΈ Questions? Feedback?

