EmmaEmb

EmmaEmb is a Python library for the initial comparison of diverse embedding spaces in molecular biology. Using user-defined feature data that describes the natural grouping of data points, EmmaEmb lets users compare global statistics and examine how the clustering of these natural groups differs across embedding spaces.

Although designed for embeddings of molecular biology data (e.g. protein sequences), the library is general and can be applied to any type of embedding space.

Overview

Workflow
Input
Features
Installation
Scripts for protein language model embeddings
License

Workflow

The following figure provides an overview of the EmmaEmb workflow:

EmmaEmb enables the comparative analysis of information captured in different embedding spaces. The workflow consists of the following steps:

A. Embedding Generation: Starting with a set of samples (e.g., proteins or genes), embeddings are extracted from multiple foundation models, which may differ in architecture or training.

B. Feature Integration: Sample-specific categorical data (e.g., functional annotations, protein families) is incorporated into the analysis.

C. Feature Distribution Analysis: The distribution of categorical features is assessed within local neighborhoods in each embedding space, using k-nearest neighbors to quantify class consistency and overlap.

D. Pairwise Space Comparison: Embedding spaces are compared based on pairwise distances and neighborhood similarity to identify global and local differences. Regions with high divergence can be further examined using feature data to understand variations in model representation.

Input

EmmaEmb is centered around the Emma object, which serves as the core of the library. The following input data is required:

Feature Data: A pandas DataFrame containing sample-specific categorical features. Each row corresponds to a sample, and each column corresponds to a feature. The first column should contain the sample IDs.
Embedding Spaces: Precomputed embeddings for each sample (scripts for generating embeddings from protein language models are provided). Embeddings should be stored in a directory with one file per sample. The file name should correspond to the sample ID, and the file should contain the embedding as a list of floats. Multiple embedding spaces can be added to the Emma object for comparison. Dimensions do not need to match across spaces.

The Emma object is initialized with feature data and embedding spaces can be added incrementally.
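
Before adding an embedding space, it can help to confirm that every sample ID in the feature data has a matching embedding file. The sketch below is not part of EmmaEmb: `find_missing_embeddings` is a hypothetical helper, and matching files by their stem (the file name without its extension) is an assumption about how file names map to sample IDs.

```python
from pathlib import Path


def find_missing_embeddings(sample_ids, embedding_dir):
    """Return the sample IDs that have no file in embedding_dir.

    Assumption: files are matched by stem, so "P12345.npy" counts
    for the sample ID "P12345".
    """
    available = {
        path.stem for path in Path(embedding_dir).iterdir() if path.is_file()
    }
    # Preserve the order of the input IDs in the result.
    return [sample_id for sample_id in sample_ids if sample_id not in available]
```

The sample IDs would typically come from the first column of the feature DataFrame, for example `feature_data.iloc[:, 0]`.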

Features

Visualisation after dimensionality reduction

EmmaEmb supports dimensionality reduction techniques such as PCA, t-SNE, and UMAP to visualize and analyze high-dimensional embeddings in lower-dimensional spaces. The plots can be colour coded by a feature of interest from the feature data.

Computation of pairwise distances

To make embedding spaces comparable, EmmaEmb analyses rely on comparing not individual embeddings, but the relationships between them. The library calculates pairwise distances between samples in each embedding space. Users can select from multiple distance metrics, including:

Euclidean
Cosine
Manhattan

Some analyses consider only the k nearest neighbors of each sample, which are derived from the pairwise distances. The pairwise distances are calculated only once and can be reused across multiple analyses.

The distances can be visually inspected in a heatmap.
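
To make the three metrics concrete, the following is a minimal pure-Python sketch of how a symmetric pairwise distance matrix could be computed. It does not show EmmaEmb's internal implementation; the function names and the `METRICS` lookup are illustrative assumptions.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def cosine(u, v):
    # Cosine distance = 1 - cosine similarity; undefined for zero vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


METRICS = {"euclidean": euclidean, "cosine": cosine, "manhattan": manhattan}


def pairwise_distances(embeddings, metric="euclidean"):
    """Return a symmetric n x n distance matrix as a list of lists."""
    dist = METRICS[metric]
    n = len(embeddings)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Each pair is computed once and mirrored across the diagonal.
            matrix[i][j] = matrix[j][i] = dist(embeddings[i], embeddings[j])
    return matrix
```

Because the result depends only on relationships between samples, matrices from spaces with different dimensionality have the same shape and can be compared directly.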

Feature distribution across spaces

For a selected feature from the feature data, EmmaEmb provides two metrics to assess the alignment of features across embedding spaces:

KNN feature alignment scores: Quantify the alignment of features by examining the nearest neighbors of each sample in different spaces. This score reveals the extent to which samples with a shared feature are embedded close to each other in different spaces.
KNN class similarity matrix: Measure the consistency of class-level relationships by assessing the overlap of nearest neighbors for samples within the same class across spaces. This provides insights into the relationships between classes in different embedding spaces.
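
One plausible reading of a KNN feature alignment score is the fraction of each sample's k nearest neighbors that share its class label. The sketch below illustrates that reading under stated assumptions; it is not EmmaEmb's exact definition, and `knn_indices` and `knn_feature_alignment` are hypothetical names. It expects a precomputed distance matrix, `k >= 1`, and more than `k` samples.

```python
def knn_indices(dist_matrix, k):
    """For each sample, return the indices of its k nearest neighbors (self excluded)."""
    neighbors = []
    for i, row in enumerate(dist_matrix):
        order = sorted((j for j in range(len(row)) if j != i), key=lambda j: row[j])
        neighbors.append(order[:k])
    return neighbors


def knn_feature_alignment(dist_matrix, labels, k):
    """Per-sample fraction of the k nearest neighbors that share the sample's label."""
    scores = []
    for i, nbrs in enumerate(knn_indices(dist_matrix, k)):
        same = sum(labels[j] == labels[i] for j in nbrs)
        scores.append(same / len(nbrs))
    return scores
```

Computing these scores for the same feature in each embedding space, and then comparing them per sample or per class, shows where one space groups a class more tightly than another.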

Pairwise space comparison

EmmaEmb provides two metrics to directly compare two embedding spaces:

Global comparison of pairwise distances: Compare the distribution of pairwise distances between samples in two embedding spaces. This metric is useful for assessing the overall similarity of the two spaces. The pairwise distances can also be visualized in a scatter plot.
Cross-space neighborhood similarity: Assess the similarity of local neighborhoods in two embedding spaces. This metric is useful for identifying regions where the two spaces diverge. The similarity is calculated based on the overlap of k-nearest neighbors between samples in the two spaces. The regions of divergence can be characterized using the feature data.
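
Both comparisons can be sketched from two precomputed distance matrices over the same samples. The code below is an illustration under assumptions, not the library's implementation: it uses the Pearson correlation of pairwise distances as one possible global statistic and the fraction of shared k nearest neighbors as the local one. All function names are hypothetical.

```python
import math


def knn_sets(dist_matrix, k):
    """For each sample, return the set of its k nearest neighbors (self excluded)."""
    sets = []
    for i, row in enumerate(dist_matrix):
        order = sorted((j for j in range(len(row)) if j != i), key=lambda j: row[j])
        sets.append(set(order[:k]))
    return sets


def neighborhood_overlap(dist_a, dist_b, k):
    """Per-sample fraction of the k nearest neighbors shared between two spaces."""
    pairs = zip(knn_sets(dist_a, k), knn_sets(dist_b, k))
    return [len(a & b) / k for a, b in pairs]


def pairwise_distance_pearson(dist_a, dist_b):
    """Pearson correlation over the upper triangles of two distance matrices."""
    n = len(dist_a)
    xs = [dist_a[i][j] for i in range(n) for j in range(i + 1, n)]
    ys = [dist_b[i][j] for i in range(n) for j in range(i + 1, n)]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

Samples with a low overlap value mark regions where the two spaces diverge; grouping those samples by a column of the feature data is one way to characterize such regions.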

Installation

You can install the EmmaEmb library through pip, or access the examples locally by cloning the GitHub repo.

Installing the EmmaEmb library

pip install emmaemb

Cloning the EmmaEmb repo

git clone https://github.com/broadinstitute/EmmaEmb

cd EmmaEmb                     # enter project directory
pip3 install .                 # install the package and its dependencies
jupyter lab colab_notebooks    # open notebook examples in jupyter for local exploration

Getting Started

To get started with the EmmaEmb library, load the metadata and embeddings, and initialize the Emma object. The following code snippet demonstrates how to use EmmaEmb to compare two embedding spaces:

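import pandas as pd

from emmaemb import Emma
from emmaemb.vizualization import (
    plot_emb_space,
    plot_knn_alignment_across_embedding_spaces,
    plot_pairwise_distance_comparison,
)

# Load feature data; the first column holds the sample IDs.
# "feature_data.csv" is a placeholder file name.
feature_data = pd.read_csv("feature_data.csv")

# Initialize Emma object with feature data
emma = Emma(feature_data=feature_data)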

# Add embedding spaces
emma.add_embedding_space("ProtT5", "embeddings/prot_t5_embeddings")
emma.add_embedding_space("ESM2", "embeddings/esm2_embeddings")

# Compute pairwise distances
emma.calculate_pairwise_distances("ProtT5", "cosine")
emma.calculate_pairwise_distances("ESM2", "cosine")

# Plot space after dimensionality reduction
fig_1 = plot_emb_space(
    emma, emb_space="ProtT5", color_by="enzyme_class", method="PCA"
)

# Analyze global comparison of pairwise distances
fig_2 = plot_pairwise_distance_comparison(
    emma, emb_space_x="ProtT5", emb_space_y="ESM2", metric="cosine", group_by="species"
)

# Analyze feature distribution across spaces
fig_3 = plot_knn_alignment_across_embedding_spaces(
    emma, feature="enzyme_class", k=10, metric="cosine"
)

A more detailed example can be found in the notebook.

Scripts for protein language model embeddings

The repository also contains a wrapper script for retrieving protein embeddings from a diverse set of pre-trained Protein Language Models.

The script includes a heuristic to chunk and aggregate long sequences to ensure compatibility with the models' input size constraints.
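
The exact heuristic is not described here. As a rough illustration of the general idea, the sketch below splits a sequence into consecutive non-overlapping chunks, embeds each chunk, and combines the chunk embeddings with a length-weighted mean. The chunking scheme, the weighting, and all names (`chunk_sequence`, `aggregate_embeddings`, `embed_long_sequence`, `embed_fn`) are assumptions, not the script's actual behavior.

```python
def chunk_sequence(sequence, max_len):
    """Split a sequence into consecutive, non-overlapping chunks of at most max_len."""
    return [sequence[i:i + max_len] for i in range(0, len(sequence), max_len)]


def aggregate_embeddings(chunk_embeddings, chunk_lengths):
    """Length-weighted mean of per-chunk embedding vectors."""
    total = sum(chunk_lengths)
    dim = len(chunk_embeddings[0])
    return [
        sum(emb[d] * length for emb, length in zip(chunk_embeddings, chunk_lengths))
        / total
        for d in range(dim)
    ]


def embed_long_sequence(sequence, embed_fn, max_len):
    """Embed a sequence of any length with a model limited to max_len residues.

    embed_fn is a stand-in for a model call that maps a chunk to a list of floats.
    """
    chunks = chunk_sequence(sequence, max_len)
    embeddings = [embed_fn(chunk) for chunk in chunks]
    return aggregate_embeddings(embeddings, [len(chunk) for chunk in chunks])
```

Weighting by chunk length keeps a short trailing chunk from contributing as much as a full-length one.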

The script supports the following models:

Contact

If you have any questions or suggestions, please feel free to reach out to the authors: [email protected].

More information about the library can be found in our pre-print on bioRxiv: Decoding protein language models: insights from embedding space analysis.

License

This project is licensed under the MIT License - see the LICENSE file for details.
