Segmentation is the task of identifying object boundaries: given an image and a point on an object, the goal is to produce a mask delineating that object's extent. Traditional segmentation methods often rely on category labels (e.g., “car” or “tree”). In contrast, we draw from developmental psychology the notion of Spelke objects: groupings of physical entities that reliably move together under applied forces, a concept first introduced by Liz Spelke in [Principles of Object Perception](https://www.harvardlds.org/wp-content/uploads/2017/01/Spelke1990-1.pdf). Because these segments are defined by category-agnostic causal motion relationships, they reflect how objects interact and respond in the real world, making them especially relevant for physical reasoning and robotic manipulation.
SpelkeBench is a ~500-image evaluation dataset designed to assess whether segmentation algorithms can identify such segments. The dataset spans two complementary domains: high-resolution natural imagery sourced from EntitySeg and real-world robotic interaction scenes from Open X-Embodiment. Together, these domains support evaluation across both unconstrained natural scenes and structured physical environments. Example segments from SpelkeBench are shown below, compared against SAM and EntitySeg segments to illustrate that SpelkeBench annotations align more closely with the Spelke notion of objecthood.

SpelkeBench provides a standardized evaluation framework for Spelke segmentation with:
- ~500 images spanning natural and robotic scenes
- Ground-truth segments that align with the Spelke concept, defined by physical motion coherence
- Virtual poke points (centroids) indicating where to apply segmentation queries
Clone this repository and download the SpelkeBench dataset:
```bash
git clone https://github.com/neuroailab/SpelkeBench.git
cd SpelkeBench
bash download_spelke_bench.sh
```

This will download `spelke_bench.h5` to the `datasets/` directory.
The dataset is provided as a single HDF5 file where each key corresponds to an image sample containing:
| Field | Description | Shape |
|---|---|---|
| `rgb` | Input RGB image | `[H, W, 3]` |
| `segment` | Ground truth Spelke segments | `[N, H, W]` |
| `centroid` | Virtual poke locations (x, y) | `[N, 2]` |

Here, `N` is the number of ground truth segments/centroids for that image.
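For reference, here is a minimal sketch of loading one sample with `h5py`, assuming each top-level key is a group holding the three fields above (the exact on-disk layout may differ slightly):

```python
import h5py
import numpy as np

# Minimal sketch: inspect one SpelkeBench sample.
with h5py.File("datasets/spelke_bench.h5", "r") as f:
    img_name = sorted(f.keys())[0]            # e.g., an EntitySeg-derived key
    sample = f[img_name]
    rgb = np.array(sample["rgb"])             # [H, W, 3] input image
    segments = np.array(sample["segment"])    # [N, H, W] ground-truth masks
    centroids = np.array(sample["centroid"])  # [N, 2] virtual poke points (x, y)
    print(img_name, rgb.shape, segments.shape, centroids.shape)
```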
SpelkeBench provides tools to evaluate any segmentation model that accepts point prompts. The evaluation pipeline handles dataset loading, parallel inference, and metric computation.
```bash
conda create -n spelkebench python=3.10 -y
conda activate spelkebench
pip install -e .
```

This installs the command-line utilities: `spelkebench-infer`, `spelkebench-launch`, and `spelkebench-evaluate`.
Create a model class that inherits from `spelke_bench.models.segmentation_class.SegmentationModel`:
```python
from spelke_bench.models.segmentation_class import SegmentationModel
import numpy as np


class YourSegmentationModel(SegmentationModel):
    def __init__(self):
        """Initialize your model, load weights, etc."""
        super().__init__()
        # Your initialization code here

    def run_inference(self, input_image, poke_point):
        """
        Perform segmentation based on a poke point.

        Args:
            input_image (np.ndarray): RGB image of shape [H, W, 3] with values in [0, 255] range
            poke_point (tuple): (x, y) coordinates

        Returns:
            np.ndarray: Binary segmentation mask of shape [H, W] with values in {0, 1}
        """
        # Your segmentation logic here
        pass
```
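For illustration, here is a hypothetical toy implementation of this interface: a baseline that simply predicts a fixed-radius disk around the poke point. It satisfies the contract above but is not a real segmentation model:

```python
import numpy as np

from spelke_bench.models.segmentation_class import SegmentationModel


class DiskBaseline(SegmentationModel):
    """Toy baseline: predicts a fixed-radius disk centered on the poke point."""

    def __init__(self, radius=50):
        super().__init__()
        self.radius = radius

    def run_inference(self, input_image, poke_point):
        h, w = input_image.shape[:2]
        x, y = poke_point
        ys, xs = np.ogrid[:h, :w]
        # Binary mask: 1 inside the disk, 0 elsewhere.
        mask = (xs - x) ** 2 + (ys - y) ** 2 <= self.radius ** 2
        return mask.astype(np.uint8)
```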
For quick testing or debugging on a subset of images:

```bash
spelkebench-infer \
    --model_name your_model.SegmentationModel \
    --dataset_path ./datasets/spelke_bench.h5 \
    --output_dir ./results/my_model \
    --device cuda:0 \
    --img_names entityseg_1_image2926 entityseg_2_image1258
```

For cluster environments with multiple nodes (e.g., 4 nodes with 4 GPUs each):
On each node, run:
```bash
spelkebench-launch \
    --gpus 0 1 2 3 \
    --dataset_path ./datasets/spelke_bench.h5 \
    --output_dir ./results/my_model \
    --num_splits 4 \
    --split_num <node_id> \
    --model_name your_model.SegmentationModel
```

Replace `<node_id>` with 0, 1, 2, or 3 for each respective node.
Once inference is complete, compute metrics:
```bash
spelkebench-evaluate \
    --input_dir ./results/my_model \
    --output_dir ./results/my_model/metrics \
    --dataset_path ./datasets/spelke_bench.h5
```

This will:
- Generate visual comparisons between predictions and ground truth
- Calculate and print Average Recall (AR) and mean IoU scores
- Save per-image metrics and visualizations
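For intuition, mean IoU averages the intersection-over-union between each predicted mask and its ground-truth segment. A generic sketch of the per-mask IoU computation (not necessarily the benchmark's exact implementation) is:

```python
import numpy as np

def mask_iou(pred, gt):
    """Intersection-over-union of two binary [H, W] masks."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 0.0
    intersection = np.logical_and(pred, gt).sum()
    return float(intersection) / float(union)
```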
| Argument | Description |
|---|---|
| `--model_name` | Python path to your model class |
| `--dataset_path` | Path to `spelke_bench.h5` |
| `--output_dir` | Directory for saving predictions |
| `--device` | GPU device (e.g., `cuda:0`) |
| `--img_names` | Specific image keys to process |
| `--gpus` | GPU IDs for parallel processing |
| `--num_splits` | Total splits for multi-node |
| `--split_num` | Current split index |