Segmentation is the task of identifying object boundaries: given an image and a point on an object, the goal is to produce a mask delineating that object's extent. Traditional segmentation methods often rely on category labels (e.g., “car” or “tree”). In contrast, we draw from developmental psychology the notion of Spelke objects: groupings of physical entities that reliably move together under applied forces, a concept first introduced by Liz Spelke in [Principles of Object Perception](https://www.harvardlds.org/wp-content/uploads/2017/01/Spelke1990-1.pdf). Because these segments are defined by category-agnostic causal motion relationships, they reflect how objects interact and respond in the real world, making them especially relevant for physical reasoning and robotic manipulation.
SpelkeBench is a ~500-image evaluation dataset designed to assess whether segmentation algorithms can identify such segments. The dataset spans two complementary domains: high-resolution natural imagery sourced from EntitySeg and real-world robotic interaction scenes from Open X-Embodiment. Together, these domains support evaluation across both unconstrained natural scenes and structured physical environments. Example segments from SpelkeBench are shown below, compared against SAM and EntitySeg segments to illustrate that SpelkeBench annotations align more closely with the Spelke notion of objecthood.

SpelkeBench provides a standardized evaluation framework for Spelke segmentation with:
- ~500 images spanning natural and robotic scenes
- Ground-truth segments that align with the Spelke concept, defined by physical motion coherence
- Virtual poke points (centroids) indicating where to apply segmentation queries
Clone this repository and download the SpelkeBench dataset:
```bash
git clone https://github.com/neuroailab/SpelkeBench.git
cd SpelkeBench
bash download_spelke_bench.sh
```

This will download `spelke_bench.h5` to the `datasets/` directory.
The dataset is provided as a single HDF5 file where each key corresponds to an image sample containing:
| Field | Description | Shape |
|---|---|---|
| `rgb` | Input RGB image | `[H, W, 3]` |
| `segment` | Ground truth Spelke segments | `[N, H, W]` |
| `centroid` | Virtual poke locations (x, y) | `[N, 2]` |

Here, `N` is the number of ground truth segments/centroids for that image.
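For reference, here is a minimal sketch of loading one sample with `h5py`, assuming each top-level key is a group holding the three fields above (the exact on-disk layout may differ slightly):

```python
import h5py
import numpy as np

# Minimal sketch: inspect one SpelkeBench sample.
with h5py.File("datasets/spelke_bench.h5", "r") as f:
    img_name = sorted(f.keys())[0]            # e.g., an EntitySeg-derived key
    sample = f[img_name]
    rgb = np.array(sample["rgb"])             # [H, W, 3] input image
    segments = np.array(sample["segment"])    # [N, H, W] ground-truth masks
    centroids = np.array(sample["centroid"])  # [N, 2] virtual poke points (x, y)
    print(img_name, rgb.shape, segments.shape, centroids.shape)
```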
SpelkeBench provides tools to evaluate any segmentation model that accepts point prompts. The evaluation pipeline handles dataset loading, parallel inference, and metric computation.
```bash
conda create -n spelkebench python=3.10 -y
conda activate spelkebench
pip install -e .
```

This installs the command-line utilities: `spelkebench-infer`, `spelkebench-launch`, and `spelkebench-evaluate`.
Create a model class that inherits from `spelke_bench.models.segmentation_class.SegmentationModel`:
```python
from spelke_bench.models.segmentation_class import SegmentationModel
import numpy as np


class YourSegmentationModel(SegmentationModel):
    def __init__(self):
        """Initialize your model, load weights, etc."""
        super().__init__()
        # Your initialization code here

    def run_inference(self, input_image, poke_point):
        """
        Perform segmentation based on a poke point.

        Args:
            input_image (np.ndarray): RGB image of shape [H, W, 3] with values in [0, 255] range
            poke_point (tuple): (x, y) coordinates

        Returns:
            np.ndarray: Binary segmentation mask of shape [H, W] with values in {0, 1}
        """
        # Your segmentation logic here
        pass
```
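For illustration, here is a hypothetical toy implementation of this interface: a baseline that simply predicts a fixed-radius disk around the poke point. It satisfies the contract above but is not a real segmentation model:

```python
import numpy as np

from spelke_bench.models.segmentation_class import SegmentationModel


class DiskBaseline(SegmentationModel):
    """Toy baseline: predicts a fixed-radius disk centered on the poke point."""

    def __init__(self, radius=50):
        super().__init__()
        self.radius = radius

    def run_inference(self, input_image, poke_point):
        h, w = input_image.shape[:2]
        x, y = poke_point
        ys, xs = np.ogrid[:h, :w]
        # Binary mask: 1 inside the disk, 0 elsewhere.
        mask = (xs - x) ** 2 + (ys - y) ** 2 <= self.radius ** 2
        return mask.astype(np.uint8)
```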
For quick testing or debugging on a subset of images:

```bash
spelkebench-infer \
    --model_name your_model.SegmentationModel \
    --dataset_path ./datasets/spelke_bench.h5 \
    --output_dir ./results/my_model \
    --device cuda:0 \
    --img_names entityseg_1_image2926 entityseg_2_image1258
```

For cluster environments with multiple nodes (e.g., 4 nodes with 4 GPUs each):
On each node, run:
```bash
spelkebench-launch \
    --gpus 0 1 2 3 \
    --dataset_path ./datasets/spelke_bench.h5 \
    --output_dir ./results/my_model \
    --num_splits 4 \
    --split_num <node_id> \
    --model_name your_model.SegmentationModel
```

Replace `<node_id>` with 0, 1, 2, or 3 for each respective node.
Once inference is complete, compute metrics:
```bash
spelkebench-evaluate \
    --input_dir ./results/my_model \
    --output_dir ./results/my_model/metrics \
    --dataset_path ./datasets/spelke_bench.h5
```

This will:
- Generate visual comparisons between predictions and ground truth
- Calculate and print Average Recall (AR) and mean IoU scores
- Save per-image metrics and visualizations
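For intuition, mean IoU averages the intersection-over-union between each predicted mask and its ground-truth segment. A generic sketch of the per-mask IoU computation (not necessarily the benchmark's exact implementation) is:

```python
import numpy as np

def mask_iou(pred, gt):
    """Intersection-over-union of two binary [H, W] masks."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 0.0
    intersection = np.logical_and(pred, gt).sum()
    return float(intersection) / float(union)
```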
| Argument | Description |
|---|---|
| `--model_name` | Python path to your model class |
| `--dataset_path` | Path to `spelke_bench.h5` |
| `--output_dir` | Directory for saving predictions |
| `--device` | GPU device (e.g., `cuda:0`) |
| `--img_names` | Specific image keys to process |
| `--gpus` | GPU IDs for parallel processing |
| `--num_splits` | Total splits for multi-node |
| `--split_num` | Current split index |