Merged
6 changes: 6 additions & 0 deletions .bumpversion.cfg
@@ -0,0 +1,6 @@
[bumpversion]
current_version = 0.0.0
commit = True
tag = True

[bumpversion:file:raptor/__init__.py]
39 changes: 39 additions & 0 deletions .github/workflows/docs.yml
@@ -0,0 +1,39 @@
name: docs

on:
push:
branches: ["main"]
workflow_dispatch:

permissions:
contents: read
pages: write
id-token: write

concurrency:
group: "pages"
cancel-in-progress: false

jobs:
build:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/configure-pages@v5
- uses: actions/jekyll-build-pages@v1
with:
source: ./docs
destination: ./_site
- uses: actions/upload-pages-artifact@v3
with:
path: ./_site

deploy:
runs-on: ubuntu-latest
needs: build
environment:
name: github-pages
url: ${{ steps.deployment.outputs.page_url }}
steps:
- id: deployment
uses: actions/deploy-pages@v4
20 changes: 20 additions & 0 deletions .github/workflows/tests.yml
@@ -0,0 +1,20 @@
name: tests

on:
push:
pull_request:

jobs:
pytest:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: "3.11"
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install tox torch numpy
- name: Run tests
run: tox -q
1 change: 0 additions & 1 deletion .gitignore
@@ -29,7 +29,6 @@ build/
*.sqlite

# Data (common in research repos)
data/
outputs/
wandb/
models/
26 changes: 26 additions & 0 deletions .pylintrc
@@ -0,0 +1,26 @@
[MASTER]
disable=
R0903, # allow classes that expose only one public method
R0914, # allow many local variables
E0401, # pending issue with pylint see pylint#2603
E1123, # issues between pylint and tensorflow since 2.2.0
E1120, # see pylint#3613
C3001, # lambda function as variable
C0116, C0114, # docstring
C0103, # we refer and define scientific notation
R1721, W0107, # use pass for abstract class
E1102, # pylint false positive when .to() is used

[FORMAT]
max-line-length=120

[DESIGN]
max-args=12
max-attributes=12

[SIMILARITIES]
min-similarity-lines=6
ignore-comments=yes
ignore-docstrings=yes
ignore-imports=no

[TYPECHECK]
ignored-modules=torch, cv2
51 changes: 27 additions & 24 deletions README.md
@@ -1,53 +1,56 @@
# Block-Recurrent Dynamics in ViTs

<div align="center">
<img src="docs/raptor_logo.png" width="25%" />
</div>

**Authors:** Mozes Jacobs\*, Thomas Fel\*, Richard Hakim\*, Alessandra Brondetta, Demba Ba, T. Andy Keller
# Block-Recurrent Dynamics in ViTs (Raptor)

<img src="assets/raptor_logo.png" width="33%" />

[![tests](https://github.com/KempnerInstitute/raptor/actions/workflows/tests.yml/badge.svg?branch=main)](https://github.com/KempnerInstitute/raptor/actions/workflows/tests.yml)
[![arXiv](https://img.shields.io/badge/arXiv-1234.56789-b31b1b.svg)](https://arxiv.org/abs/1234.56789)

[**Mozes Jacobs**](https://scholar.google.com/citations?user=8Dm0KfQAAAAJ&hl=en)$^{\star1}$ &nbsp; [**Thomas Fel**](https://thomasfel.me)$^{\star1}$ &nbsp; [**Richard Hakim**](https://richhakim.com/)$^{\star1}$
<br>
[**Alessandra Brondetta**](https://alessandrabrondetta.github.io/)$^{2}$ &nbsp; [**Demba Ba**](https://scholar.google.com/citations?user=qHiACEgAAAAJ&hl=en)$^{1,3}$ &nbsp; [**T. Andy Keller**](https://akandykeller.github.io/)$^{1}$

\* Equal contribution. Correspondence to {mozesjacobs,tfel,rhakim,takeller}@g.harvard.edu
<small>

$^1$**Kempner Institute, Harvard University** &nbsp; $^2$**Osnabrück University** &nbsp; $^3$**Harvard University**

</small>

</div>

---

**Abstract:**
As Vision Transformers (ViTs) become standard backbones across vision, a mechanistic account of their computational phenomenology is now essential.
Despite architectural cues that hint at dynamical structure, there is no settled framework that interprets Transformer depth as a well-characterized flow.
In this work, we introduce the **Block-Recurrent Hypothesis (BRH)**, arguing that trained ViTs admit a block-recurrent depth structure such that the computation of the original L blocks can be accurately rewritten using only k << L distinct blocks applied recurrently.
Across diverse ViTs, between-layer representational similarity matrices suggest a small number of contiguous phases. Yet, representational similarity does not necessarily translate to functional similarity.
To determine whether these phases reflect genuinely reusable computation, we operationalize our hypothesis in the form of block recurrent surrogates of pretrained ViTs, which we call **R**ecurrent **A**pproximations to **P**hase-structured **T**ransf**OR**mers (`Raptor`).
Using small-scale ViTs, we demonstrate that phase-structure metrics correlate with our ability to accurately fit `Raptor`, and identify the role of training and stochastic depth in promoting the recurrent block structure.
We then provide an empirical existence proof for BRH in foundation models by showing that we can train a `Raptor` model to recover $96\%$ of DINOv2 ImageNet-1k linear probe accuracy in only 2 blocks while maintaining equivalent computational cost.
To provide a mechanistic account of these observations, we leverage our hypothesis to develop a program of **Dynamical Interpretability**. We find **(i)** directional convergence into class-dependent angular basins with self-correcting trajectories under small perturbations, **(ii)** token-specific dynamics, where `cls` executes sharp late reorientations while `patch` tokens exhibit strong late-stage coherence reminiscent of a mean-field effect and converge rapidly toward their mean direction, and **(iii)** a collapse of the update to low rank in late depth, consistent with convergence to low-dimensional attractors.<br>
**tl;dr** We introduce the Block-Recurrent Hypothesis (BRH), motivated by the observation that foundation models such as DINOv2 can be rewritten with only two recurrent blocks while recovering 96% of the original accuracy. Building on this framework, we develop a Dynamical Interpretability approach that treats token evolution through layers as trajectories, and we show that these trajectories converge into class-dependent angular basins while late-stage updates collapse onto low-rank attractors.

Altogether, we find that a compact recurrent program emerges along the depth of ViTs, pointing to a low-complexity normative solution that enables these models to be studied through principled dynamical systems analysis.
Ultimately, the study reveals that Vision Transformers seem to converge naturally toward compact, iterative programs rather than unique layer-by-layer transformations, indicating lower algorithmic (Kolmogorov) complexity.

---

## Setup

### Environment
To run the code, you will need to create a mamba (or conda) environment from the `environment.yml` file.
Create and activate the environment with
Create and activate the environment with
```bash
mamba env create -f environment.yml
mamba activate raptor
```

### Paths
Edit src/paths.py to have the correct absolute paths to different datasets.
Edit `src/paths.py` to have the correct absolute paths to different datasets.

### Extracting DINOv2 Activations for ImageNet-1k
For ImageNet, we precompute the DINOv2 activations so that `Raptor` can train faster.
For ImageNet, we precompute the DINOv2 activations so that `Raptor` can train faster.
We provide a script to extract the activations from the ImageNet-1k dataset. This script is available in the `data` directory.
This script takes around 5 hours to run on a single H100 GPU, and the stored activations require substantial disk space.
```bash
cd data
python 000_precompute_dinov2_act.py
python precompute_dinov2_act.py
```

### Download Pretrained Classifiers
Download the DINOv2 linear heads from Meta's [repository](https://github.com/facebookresearch/dinov2).
Download the DINOv2 linear heads from Meta's [repository](https://github.com/facebookresearch/dinov2).
These are used during training of `Raptor`.

```bash
@@ -94,11 +97,11 @@ python train_probe.py --variant raptor3 --model_seed 1101 --seed 6005
## Reproducing Foundation Models Results (Section 3)
To reproduce the results for the foundation models section (Table 1 and Figure 7), do the following:

1. Determine max-cut segmentations. This has been done for you in src/000_max_cut_dinov2_base.ipynb.
1. Determine max-cut segmentations. This has been done for you in src/max_cut_dinov2_base.ipynb.
2. Train each block independently.
```bash
cd src/runs
sbatch 001_blocks.sh
sbatch blocks.sh
```
3. Train the full model with the pretrained blocks.
```bash
@@ -126,4 +129,4 @@ cd src
python aggregate_results.py
```
6. Figure 7
Run the notebook in src/imagenet_probes/101_eval_error_bars.ipynb.
Run the notebook in src/imagenet_probes/101_eval_error_bars.ipynb.
Binary file added assets/raptor_logo.png
30 changes: 10 additions & 20 deletions data/000_precompute_dinov2_act.py → data/precompute_dinov2_act.py
@@ -1,3 +1,9 @@
from numcodecs import Blosc
import zarr
from tqdm import tqdm
import threading
import queue
from paths import IMAGENET_TRAIN_DIR, DATA_DIR
import os
import random
import numpy as np
@@ -10,7 +16,6 @@

import sys
sys.path.append("../src/")
from paths import IMAGENET_TRAIN_DIR, DATA_DIR

IMAGENET_DEFAULT_MEAN = (0.485, 0.456, 0.406)
IMAGENET_DEFAULT_STD = (0.229, 0.224, 0.225)
@@ -46,6 +51,7 @@
num_workers=4, pin_memory=True, persistent_workers=True, prefetch_factor=2
)


def inference(dino, x):
with torch.no_grad():
x = x.cuda().float()
@@ -60,16 +66,7 @@ def inference(dino, x):

return activations.transpose(0, 1)

import os
import queue
import threading
import numpy as np
import torch
from tqdm import tqdm
import zarr
from numcodecs import Blosc

# Zarr parameters
num_samples = len(dataset)
num_layers = 13
num_tokens = 261
@@ -88,9 +85,9 @@ def inference(dino, x):
zarr_version=2
)

# Writer queue
write_queue = queue.Queue(maxsize=24)


def writer_thread(zarr_array, q):
while True:
item = q.get()
@@ -108,11 +105,10 @@ def writer_thread(zarr_array, q):
print(f"[Writer Error] At index {index}: {e}")
q.task_done()

# Launch background writer thread

writer = threading.Thread(target=writer_thread, args=(z, write_queue))
writer.start()

# Main loop
index = 0
for batch in tqdm(dataloader, desc="Writing activations to Zarr"):
x, _ = batch
@@ -125,13 +121,7 @@ def writer_thread(zarr_array, q):
write_queue.put((index, acts_np))
index += acts_np.shape[0]

# Flush and stop
# flush and stop
write_queue.join()
write_queue.put(None)
writer.join()

print("✅ All activations written successfully.")

#import zarr
#z = zarr.open(DATA_DIR + '/activations.zarr', mode='r')
#print(z.shape) # should match expected total samples
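The precompute script above streams activations to storage through a bounded queue drained by a dedicated writer thread, with a `None` sentinel for shutdown. A minimal self-contained sketch of that pattern (with an in-memory list standing in for the zarr array, and small lists standing in for activation batches) looks like this:

```python
import queue
import threading

def writer_thread(storage, q):
    """Consume (index, chunk) items from q and write them to storage."""
    while True:
        item = q.get()
        if item is None:  # sentinel: no more work, shut down
            q.task_done()
            break
        index, chunk = item
        try:
            storage.append((index, chunk))
        except Exception as e:  # keep the writer alive on a bad chunk
            print(f"[Writer Error] At index {index}: {e}")
        q.task_done()

storage = []
write_queue = queue.Queue(maxsize=24)  # bounded: applies backpressure to the producer
writer = threading.Thread(target=writer_thread, args=(storage, write_queue))
writer.start()

index = 0
for chunk in (["a"] * 4, ["b"] * 4):  # stand-ins for activation batches
    write_queue.put((index, chunk))   # blocks if the queue is full
    index += len(chunk)

write_queue.join()     # wait until every queued chunk has been written
write_queue.put(None)  # then signal shutdown
writer.join()
print(f"wrote {len(storage)} chunks, {index} samples")
```

The bounded queue is the key design choice: it decouples GPU inference from disk I/O while capping how many batches can pile up in memory if the writer falls behind.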
2 changes: 1 addition & 1 deletion data/000_submit.sh → data/precompute_job.sh
@@ -14,4 +14,4 @@ module load gcc/13.2.0-fasrc01
source ~/.bashrc
mamba activate slot_attention6

python 000_precompute_dinov2_act.py
python precompute_dinov2_act.py
51 changes: 27 additions & 24 deletions docs/index.md
@@ -1,53 +1,56 @@
# Block-Recurrent Dynamics in ViTs

<div align="center">
<img src="raptor_logo.png" width="25%" />
</div>

**Authors:** Mozes Jacobs\*, Thomas Fel\*, Richard Hakim\*, Alessandra Brondetta, Demba Ba, T. Andy Keller
# Block-Recurrent Dynamics in ViTs (Raptor)

<img src="assets/raptor_logo.png" width="33%" />

[![tests](https://github.com/KempnerInstitute/raptor/actions/workflows/tests.yml/badge.svg?branch=main)](https://github.com/KempnerInstitute/raptor/actions/workflows/tests.yml)
[![arXiv](https://img.shields.io/badge/arXiv-1234.56789-b31b1b.svg)](https://arxiv.org/abs/1234.56789)

[**Mozes Jacobs**](https://scholar.google.com/citations?user=8Dm0KfQAAAAJ&hl=en)$^{\star1}$ &nbsp; [**Thomas Fel**](https://thomasfel.me)$^{\star1}$ &nbsp; [**Richard Hakim**](https://richhakim.com/)$^{\star1}$
<br>
[**Alessandra Brondetta**](https://alessandrabrondetta.github.io/)$^{2}$ &nbsp; [**Demba Ba**](https://scholar.google.com/citations?user=qHiACEgAAAAJ&hl=en)$^{1,3}$ &nbsp; [**T. Andy Keller**](https://akandykeller.github.io/)$^{1}$

\* Equal contribution. Correspondence to {mozesjacobs,tfel,rhakim,takeller}@g.harvard.edu
<small>

$^1$**Kempner Institute, Harvard University** &nbsp; $^2$**Osnabrück University** &nbsp; $^3$**Harvard University**

</small>

</div>

---

**Abstract:**
As Vision Transformers (ViTs) become standard backbones across vision, a mechanistic account of their computational phenomenology is now essential.
Despite architectural cues that hint at dynamical structure, there is no settled framework that interprets Transformer depth as a well-characterized flow.
In this work, we introduce the **Block-Recurrent Hypothesis (BRH)**, arguing that trained ViTs admit a block-recurrent depth structure such that the computation of the original L blocks can be accurately rewritten using only k << L distinct blocks applied recurrently.
Across diverse ViTs, between-layer representational similarity matrices suggest a small number of contiguous phases. Yet, representational similarity does not necessarily translate to functional similarity.
To determine whether these phases reflect genuinely reusable computation, we operationalize our hypothesis in the form of block recurrent surrogates of pretrained ViTs, which we call **R**ecurrent **A**pproximations to **P**hase-structured **T**ransf**OR**mers (`Raptor`).
Using small-scale ViTs, we demonstrate that phase-structure metrics correlate with our ability to accurately fit `Raptor`, and identify the role of training and stochastic depth in promoting the recurrent block structure.
We then provide an empirical existence proof for BRH in foundation models by showing that we can train a `Raptor` model to recover $96\%$ of DINOv2 ImageNet-1k linear probe accuracy in only 2 blocks while maintaining equivalent computational cost.
To provide a mechanistic account of these observations, we leverage our hypothesis to develop a program of **Dynamical Interpretability**. We find **(i)** directional convergence into class-dependent angular basins with self-correcting trajectories under small perturbations, **(ii)** token-specific dynamics, where `cls` executes sharp late reorientations while `patch` tokens exhibit strong late-stage coherence reminiscent of a mean-field effect and converge rapidly toward their mean direction, and **(iii)** a collapse of the update to low rank in late depth, consistent with convergence to low-dimensional attractors.<br>
**tl;dr** We introduce the Block-Recurrent Hypothesis (BRH), motivated by the observation that foundation models such as DINOv2 can be rewritten with only two recurrent blocks while recovering 96% of the original accuracy. Building on this framework, we develop a Dynamical Interpretability approach that treats token evolution through layers as trajectories, and we show that these trajectories converge into class-dependent angular basins while late-stage updates collapse onto low-rank attractors.

Altogether, we find that a compact recurrent program emerges along the depth of ViTs, pointing to a low-complexity normative solution that enables these models to be studied through principled dynamical systems analysis.
Ultimately, the study reveals that Vision Transformers seem to converge naturally toward compact, iterative programs rather than unique layer-by-layer transformations, indicating lower algorithmic (Kolmogorov) complexity.

---

## Setup

### Environment
To run the code, you will need to create a mamba (or conda) environment from the `environment.yml` file.
Create and activate the environment with
Create and activate the environment with
```bash
mamba env create -f environment.yml
mamba activate raptor
```

### Paths
Edit src/paths.py to have the correct absolute paths to different datasets.
Edit `src/paths.py` to have the correct absolute paths to different datasets.

### Extracting DINOv2 Activations for ImageNet-1k
For ImageNet, we precompute the DINOv2 activations so that `Raptor` can train faster.
For ImageNet, we precompute the DINOv2 activations so that `Raptor` can train faster.
We provide a script to extract the activations from the ImageNet-1k dataset. This script is available in the `data` directory.
This script takes around 5 hours to run on a single H100 GPU, and the stored activations require substantial disk space.
```bash
cd data
python 000_precompute_dinov2_act.py
python precompute_dinov2_act.py
```

### Download Pretrained Classifiers
Download the DINOv2 linear heads from Meta's [repository](https://github.com/facebookresearch/dinov2).
Download the DINOv2 linear heads from Meta's [repository](https://github.com/facebookresearch/dinov2).
These are used during training of `Raptor`.

```bash
@@ -94,11 +97,11 @@ python train_probe.py --variant raptor3 --model_seed 1101 --seed 6005
## Reproducing Foundation Models Results (Section 3)
To reproduce the results for the foundation models section (Table 1 and Figure 7), do the following:

1. Determine max-cut segmentations. This has been done for you in src/000_max_cut_dinov2_base.ipynb.
1. Determine max-cut segmentations. This has been done for you in src/max_cut_dinov2_base.ipynb.
2. Train each block independently.
```bash
cd src/runs
sbatch 001_blocks.sh
sbatch blocks.sh
```
3. Train the full model with the pretrained blocks.
```bash
@@ -126,4 +129,4 @@ cd src
python aggregate_results.py
```
6. Figure 7
Run the notebook in src/imagenet_probes/101_eval_error_bars.ipynb.
Run the notebook in src/imagenet_probes/101_eval_error_bars.ipynb.
Binary file removed docs/raptor_logo.png