Merged
6 changes: 6 additions & 0 deletions .bumpversion.cfg
@@ -0,0 +1,6 @@
[bumpversion]
current_version = 0.0.0
commit = True
tag = True

[bumpversion:file:raptor/__init__.py]
39 changes: 39 additions & 0 deletions .github/workflows/docs.yml
@@ -0,0 +1,39 @@
name: docs

on:
push:
branches: ["main"]
workflow_dispatch:

permissions:
contents: read
pages: write
id-token: write

concurrency:
group: "pages"
cancel-in-progress: false

jobs:
build:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/configure-pages@v5
- uses: actions/jekyll-build-pages@v1
with:
source: ./docs
destination: ./_site
- uses: actions/upload-pages-artifact@v3
with:
path: ./_site

deploy:
runs-on: ubuntu-latest
needs: build
environment:
name: github-pages
url: ${{ steps.deployment.outputs.page_url }}
steps:
- id: deployment
uses: actions/deploy-pages@v4
20 changes: 20 additions & 0 deletions .github/workflows/tests.yml
@@ -0,0 +1,20 @@
name: tests

on:
push:
pull_request:

jobs:
pytest:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: "3.11"
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install tox torch numpy
- name: Run tests
run: tox -q
1 change: 0 additions & 1 deletion .gitignore
@@ -29,7 +29,6 @@ build/
*.sqlite

# Data (common in research repos)
data/
outputs/
wandb/
models/
26 changes: 26 additions & 0 deletions .pylintrc
@@ -0,0 +1,26 @@
[MASTER]
disable=
R0903, # allow classes that expose only one public method
R0914, # allow many local variables
E0401, # pending issue with pylint see pylint#2603
E1123, # issues between pylint and tensorflow since 2.2.0
E1120, # see pylint#3613
C3001, # lambda function as variable
C0116, C0114, # docstring
C0103, # we refer and define scientific notation
R1721, W0107, # use pass for abstract class
E1102, # pylint false positive when .to() is used

[FORMAT]
max-line-length=120

[DESIGN]
max-args=12
max-attributes=12

[SIMILARITIES]
min-similarity-lines=6
ignore-comments=yes
ignore-docstrings=yes
ignore-imports=no

[TYPECHECK]
ignored-modules=torch, cv2
51 changes: 27 additions & 24 deletions README.md
@@ -1,53 +1,56 @@
# Block-Recurrent Dynamics in ViTs

<div align="center">
<img src="docs/raptor_logo.png" width="25%" />
</div>

**Authors:** Mozes Jacobs\*, Thomas Fel\*, Richard Hakim\*, Alessandra Brondetta, Demba Ba, T. Andy Keller
# Block-Recurrent Dynamics in ViTs (Raptor)

<img src="assets/raptor_logo.png" width="33%" />

[![tests](https://github.com/KempnerInstitute/raptor/actions/workflows/tests.yml/badge.svg?branch=main)](https://github.com/KempnerInstitute/raptor/actions/workflows/tests.yml)
[![arXiv](https://img.shields.io/badge/arXiv-1234.56789-b31b1b.svg)](https://arxiv.org/abs/1234.56789)

[**Mozes Jacobs**](https://scholar.google.com/citations?user=8Dm0KfQAAAAJ&hl=en)$^{\star1}$ &nbsp; [**Thomas Fel**](https://thomasfel.me)$^{\star1}$ &nbsp; [**Richard Hakim**](https://richhakim.com/)$^{\star1}$
<br>
[**Alessandra Brondetta**](https://alessandrabrondetta.github.io/)$^{2}$ &nbsp; [**Demba Ba**](https://scholar.google.com/citations?user=qHiACEgAAAAJ&hl=en)$^{1,3}$ &nbsp; [**T. Andy Keller**](https://akandykeller.github.io/)$^{1}$

\* Equal contribution. Correspondence to {mozesjacobs,tfel,rhakim,takeller}@g.harvard.edu
<small>

$^1$**Kempner Institute, Harvard University** &nbsp; $^2$**Osnabrück University** &nbsp; $^3$**Harvard University**

</small>

</div>

---

**Abstract:**
As Vision Transformers (ViTs) become standard backbones across vision, a mechanistic account of their computational phenomenology is now essential.
Despite architectural cues that hint at dynamical structure, there is no settled framework that interprets Transformer depth as a well-characterized flow.
In this work, we introduce the **Block-Recurrent Hypothesis (BRH)**, arguing that trained ViTs admit a block-recurrent depth structure such that the computation of the original L blocks can be accurately rewritten using only k << L distinct blocks applied recurrently.
Across diverse ViTs, between-layer representational similarity matrices suggest a small number of contiguous phases. Yet, representational similarity does not necessarily translate to functional similarity.
To determine whether these phases reflect genuinely reusable computation, we operationalize our hypothesis in the form of block recurrent surrogates of pretrained ViTs, which we call **R**ecurrent **A**pproximations to **P**hase-structured **T**ransf**OR**mers (`Raptor`).
Using small-scale ViTs, we demonstrate that phase-structure metrics correlate with our ability to accurately fit `Raptor`, and identify the role of training and stochastic depth in promoting the recurrent block structure.
We then provide an empirical existence proof for BRH in foundation models by showing that we can train a `Raptor` model to recover $96\%$ of DINOv2 ImageNet-1k linear probe accuracy in only 2 blocks while maintaining equivalent computational cost.
To provide a mechanistic account of these observations, we leverage our hypothesis to develop a program of **Dynamical Interpretability**. We find **(i)** directional convergence into class-dependent angular basins with self-correcting trajectories under small perturbations, **(ii)** token-specific dynamics, where `cls` executes sharp late reorientations while `patch` tokens exhibit strong late-stage coherence reminiscent of a mean-field effect and converge rapidly toward their mean direction, and **(iii)** a collapse of the update to low rank in late depth, consistent with convergence to low-dimensional attractors.<br>
**tl;dr** We introduce the Block-Recurrent Hypothesis (BRH), motivated by the observation that foundation models such as DINOv2 can be rewritten with only two recurrent blocks while recovering 96% of the original accuracy. Building on this framework, we develop a Dynamical Interpretability approach that treats token evolution through layers as trajectories, and we show that these trajectories converge into class-dependent angular basins while late-stage updates collapse onto low-rank attractors.

Altogether, we find that a compact recurrent program emerges along the depth of ViTs, pointing to a low-complexity normative solution that enables these models to be studied through principled dynamical systems analysis.
Ultimately, the study reveals that Vision Transformers seem to converge naturally toward compact, iterative programs rather than unique layer-by-layer transformations, indicating lower algorithmic (Kolmogorov) complexity.

---

## Setup

### Environment
To run the code, you will need to create a mamba (or conda) environment from the `environment.yml` file.
Create and activate the environment with
Create and activate the environment with
```bash
mamba env create -f environment.yml
mamba activate raptor
```

### Paths
Edit src/paths.py to have the correct absolute paths to different datasets.
Edit `src/paths.py` to have the correct absolute paths to different datasets.

### Extracting DINOv2 Activations for ImageNet-1k
For ImageNet, we precompute the DINOv2 activations so that `Raptor` can train faster.
For ImageNet, we precompute the DINOv2 activations so that `Raptor` can train faster.
We provide a script to extract the activations from the ImageNet-1k dataset. This script is available in the `data` directory.
This script takes around 5 hours to run on a single H100 GPU, and the stored activations require substantial disk space.
```bash
cd data
python 000_precompute_dinov2_act.py
python precompute_dinov2_act.py
```

### Download Pretrained Classifiers
Download the DINOv2 linear heads from Meta's [repository](https://github.com/facebookresearch/dinov2).
Download the DINOv2 linear heads from Meta's [repository](https://github.com/facebookresearch/dinov2).
These are used during training of `Raptor`.

```bash
@@ -94,11 +97,11 @@ python train_probe.py --variant raptor3 --model_seed 1101 --seed 6005
## Reproducing Foundation Models Results (Section 3)
To reproduce the results for the foundation models section (Table 1 and Figure 7), do the following:

1. Determine max-cut segmentations. This has been done for you in src/000_max_cut_dinov2_base.ipynb.
1. Determine max-cut segmentations. This has been done for you in src/max_cut_dinov2_base.ipynb.
2. Train each block independently.
```bash
cd src/runs
sbatch 001_blocks.sh
sbatch blocks.sh
```
3. Train the full model with the pretrained blocks.
```bash
@@ -126,4 +129,4 @@ cd src
python aggregate_results.py
```
6. Figure 7
Run the notebook in src/imagenet_probes/101_eval_error_bars.ipynb.
Run the notebook in src/imagenet_probes/101_eval_error_bars.ipynb.
Binary file added assets/raptor_logo.png
30 changes: 10 additions & 20 deletions data/000_precompute_dinov2_act.py → data/precompute_dinov2_act.py
@@ -1,3 +1,9 @@
from numcodecs import Blosc
import zarr
from tqdm import tqdm
import threading
import queue
from paths import IMAGENET_TRAIN_DIR, DATA_DIR
import os
import random
import numpy as np
@@ -10,7 +16,6 @@

import sys
sys.path.append("../src/")
from paths import IMAGENET_TRAIN_DIR, DATA_DIR

IMAGENET_DEFAULT_MEAN = (0.485, 0.456, 0.406)
IMAGENET_DEFAULT_STD = (0.229, 0.224, 0.225)
@@ -46,6 +51,7 @@
num_workers=4, pin_memory=True, persistent_workers=True, prefetch_factor=2
)


def inference(dino, x):
with torch.no_grad():
x = x.cuda().float()
@@ -60,16 +66,7 @@ def inference(dino, x):

return activations.transpose(0, 1)

import os
import queue
import threading
import numpy as np
import torch
from tqdm import tqdm
import zarr
from numcodecs import Blosc

# Zarr parameters
num_samples = len(dataset)
num_layers = 13
num_tokens = 261
@@ -88,9 +85,9 @@ def inference(dino, x):
zarr_version=2
)

# Writer queue
write_queue = queue.Queue(maxsize=24)


def writer_thread(zarr_array, q):
while True:
item = q.get()
@@ -108,11 +105,10 @@ def writer_thread(zarr_array, q):
print(f"[Writer Error] At index {index}: {e}")
q.task_done()

# Launch background writer thread

writer = threading.Thread(target=writer_thread, args=(z, write_queue))
writer.start()

# Main loop
index = 0
for batch in tqdm(dataloader, desc="Writing activations to Zarr"):
x, _ = batch
@@ -125,13 +121,7 @@ def writer_thread(zarr_array, q):
write_queue.put((index, acts_np))
index += acts_np.shape[0]

# Flush and stop
# flush and stop
write_queue.join()
write_queue.put(None)
writer.join()

print("✅ All activations written successfully.")

#import zarr
#z = zarr.open(DATA_DIR + '/activations.zarr', mode='r')
#print(z.shape) # should match expected total samples
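The precompute script above streams activations to storage through a bounded queue drained by a dedicated writer thread, with a `None` sentinel for shutdown. A minimal self-contained sketch of that pattern (with an in-memory list standing in for the zarr array, and small lists standing in for activation batches) looks like this:

```python
import queue
import threading

def writer_thread(storage, q):
    """Consume (index, chunk) items from q and write them to storage."""
    while True:
        item = q.get()
        if item is None:  # sentinel: no more work, shut down
            q.task_done()
            break
        index, chunk = item
        try:
            storage.append((index, chunk))
        except Exception as e:  # keep the writer alive on a bad chunk
            print(f"[Writer Error] At index {index}: {e}")
        q.task_done()

storage = []
write_queue = queue.Queue(maxsize=24)  # bounded: applies backpressure to the producer
writer = threading.Thread(target=writer_thread, args=(storage, write_queue))
writer.start()

index = 0
for chunk in (["a"] * 4, ["b"] * 4):  # stand-ins for activation batches
    write_queue.put((index, chunk))   # blocks if the queue is full
    index += len(chunk)

write_queue.join()     # wait until every queued chunk has been written
write_queue.put(None)  # then signal shutdown
writer.join()
print(f"wrote {len(storage)} chunks, {index} samples")
```

The bounded queue is the key design choice: it decouples GPU inference from disk I/O while capping how many batches can pile up in memory if the writer falls behind.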
2 changes: 1 addition & 1 deletion data/000_submit.sh → data/precompute_job.sh
@@ -14,4 +14,4 @@ module load gcc/13.2.0-fasrc01
source ~/.bashrc
mamba activate slot_attention6

python 000_precompute_dinov2_act.py
python precompute_dinov2_act.py
51 changes: 27 additions & 24 deletions docs/index.md
@@ -1,53 +1,56 @@
# Block-Recurrent Dynamics in ViTs

<div align="center">
<img src="raptor_logo.png" width="25%" />
</div>

**Authors:** Mozes Jacobs\*, Thomas Fel\*, Richard Hakim\*, Alessandra Brondetta, Demba Ba, T. Andy Keller
# Block-Recurrent Dynamics in ViTs (Raptor)

<img src="assets/raptor_logo.png" width="33%" />

[![tests](https://github.com/KempnerInstitute/raptor/actions/workflows/tests.yml/badge.svg?branch=main)](https://github.com/KempnerInstitute/raptor/actions/workflows/tests.yml)
[![arXiv](https://img.shields.io/badge/arXiv-1234.56789-b31b1b.svg)](https://arxiv.org/abs/1234.56789)

[**Mozes Jacobs**](https://scholar.google.com/citations?user=8Dm0KfQAAAAJ&hl=en)$^{\star1}$ &nbsp; [**Thomas Fel**](https://thomasfel.me)$^{\star1}$ &nbsp; [**Richard Hakim**](https://richhakim.com/)$^{\star1}$
<br>
[**Alessandra Brondetta**](https://alessandrabrondetta.github.io/)$^{2}$ &nbsp; [**Demba Ba**](https://scholar.google.com/citations?user=qHiACEgAAAAJ&hl=en)$^{1,3}$ &nbsp; [**T. Andy Keller**](https://akandykeller.github.io/)$^{1}$

\* Equal contribution. Correspondence to {mozesjacobs,tfel,rhakim,takeller}@g.harvard.edu
<small>

$^1$**Kempner Institute, Harvard University** &nbsp; $^2$**Osnabrück University** &nbsp; $^3$**Harvard University**

</small>

</div>

---

**Abstract:**
As Vision Transformers (ViTs) become standard backbones across vision, a mechanistic account of their computational phenomenology is now essential.
Despite architectural cues that hint at dynamical structure, there is no settled framework that interprets Transformer depth as a well-characterized flow.
In this work, we introduce the **Block-Recurrent Hypothesis (BRH)**, arguing that trained ViTs admit a block-recurrent depth structure such that the computation of the original L blocks can be accurately rewritten using only k << L distinct blocks applied recurrently.
Across diverse ViTs, between-layer representational similarity matrices suggest a small number of contiguous phases. Yet, representational similarity does not necessarily translate to functional similarity.
To determine whether these phases reflect genuinely reusable computation, we operationalize our hypothesis in the form of block recurrent surrogates of pretrained ViTs, which we call **R**ecurrent **A**pproximations to **P**hase-structured **T**ransf**OR**mers (`Raptor`).
Using small-scale ViTs, we demonstrate that phase-structure metrics correlate with our ability to accurately fit `Raptor`, and identify the role of training and stochastic depth in promoting the recurrent block structure.
We then provide an empirical existence proof for BRH in foundation models by showing that we can train a `Raptor` model to recover $96\%$ of DINOv2 ImageNet-1k linear probe accuracy in only 2 blocks while maintaining equivalent computational cost.
To provide a mechanistic account of these observations, we leverage our hypothesis to develop a program of **Dynamical Interpretability**. We find **(i)** directional convergence into class-dependent angular basins with self-correcting trajectories under small perturbations, **(ii)** token-specific dynamics, where `cls` executes sharp late reorientations while `patch` tokens exhibit strong late-stage coherence reminiscent of a mean-field effect and converge rapidly toward their mean direction, and **(iii)** a collapse of the update to low rank in late depth, consistent with convergence to low-dimensional attractors.<br>
**tl;dr** We introduce the Block-Recurrent Hypothesis (BRH), motivated by the observation that foundation models such as DINOv2 can be rewritten with only two recurrent blocks while recovering 96% of the original accuracy. Building on this framework, we develop a Dynamical Interpretability approach that treats token evolution through layers as trajectories, and we show that these trajectories converge into class-dependent angular basins while late-stage updates collapse onto low-rank attractors.

Altogether, we find that a compact recurrent program emerges along the depth of ViTs, pointing to a low-complexity normative solution that enables these models to be studied through principled dynamical systems analysis.
Ultimately, the study reveals that Vision Transformers seem to converge naturally toward compact, iterative programs rather than unique layer-by-layer transformations, indicating lower algorithmic (Kolmogorov) complexity.

---

## Setup

### Environment
To run the code, you will need to create a mamba (or conda) environment from the `environment.yml` file.
Create and activate the environment with
Create and activate the environment with
```bash
mamba env create -f environment.yml
mamba activate raptor
```

### Paths
Edit src/paths.py to have the correct absolute paths to different datasets.
Edit `src/paths.py` to have the correct absolute paths to different datasets.

### Extracting DINOv2 Activations for ImageNet-1k
For ImageNet, we precompute the DINOv2 activations so that `Raptor` can train faster.
For ImageNet, we precompute the DINOv2 activations so that `Raptor` can train faster.
We provide a script to extract the activations from the ImageNet-1k dataset. This script is available in the `data` directory.
This script takes around 5 hours to run on a single H100 GPU, and the stored activations require substantial disk space.
```bash
cd data
python 000_precompute_dinov2_act.py
python precompute_dinov2_act.py
```

### Download Pretrained Classifiers
Download the DINOv2 linear heads from Meta's [repository](https://github.com/facebookresearch/dinov2).
Download the DINOv2 linear heads from Meta's [repository](https://github.com/facebookresearch/dinov2).
These are used during training of `Raptor`.

```bash
@@ -94,11 +97,11 @@ python train_probe.py --variant raptor3 --model_seed 1101 --seed 6005
## Reproducing Foundation Models Results (Section 3)
To reproduce the results for the foundation models section (Table 1 and Figure 7), do the following:

1. Determine max-cut segmentations. This has been done for you in src/000_max_cut_dinov2_base.ipynb.
1. Determine max-cut segmentations. This has been done for you in src/max_cut_dinov2_base.ipynb.
2. Train each block independently.
```bash
cd src/runs
sbatch 001_blocks.sh
sbatch blocks.sh
```
3. Train the full model with the pretrained blocks.
```bash
@@ -126,4 +129,4 @@ cd src
python aggregate_results.py
```
6. Figure 7
Run the notebook in src/imagenet_probes/101_eval_error_bars.ipynb.
Run the notebook in src/imagenet_probes/101_eval_error_bars.ipynb.
Binary file removed docs/raptor_logo.png