Dave Graham edited this page Mar 27, 2026 · 32 revisions

Autoresearch — Characterization & Experimentation

Apple Silicon port of karpathy/autoresearch with autonomous LLM-driven experiment optimization across multiple training datasets.

Tools

  • TUI Dashboard — Real-time terminal dashboard for monitoring training runs
  • Multi-Dataset Suite — Framework for running experiments across different training datasets

Cross-Dataset Comparison

Five datasets were tested on the same hardware (Apple M5 Max, 64 GB). The architecture converges to AR=32 on all five, and three of the five converge to exactly the same hyperparameters. Both FineWeb-Edu variants diverge from that shared configuration, confirming that the hyperparameter divergence on educational text is data-dependent, not path-dependent.

Read the full analysis →


Stock Baseline Comparison — ClimbMix Complete (4 models)

Ranked by best val_bpb (lower is better): 🥇 Haiku 4.5 (1.2953), 🥈 Sonnet 4.6 (1.3093), 🥉 Opus 4.6 (1.3569), then Sonnet 4.0 (1.3588). The cheapest model wins and the most expensive finishes third. Rankings correlate perfectly with gradient steps (architecture discovery), not with model cost or capability.

Read the full 4-model comparison →
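
Because val_bpb is a loss-style metric, the ranking above is an ascending sort. The following is a minimal sketch of that ordering using only the four scores quoted on this page; the names `results` and `rank_by_val_bpb` are illustrative and are not part of the autoresearch code.

```python
# Best val_bpb per model, copied from the stock-baseline summary on this page.
# Lower is better.
results = {
    "Opus 4.6": 1.3569,
    "Sonnet 4.6": 1.3093,
    "Sonnet 4.0": 1.3588,
    "Haiku 4.5": 1.2953,
}


def rank_by_val_bpb(scores):
    """Return (model, val_bpb) pairs sorted best-first (ascending val_bpb)."""
    return sorted(scores.items(), key=lambda item: item[1])


ranking = rank_by_val_bpb(results)
best_model, best_bpb = ranking[0]

for place, (model, bpb) in enumerate(ranking, start=1):
    gap = bpb - best_bpb  # distance from the best score, in val_bpb
    print(f"{place}. {model}: {bpb:.4f} (+{gap:.4f} vs best)")
```

The printed gaps make it easier to see that the top two models are close to each other, while the bottom two sit well behind.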


Autonomous Agent Runs

Stock Baseline Runs (Fair Cross-Model Comparison)

| Date | Run | Chip | Dataset | Experiments | Best val_bpb | LLM |
|---|---|---|---|---|---|---|
| Mar 27, 2026 | ClimbMix (100 experiments) | Apple M5 Max (64 GB) | climbmix-400b | 100 | 1.357 | Opus 4.6 |
| Mar 26, 2026 | ClimbMix (100 experiments) | Apple M5 Max (64 GB) | climbmix-400b | 100 | 1.309 | Sonnet 4.6 |
| Mar 26, 2026 | ClimbMix (100 experiments) | Apple M5 Max (64 GB) | climbmix-400b | 100 | 1.359 | Sonnet 4.0 |
| Mar 25, 2026 | ClimbMix ⭐ (100 experiments) | Apple M5 Max (64 GB) | climbmix-400b | 100 | 1.295 | Haiku 4.5 |

⭐ Haiku wins ClimbMix on val_bpb (lower is better): Haiku 4.5 1.2953 < Sonnet 4.6 1.3093 < Opus 4.6 1.3569 < Sonnet 4.0 1.3588. The cheapest model beats the most expensive.

Sonnet 4.6 (Cross-Generation Comparison)

| Date | Run | Chip | Dataset | Experiments | Best val_bpb | LLM |
|---|---|---|---|---|---|---|
| Mar 25, 2026 | FineWeb-Edu-High (101 experiments) | Apple M5 Max (64 GB) | FineWeb-Edu-High | 101 | 1.335 | Sonnet 4.6 |
| Mar 25, 2026 | SlimPajama (101 experiments) | Apple M5 Max (64 GB) | SlimPajama | 101 | 1.527 | Sonnet 4.6 |
| Mar 24, 2026 | Cosmopedia-v2 (101 experiments) | Apple M5 Max (64 GB) | Cosmopedia-v2 | 101 | 0.955 | Sonnet 4.6 |
| Mar 22, 2026 | FineWeb-Edu (100 experiments) | Apple M5 Max (64 GB) | FineWeb-Edu 10BT | 100 | 1.342 | Sonnet 4.6 |
| Mar 22, 2026 | ClimbMix (119 experiments) | Apple M5 Max (64 GB) | climbmix-400b | 119 | 1.300 | Sonnet 4.6 |

Sonnet 4.0 (Original Suite)

| Date | Run | Chip | Dataset | Experiments | Best val_bpb | LLM |
|---|---|---|---|---|---|---|
| Mar 21, 2026 | FineWeb-Edu-High (101 experiments) | Apple M5 Max (64 GB) | FineWeb-Edu-High | 101 | 1.346 | Sonnet 4.0 |
| Mar 20, 2026 | SlimPajama (101 experiments) | Apple M5 Max (64 GB) | SlimPajama | 101 | 1.526 | Sonnet 4.0 |
| Mar 20, 2026 | Cosmopedia-v2 (103 experiments) | Apple M5 Max (64 GB) | Cosmopedia-v2 | 103 | 0.961 | Sonnet 4.0 |
| Mar 19, 2026 | ClimbMix (101 experiments) | Apple M5 Max (64 GB) | climbmix-400b | 101 | 1.296 | Sonnet 4.0 |
| Mar 17, 2026 | FineWeb-Edu (88 experiments) | Apple M5 Max (64 GB) | FineWeb-Edu 10BT | 88 | 1.342 | Sonnet 4.0 |
| Mar 16, 2026 | ClimbMix (81 experiments) | Apple M5 Max (64 GB) | climbmix-400b | 81 | 1.335 | Sonnet 4.0 |
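
The two Sonnet suites above can be compared dataset by dataset. The sketch below computes the change in best val_bpb from Sonnet 4.0 to Sonnet 4.6 using only the values listed in those tables. It assumes that, where a model has more than one run on a dataset (Sonnet 4.0 on climbmix-400b), the lower score is the one to compare; that choice and all names in the code are illustrative, not taken from the project.

```python
# (dataset, best val_bpb) per run, copied from the two Sonnet tables on this page.
sonnet_46_runs = [
    ("FineWeb-Edu-High", 1.335),
    ("SlimPajama", 1.527),
    ("Cosmopedia-v2", 0.955),
    ("FineWeb-Edu 10BT", 1.342),
    ("climbmix-400b", 1.300),
]
sonnet_40_runs = [
    ("FineWeb-Edu-High", 1.346),
    ("SlimPajama", 1.526),
    ("Cosmopedia-v2", 0.961),
    ("climbmix-400b", 1.296),
    ("FineWeb-Edu 10BT", 1.342),
    ("climbmix-400b", 1.335),
]


def best_per_dataset(runs):
    """Keep the lowest val_bpb seen for each dataset (assumption: best run counts)."""
    best = {}
    for dataset, bpb in runs:
        best[dataset] = min(bpb, best.get(dataset, float("inf")))
    return best


best_46 = best_per_dataset(sonnet_46_runs)
best_40 = best_per_dataset(sonnet_40_runs)

# Negative delta: Sonnet 4.6 reached a lower (better) val_bpb than Sonnet 4.0.
deltas = {d: round(best_46[d] - best_40[d], 3) for d in best_46}

for dataset, delta in sorted(deltas.items(), key=lambda item: item[1]):
    print(f"{dataset:18s} {best_40[dataset]:.3f} -> {best_46[dataset]:.3f} ({delta:+.3f})")
```

Under these assumptions, a negative delta means the newer model reached a lower val_bpb on that dataset. Differences of a few thousandths may be within run-to-run noise, which these tables do not quantify.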

Experiment Logs

| Date | Run | Chip | Best val_bpb | Branch |
|---|---|---|---|---|
| Mar 15, 2026 | M5 Max | Apple M5 Max (64 GB) | 1.320 | autoresearch/mar14-m5max |
| Mar 14, 2026 | M4 Pro | Apple M4 Pro (24 GB) | 1.429 | autoresearch/mar14 |
| Mar 11, 2026 | M1 Max | Apple M1 Max (64 GB) | 1.621 | autoresearch/mar11 |

