Home

Apple Silicon port of karpathy/autoresearch with autonomous LLM-driven experiment optimization across multiple training datasets.
- TUI Dashboard — Real-time terminal dashboard for monitoring training runs
- Multi-Dataset Suite — Framework for running experiments across different training datasets
Five datasets were tested on the same hardware (Apple M5 Max, 64 GB). The architecture converges to AR=32 on all five, and three of the five converge to exactly the same hyperparameters. Both FineWeb-Edu variants diverge from that shared configuration, which indicates that the hyperparameter divergence on educational text is data-dependent, not path-dependent.
🥇 Haiku 4.5 (1.2953) > 🥈 Sonnet 4.6 (1.3093) > 🥉 Opus 4.6 (1.3569) > Sonnet 4.0 (1.3588), ranked by best val_bpb on ClimbMix (lower is better). The cheapest model wins and the most expensive finishes third. The ranking correlates perfectly with gradient steps (architecture discovery), not with model cost or capability.
Read the full 4-model comparison →
| Date | Chip | Dataset | Experiments | Best val_bpb | LLM |
|---|---|---|---|---|---|
| Mar 27, 2026 — ClimbMix (100 experiments) | Apple M5 Max (64 GB) | climbmix-400b | 100 | 1.357 | Opus 4.6 |
| Mar 26, 2026 — ClimbMix (100 experiments) | Apple M5 Max (64 GB) | climbmix-400b | 100 | 1.309 | Sonnet 4.6 |
| Mar 26, 2026 — ClimbMix (100 experiments) | Apple M5 Max (64 GB) | climbmix-400b | 100 | 1.359 | Sonnet 4.0 |
| Mar 25, 2026 — ClimbMix ⭐ (100 experiments) | Apple M5 Max (64 GB) | climbmix-400b | 100 | 1.295 | Haiku 4.5 |
⭐ Haiku 4.5 wins ClimbMix (lower val_bpb is better): Haiku 4.5 1.2953, Sonnet 4.6 1.3093, Opus 4.6 1.3569, Sonnet 4.0 1.3588. The cheapest model beats the most expensive.
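
Since a lower val_bpb means a better run, ranking the models is an ascending sort. A minimal sketch, using only the four ClimbMix scores quoted above (the variable names are illustrative and not part of the project), that also reports each model's gap to the best score:

```python
# Best val_bpb per LLM on ClimbMix, copied from the table above.
best_val_bpb = {
    "Haiku 4.5": 1.2953,
    "Sonnet 4.6": 1.3093,
    "Opus 4.6": 1.3569,
    "Sonnet 4.0": 1.3588,
}

# Lower val_bpb is better, so sort ascending by score.
ranking = sorted(best_val_bpb.items(), key=lambda item: item[1])
best_score = ranking[0][1]

for place, (model, score) in enumerate(ranking, start=1):
    # Relative gap to the winning score, in percent.
    gap_pct = 100.0 * (score - best_score) / best_score
    print(f"{place}. {model}: {score:.4f} (+{gap_pct:.2f}% vs best)")
```
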
| Date | Chip | Dataset | Experiments | Best val_bpb | LLM |
|---|---|---|---|---|---|
| Mar 25, 2026 — FineWeb-Edu-High (101 experiments) | Apple M5 Max (64 GB) | FineWeb-Edu-High | 101 | 1.335 | Sonnet 4.6 |
| Mar 25, 2026 — SlimPajama (101 experiments) | Apple M5 Max (64 GB) | SlimPajama | 101 | 1.527 | Sonnet 4.6 |
| Mar 24, 2026 — Cosmopedia-v2 (101 experiments) | Apple M5 Max (64 GB) | Cosmopedia-v2 | 101 | 0.955 | Sonnet 4.6 |
| Mar 22, 2026 — FineWeb-Edu (100 experiments) | Apple M5 Max (64 GB) | FineWeb-Edu 10BT | 100 | 1.342 | Sonnet 4.6 |
| Mar 22, 2026 — ClimbMix (119 experiments) | Apple M5 Max (64 GB) | climbmix-400b | 119 | 1.300 | Sonnet 4.6 |
| Date | Chip | Dataset | Experiments | Best val_bpb | LLM |
|---|---|---|---|---|---|
| Mar 21, 2026 — FineWeb-Edu-High (101 experiments) | Apple M5 Max (64 GB) | FineWeb-Edu-High | 101 | 1.346 | Sonnet 4.0 |
| Mar 20, 2026 — SlimPajama (101 experiments) | Apple M5 Max (64 GB) | SlimPajama | 101 | 1.526 | Sonnet 4.0 |
| Mar 20, 2026 — Cosmopedia-v2 (103 experiments) | Apple M5 Max (64 GB) | Cosmopedia-v2 | 103 | 0.961 | Sonnet 4.0 |
| Mar 19, 2026 — ClimbMix (101 experiments) | Apple M5 Max (64 GB) | climbmix-400b | 101 | 1.296 | Sonnet 4.0 |
| Mar 17, 2026 — FineWeb-Edu (88 experiments) | Apple M5 Max (64 GB) | FineWeb-Edu 10BT | 88 | 1.342 | Sonnet 4.0 |
| Mar 16, 2026 — ClimbMix (81 experiments) | Apple M5 Max (64 GB) | climbmix-400b | 81 | 1.335 | Sonnet 4.0 |
| Date | Chip | Best val_bpb | Branch |
|---|---|---|---|
| Mar 15, 2026 — M5 Max | Apple M5 Max (64 GB) | 1.320 | autoresearch/mar14-m5max |
| Mar 14, 2026 — M4 Pro | Apple M4 Pro (24 GB) | 1.429 | autoresearch/mar14 |
| Mar 11, 2026 — M1 Max | Apple M1 Max (64 GB) | 1.621 | autoresearch/mar11 |
- karpathy/autoresearch PR #303: "Evaluating Experiment Results at Scale" by Dean Sharon. A guide to noise floor estimation, Pareto efficiency, and reading results.tsv at scale. Adapted for Apple Silicon in docs/evaluating-results.md.
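
The two ideas named in that guide can be illustrated with a small sketch. It is not taken from the PR or from docs/evaluating-results.md. It assumes results.tsv is tab-separated with `commit`, `val_bpb`, `memory_gb`, and `description` columns, and it uses made-up rows. The noise floor is estimated here as the sample standard deviation of repeated baseline runs, which is one common choice and not necessarily the guide's method. The Pareto front keeps runs that no other run beats on both val_bpb and memory.

```python
import csv
import io
import statistics

# Hypothetical results.tsv contents. Column names and all values are
# illustrative assumptions, not data from any run listed on this page.
SAMPLE_TSV = (
    "commit\tval_bpb\tmemory_gb\tdescription\n"
    "a1\t1.400\t20.0\tbaseline\n"
    "a2\t1.404\t20.0\tbaseline\n"
    "a3\t1.398\t20.0\tbaseline\n"
    "b1\t1.350\t30.0\twider model\n"
    "b2\t1.360\t36.0\tdeeper model\n"
    "b3\t1.399\t18.0\tsmaller batch\n"
)


def load_rows(text):
    """Parse tab-separated text into dicts with numeric metric fields."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [
        {**row, "val_bpb": float(row["val_bpb"]), "memory_gb": float(row["memory_gb"])}
        for row in reader
    ]


def noise_floor(rows, label="baseline"):
    """Sample standard deviation of val_bpb over repeated runs of one config."""
    values = [r["val_bpb"] for r in rows if r["description"] == label]
    return statistics.stdev(values)


def pareto_front(rows):
    """Rows not dominated on (val_bpb, memory_gb); lower is better for both."""
    front = []
    for r in rows:
        dominated = any(
            o["val_bpb"] <= r["val_bpb"]
            and o["memory_gb"] <= r["memory_gb"]
            and (o["val_bpb"] < r["val_bpb"] or o["memory_gb"] < r["memory_gb"])
            for o in rows
        )
        if not dominated:
            front.append(r)
    return sorted(front, key=lambda r: r["val_bpb"])


rows = load_rows(SAMPLE_TSV)
floor = noise_floor(rows)
front_commits = [r["commit"] for r in pareto_front(rows)]

print(f"noise floor (stdev of baseline val_bpb): {floor:.4f}")
print("Pareto-efficient commits:", front_commits)
```

With a real file, `load_rows(open("results.tsv").read())` would replace the sample string, provided the column names match. An improvement smaller than the noise floor is hard to distinguish from run-to-run variation, so it is worth repeating before it is counted as a win.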
