
feat: MPS/CPU support for Apple Silicon (M3 Max)#362

Closed
jeong-sik wants to merge 3 commits into karpathy:master from jeong-sik:feature/mps-support

Conversation

@jeong-sik

Summary

Adapts Karpathy's autoresearch to run in environments without CUDA (Apple Silicon MPS, CPU).

  • FlashAttention 3 → PyTorch SDPA fallback (automatic when CUDA is unavailable)
  • Conditional torch.compile (CUDA only; disabled on MPS/CPU)
  • Reduced configuration for MPS: depth 8→6, batch 128→32, total_batch 2^19→2^16
  • Eval optimization: EVAL_TOKENS 21M→4.2M on MPS (eval time 640s→120s)
  • Split the kernels package out as an optional dependency
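The device fallback and conditional compilation described above can be sketched roughly as follows; `pick_device` and `maybe_compile` are illustrative names, not the PR's actual API:

```python
import torch


def pick_device() -> torch.device:
    # Auto-detection priority: CUDA > MPS > CPU.
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")


def maybe_compile(model: torch.nn.Module, device: torch.device) -> torch.nn.Module:
    # torch.compile is enabled only on CUDA; MPS/CPU run eager.
    if device.type == "cuda":
        return torch.compile(model)
    return model
```

On a non-CUDA machine `maybe_compile` returns the model unchanged, which is what keeps the original CUDA behavior intact.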

Baseline (M3 Max, MPS)

| Metric           | Value    |
| ---------------- | -------- |
| val_bpb          | 1.697931 |
| training_seconds | 300.9    |
| peak_vram_mb     | 273.8    |
| mfu_percent      | 16.44    |
| num_params_M     | 26.3     |
| steps            | 87       |

Test plan

  • uv run train.py runs to completion (87 steps, 5-minute budget)
  • val_bpb measured successfully
  • Verified no regression in CUDA environments (original behavior preserved)

🤖 Generated with Claude Code

jeong-sik and others added 3 commits March 21, 2026 18:24
- Device auto-detection (CUDA > MPS > CPU)
- FA3 -> PyTorch SDPA fallback when kernels unavailable
- Conditional torch.compile (CUDA only)
- float32 on MPS (MPSGraph rejects mixed bf16/f32 ops)
- Device-agnostic dataloader and evaluation
- Adjusted hyperparameters for non-H100: DEPTH=6, batch=32, pattern="L"

M3 Max baseline: val_bpb=1.781, 66 steps/5min, MFU=11.9%

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
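The FA3 → SDPA fallback from this commit can be sketched as below; `attention` and the `use_fa3` flag are illustrative stand-ins for the PR's actual wiring, which dispatches based on kernel availability:

```python
import torch
import torch.nn.functional as F


def attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
              use_fa3: bool) -> torch.Tensor:
    # When the FA3 kernels are unavailable (non-CUDA), fall back to
    # PyTorch's built-in scaled dot-product attention.
    if use_fa3:
        raise NotImplementedError("FA3 kernel path (CUDA only)")
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)
```

Note that SDPA's `is_causal` path gives full causal attention only; sliding-window masking is not applied here, which is exactly the gap the third commit's warning addresses.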
MPS eval was taking 9.4 minutes (longer than 5-min training budget).
Now auto-sets AUTORESEARCH_EVAL_TOKENS=4194304 on non-CUDA devices.
Override via env var if different eval budget needed.

Expected eval time: ~120s (from ~640s).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
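The eval-budget defaulting described in this commit reduces to a small decision, sketched here; `default_eval_tokens` is a hypothetical helper and the 21M CUDA default is approximated from the PR description:

```python
import os


def default_eval_tokens(device_type: str) -> int:
    # Explicit env override always wins, so any eval budget can be forced.
    override = os.environ.get("AUTORESEARCH_EVAL_TOKENS")
    if override is not None:
        return int(override)
    # Non-CUDA devices get the reduced 4,194,304-token budget (~120s on
    # M3 Max); CUDA keeps the original ~21M tokens (illustrative constant).
    return 21_000_000 if device_type == "cuda" else 4_194_304
```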
1. prepare.py: non_blocking=True was always set regardless of pin_memory
   availability, causing a data race on MPS where cpu_buffer is unpinned.
   Now non_blocking follows use_pinned (True only on CUDA with pinned memory).

2. train.py: SDPA fallback path silently ignored window_size, producing
   wrong attention when sliding window layers ("S" in WINDOW_PATTERN)
   were active. Added a runtime warning when window_size < seq_len to
   make the mismatch visible. Non-CUDA already defaults to "L" (full
   attention) via WINDOW_PATTERN, so this is a safety net.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
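Both fixes in this commit can be sketched as follows; `copy_batch` and `check_window` are illustrative names, not the PR's actual functions:

```python
import warnings

import torch


def copy_batch(cpu_buffer: torch.Tensor, device: torch.device) -> torch.Tensor:
    # Fix 1: non_blocking must follow pinning. An async copy from unpinned
    # host memory can race with the producer refilling cpu_buffer, so only
    # go non-blocking on CUDA with actually-pinned memory.
    use_pinned = device.type == "cuda" and cpu_buffer.is_pinned()
    return cpu_buffer.to(device, non_blocking=use_pinned)


def check_window(window_size: int, seq_len: int) -> None:
    # Fix 2: the SDPA fallback ignores window_size, so surface the mismatch
    # instead of silently computing full attention on "S" layers.
    if window_size < seq_len:
        warnings.warn("SDPA fallback ignores window_size; "
                      "computing full attention instead")
```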
@jeong-sik
Author

Closing: this is an upstream repo, so merging isn't my call.

@jeong-sik jeong-sik closed this Mar 22, 2026
