
feat: MPS/CPU support for Apple Silicon (M3 Max)#362

Closed
jeong-sik wants to merge 3 commits into karpathy:master from jeong-sik:feature/mps-support

Conversation

@jeong-sik

Summary

Adapts Karpathy's autoresearch to run in environments without CUDA (Apple Silicon MPS, CPU).

  • FlashAttention 3 → PyTorch SDPA fallback (automatic when CUDA is unavailable)
  • Conditional torch.compile (CUDA only; disabled on MPS/CPU)
  • Reduced configuration for MPS: depth 8→6, batch 128→32, total_batch 2^19→2^16
  • Eval optimization: EVAL_TOKENS 21M→4.2M on MPS (eval time 640s→120s)
  • Split the kernels package out as an optional dependency
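The device fallback and conditional compilation described above can be sketched roughly as follows; `pick_device` and `maybe_compile` are illustrative names, not the PR's actual API:

```python
import torch


def pick_device() -> torch.device:
    # Auto-detection priority: CUDA > MPS > CPU.
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")


def maybe_compile(model: torch.nn.Module, device: torch.device) -> torch.nn.Module:
    # torch.compile is enabled only on CUDA; MPS/CPU run eager.
    if device.type == "cuda":
        return torch.compile(model)
    return model
```

On a non-CUDA machine `maybe_compile` returns the model unchanged, which is what keeps the original CUDA behavior intact.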

Baseline (M3 Max, MPS)

| Metric           | Value    |
| ---------------- | -------- |
| val_bpb          | 1.697931 |
| training_seconds | 300.9    |
| peak_vram_mb     | 273.8    |
| mfu_percent      | 16.44    |
| num_params_M     | 26.3     |
| steps            | 87       |

Test plan

  • uv run train.py runs to completion (87 steps, 5-minute budget)
  • val_bpb measured successfully
  • Verified no regression in CUDA environments (original behavior preserved)

🤖 Generated with Claude Code

jeong-sik and others added 3 commits March 21, 2026 18:24
- Device auto-detection (CUDA > MPS > CPU)
- FA3 -> PyTorch SDPA fallback when kernels unavailable
- Conditional torch.compile (CUDA only)
- float32 on MPS (MPSGraph rejects mixed bf16/f32 ops)
- Device-agnostic dataloader and evaluation
- Adjusted hyperparameters for non-H100: DEPTH=6, batch=32, pattern="L"

M3 Max baseline: val_bpb=1.781, 66 steps/5min, MFU=11.9%

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
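The FA3 → SDPA fallback from this commit can be sketched as below; `attention` and the `use_fa3` flag are illustrative stand-ins for the PR's actual wiring, which dispatches based on kernel availability:

```python
import torch
import torch.nn.functional as F


def attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
              use_fa3: bool) -> torch.Tensor:
    # When the FA3 kernels are unavailable (non-CUDA), fall back to
    # PyTorch's built-in scaled dot-product attention.
    if use_fa3:
        raise NotImplementedError("FA3 kernel path (CUDA only)")
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)
```

Note that SDPA's `is_causal` path gives full causal attention only; sliding-window masking is not applied here, which is exactly the gap the third commit's warning addresses.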
MPS eval was taking 9.4 minutes (longer than 5-min training budget).
Now auto-sets AUTORESEARCH_EVAL_TOKENS=4194304 on non-CUDA devices.
Override via env var if different eval budget needed.

Expected eval time: ~120s (from ~640s).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
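The eval-budget defaulting described in this commit reduces to a small decision, sketched here; `default_eval_tokens` is a hypothetical helper and the 21M CUDA default is approximated from the PR description:

```python
import os


def default_eval_tokens(device_type: str) -> int:
    # Explicit env override always wins, so any eval budget can be forced.
    override = os.environ.get("AUTORESEARCH_EVAL_TOKENS")
    if override is not None:
        return int(override)
    # Non-CUDA devices get the reduced 4,194,304-token budget (~120s on
    # M3 Max); CUDA keeps the original ~21M tokens (illustrative constant).
    return 21_000_000 if device_type == "cuda" else 4_194_304
```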
1. prepare.py: non_blocking=True was always set regardless of pin_memory
   availability, causing a data race on MPS where cpu_buffer is unpinned.
   Now non_blocking follows use_pinned (True only on CUDA with pinned memory).

2. train.py: SDPA fallback path silently ignored window_size, producing
   wrong attention when sliding window layers ("S" in WINDOW_PATTERN)
   were active. Added a runtime warning when window_size < seq_len to
   make the mismatch visible. Non-CUDA already defaults to "L" (full
   attention) via WINDOW_PATTERN, so this is a safety net.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
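Both fixes in this commit can be sketched as follows; `copy_batch` and `check_window` are illustrative names, not the PR's actual functions:

```python
import warnings

import torch


def copy_batch(cpu_buffer: torch.Tensor, device: torch.device) -> torch.Tensor:
    # Fix 1: non_blocking must follow pinning. An async copy from unpinned
    # host memory can race with the producer refilling cpu_buffer, so only
    # go non-blocking on CUDA with actually-pinned memory.
    use_pinned = device.type == "cuda" and cpu_buffer.is_pinned()
    return cpu_buffer.to(device, non_blocking=use_pinned)


def check_window(window_size: int, seq_len: int) -> None:
    # Fix 2: the SDPA fallback ignores window_size, so surface the mismatch
    # instead of silently computing full attention on "S" layers.
    if window_size < seq_len:
        warnings.warn("SDPA fallback ignores window_size; "
                      "computing full attention instead")
```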
@jeong-sik
Author

Closing: this is an upstream repo, so merging isn't my call.

@jeong-sik jeong-sik closed this Mar 22, 2026
