feat: MPS/CPU support for Apple Silicon (M3 Max) #362
Closed
jeong-sik wants to merge 3 commits into karpathy:master from
Conversation
- Device auto-detection (CUDA > MPS > CPU)
- FA3 -> PyTorch SDPA fallback when kernels unavailable
- Conditional torch.compile (CUDA only)
- float32 on MPS (MPSGraph rejects mixed bf16/f32 ops)
- Device-agnostic dataloader and evaluation
- Adjusted hyperparameters for non-H100: DEPTH=6, batch=32, pattern="L"

M3 Max baseline: val_bpb=1.781, 66 steps/5min, MFU=11.9%

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
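The auto-detection and conditional-compile behavior described above can be sketched as follows; the function name `detect_device` is illustrative, not the PR's actual identifier:

```python
# Hedged sketch of CUDA > MPS > CPU device selection, assuming PyTorch >= 2.0.
import torch

def detect_device() -> str:
    """Pick the best available backend in priority order CUDA > MPS > CPU."""
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"

device = detect_device()
# torch.compile only on CUDA; MPS stays in float32 because MPSGraph
# rejects mixed bf16/f32 ops, per the commit message above.
use_compile = device == "cuda"
dtype = torch.bfloat16 if device == "cuda" else torch.float32
```

The dtype choice doubles as the autocast setting: off-CUDA, everything runs in plain float32 with no mixed precision.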
MPS eval was taking 9.4 minutes (longer than the 5-minute training budget). Now auto-sets AUTORESEARCH_EVAL_TOKENS=4194304 on non-CUDA devices; override via the env var if a different eval budget is needed. Expected eval time: ~120s (down from ~640s).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
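The override logic described above might look like the sketch below; only the env var name and the 4194304 default come from the commit message, while the helper name and the `cuda_default` parameter are illustrative:

```python
# Hedged sketch: shrink the eval token budget off-CUDA unless the user
# explicitly sets AUTORESEARCH_EVAL_TOKENS in the environment.
import os

NON_CUDA_EVAL_TOKENS = 4_194_304  # value from the commit message

def eval_tokens(device: str, cuda_default: int) -> int:
    override = os.environ.get("AUTORESEARCH_EVAL_TOKENS")
    if override is not None:
        return int(override)  # explicit env var always wins
    return NON_CUDA_EVAL_TOKENS if device != "cuda" else cuda_default
```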
1. prepare.py: non_blocking=True was always set regardless of pin_memory
availability, causing a data race on MPS where cpu_buffer is unpinned.
Now non_blocking follows use_pinned (True only on CUDA with pinned memory).
2. train.py: SDPA fallback path silently ignored window_size, producing
wrong attention when sliding window layers ("S" in WINDOW_PATTERN)
were active. Added a runtime warning when window_size < seq_len to
make the mismatch visible. Non-CUDA already defaults to "L" (full
attention) via WINDOW_PATTERN, so this is a safety net.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
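The two fixes above can be sketched as below; the helper names `to_device` and `sdpa_with_window_check` are illustrative, not the PR's actual code:

```python
# Hedged sketch of both fixes, assuming PyTorch >= 2.0.
import warnings
import torch
import torch.nn.functional as F

def to_device(batch: torch.Tensor, device: str) -> torch.Tensor:
    # Fix 1: non_blocking follows use_pinned. Async copies are only safe
    # from pinned host memory, which exists only on the CUDA path; on MPS
    # the buffer is unpinned, so the copy must be synchronous.
    use_pinned = device == "cuda" and batch.device.type == "cpu"
    if use_pinned and not batch.is_pinned():
        batch = batch.pin_memory()
    return batch.to(device, non_blocking=use_pinned)

def sdpa_with_window_check(q, k, v, window_size: int):
    # Fix 2: the SDPA fallback computes full causal attention, so make it
    # visible when a sliding window smaller than the sequence is ignored.
    seq_len = q.shape[-2]
    if window_size < seq_len:
        warnings.warn(
            f"SDPA fallback ignores window_size={window_size} < "
            f"seq_len={seq_len}; computing full attention instead."
        )
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)
```

With WINDOW_PATTERN defaulting to "L" off-CUDA, every layer uses full attention and the warning never fires in the default configuration, which is why the commit calls it a safety net.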
Author
Closing - upstream repo, not my merge authority.
Summary
Adapts Karpathy's autoresearch to run in environments without CUDA (Apple Silicon MPS, CPU).
- Eval budget reduced (640s → 120s)
- `kernels` package split out as an optional dependency

Baseline (M3 Max, MPS)
Test plan
- `uv run train.py` runs to completion (87 steps, 5-min budget)

🤖 Generated with Claude Code