feat(train): add preflight run guardrails and setup failure hints#343

Open

kaizen-38 wants to merge 1 commit into karpathy:master from kaizen-38:feat/run-guardrails
Conversation

@kaizen-38
Contributor

Summary

Add a small reproducibility/guardrail layer for training runs by introducing explicit preflight checks and clearer failure paths.

What changed

  • train.py
    • Move CUDA/kernel assumptions out of import time and into a startup preflight.
    • Add PRECHECK_FAIL output with structured reason/hint for setup failures.
    • Validate setup before training starts:
      • CUDA availability and visible device
      • tokenizer artifacts exist in cache
      • WINDOW_PATTERN validity
      • integer gradient accumulation (TOTAL_BATCH_SIZE % (DEVICE_BATCH_SIZE * MAX_SEQ_LEN) == 0)
    • Add a clearer divergence failure (RuntimeError) for non-finite or exploding loss.
  • program.md
    • Teach the experiment loop to treat PRECHECK_FAIL as a setup issue (not an experiment crash).
    • Instruct the agent to stop and ask a human for an environment fix when preflight fails.
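The preflight and divergence guards above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the helper names (`precheck_fail`, `run_preflight`, `check_loss`), the constant values, and the boolean check inputs are all hypothetical stand-ins; the real train.py probes CUDA, the tokenizer cache, and WINDOW_PATTERN directly.

```python
import math
import sys

# Illustrative values only; the constant names mirror the PR description.
TOTAL_BATCH_SIZE = 524288
DEVICE_BATCH_SIZE = 32
MAX_SEQ_LEN = 1024

def precheck_fail(reason, hint):
    # Emit a structured, greppable failure line. program.md keys on the
    # PRECHECK_FAIL prefix to classify this as a setup issue, not a crash.
    print(f"PRECHECK_FAIL reason={reason!r} hint={hint!r}")
    sys.exit(1)

def run_preflight(cuda_available, tokenizer_cached, window_pattern_ok):
    # Validate setup assumptions before any training work begins.
    if not cuda_available:
        precheck_fail("no_cuda", "check driver install and CUDA_VISIBLE_DEVICES")
    if not tokenizer_cached:
        precheck_fail("tokenizer_missing", "run the tokenizer build/download step first")
    if not window_pattern_ok:
        precheck_fail("bad_window_pattern", "fix WINDOW_PATTERN in the config")
    # Gradient accumulation must come out to a whole number of steps.
    tokens_per_step = DEVICE_BATCH_SIZE * MAX_SEQ_LEN
    if TOTAL_BATCH_SIZE % tokens_per_step != 0:
        precheck_fail("non_integer_grad_accum",
                      f"TOTAL_BATCH_SIZE must be divisible by {tokens_per_step}")

def check_loss(loss, step, explode_threshold=20.0):
    # Clear divergence failure instead of silently training on garbage.
    # The threshold is illustrative.
    if not math.isfinite(loss) or loss > explode_threshold:
        raise RuntimeError(f"training diverged at step {step}: loss={loss}")
```

With this shape, a setup problem produces a single structured line and a clean exit, while a mid-run divergence raises a distinct RuntimeError, so the two failure modes are trivially distinguishable downstream.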

Why

This makes failures easier to diagnose, reduces noisy crash loops caused by setup issues, and improves reproducibility by validating key assumptions up front.

