fix: count only real NaN/Inf events in training monitor#20
fix: count only real NaN/Inf events in training monitor#20ogulcanaydogan wants to merge 7 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
This PR aims to eliminate false-positive NaN/Inf monitoring during long-running training (notably on A100) by adding an explicit NaN/Inf guard callback, refining how monitoring scripts count “real” NaN/Inf events, and introducing supporting operational tooling (systemd units + post-run utilities).
Changes:
- Add
NaNGuardCallback(and tests) to stop training after consecutive non-finite metric detections. - Wire training callbacks +
max_grad_norminto the training orchestrator and config. - Add remote-ops scripts/services: systemd units, watchdog, monitor, manifest generator, and post-training pipeline helpers.
Reviewed changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_callbacks.py | Adds unit tests for NaNGuardCallback behavior (stop + reset). |
| src/forge/utils/config.py | Adds training.max_grad_norm to the pydantic config model. |
| src/forge/training/trainer.py | Wires max_grad_norm and adds early-stopping + NaN guard callbacks to the trainer. |
| src/forge/training/callbacks.py | Introduces NaNGuardCallback + non-finite detection helper and formalizes callback base class. |
| scripts/training_watchdog.py | Adds a watchdog that restarts systemd training service on repeated NaN lines / stalls. |
| scripts/start_a100_training.sh | Adds a systemd-friendly entrypoint for A100 training runs (resume support, cache env). |
| scripts/set_wandb_key.sh | Helper to safely write WANDB_API_KEY to a user env file and restart services. |
| scripts/post_training_pipeline.sh | Adds a post-run pipeline (manifest → evals → merge → optional serve checks). |
| scripts/monitor_a100_training.sh | Adds a progress/ETA/NaN counter monitor that writes status files. |
| scripts/install_training_services.sh | Installs/enables systemd user services and bootstraps env file. |
| scripts/generate_training_manifest.py | Adds deterministic manifest generator for completed training runs. |
| docs/TRAINING_GUIDE.md | Links to the new roadmap document. |
| docs/ROADMAP.md | Adds a structured post-run roadmap and definition-of-done. |
| deploy/systemd/forge-training.service | Adds training systemd user unit. |
| deploy/systemd/forge-training-watchdog.service | Adds watchdog systemd user unit. |
| deploy/systemd/forge-training-monitor.service | Adds monitor systemd user unit. |
| configs/models/turkcell_7b_a100_v4_recovery.yaml | Adds A100 recovery profile (incl. max_grad_norm). |
| configs/models/turkcell_7b_a100_v3_clean.yaml | Adds A100 “stable” profile (incl. max_grad_norm). |
| configs/models/turkcell_7b_a100_resume_bf16_clean.yaml | Adds A100 resume bf16 profile. |
| README.md | Adds a short “Post-Completion Roadmap” section linking to docs/ROADMAP.md. |
| Makefile | Adds manifest, post-train, set-wandb targets. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| #!/usr/bin/env bash | ||
| set -euo pipefail | ||
|
|
||
| cd /home/weezboo/projects/LowResource-LLM-Forge |
There was a problem hiding this comment.
cd /home/weezboo/... hard-codes a user-specific path and will break when the repo lives elsewhere (even though the systemd unit already sets WorkingDirectory). Consider switching to a configurable PROJECT_ROOT (as in other scripts) or using the script directory to resolve paths.
| cd /home/weezboo/projects/LowResource-LLM-Forge | |
| PROJECT_ROOT="${PROJECT_ROOT:-$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)}" | |
| cd "$PROJECT_ROOT" |
| # Count only real NaN/Inf metric values and explicit NaN guard events. | ||
| metric_nan_count="$(grep -a -E -i "'(loss|grad_norm|eval_loss)':[[:space:]]*'?(nan|inf)'?" "$LOG_FILE" | wc -l | tr -d '[:space:]' || true)" | ||
| guard_nan_count="$(grep -a -E -c "nan_guard_detected|nan_guard_stopping_training" "$LOG_FILE" || true)" | ||
| metric_nan_count="${metric_nan_count:-0}" | ||
| guard_nan_count="${guard_nan_count:-0}" | ||
| nan_count=$((metric_nan_count + guard_nan_count)) | ||
| fi |
There was a problem hiding this comment.
nan_count will likely double-count NaN/Inf events: metric_nan_count matches any occurrence of '(loss|grad_norm|eval_loss)': ... (nan|inf) anywhere in the log, including inside bad_metrics={...} on nan_guard_detected lines, and then guard_nan_count counts those same guard lines again. To keep the counter accurate, exclude nan_guard_* lines from the metric grep (or tighten the metric-line pattern so it only matches the trainer’s metric dict lines) before adding guard_nan_count.
| # Count only real NaN/Inf metric values and explicit NaN guard events. | ||
| metric_nan_count="$(grep -a -E -i "'(loss|grad_norm|eval_loss)':[[:space:]]*'?(nan|inf)'?" "$LOG_FILE" | wc -l | tr -d '[:space:]' || true)" |
There was a problem hiding this comment.
The metric grep only matches nan|inf but not +inf / -inf, while the Python-side _is_non_finite treats those as non-finite too. If you want monitor parity with the callback, extend the pattern to include signed inf variants.
| # Count only real NaN/Inf metric values and explicit NaN guard events. | |
| metric_nan_count="$(grep -a -E -i "'(loss|grad_norm|eval_loss)':[[:space:]]*'?(nan|inf)'?" "$LOG_FILE" | wc -l | tr -d '[:space:]' || true)" | |
| # Count only real NaN/Inf metric values (including +inf/-inf) and explicit NaN guard events. | |
| metric_nan_count="$(grep -a -E -i "'(loss|grad_norm|eval_loss)':[[:space:]]*'?(nan|[+-]?inf)'?" "$LOG_FILE" | wc -l | tr -d '[:space:]' || true)" |
| if current_metric_hash and current_metric_hash != state.last_metric_hash: | ||
| state.last_metric_hash = current_metric_hash | ||
| if "nan" in last_metric.lower(): | ||
| state.nan_consecutive += 1 | ||
| else: | ||
| state.nan_consecutive = 0 |
There was a problem hiding this comment.
The watchdog increments nan_consecutive only when 'nan' in last_metric.lower(). Since the monitor/callback logic treats Inf as non-finite too, this misses inf, +inf, -inf events and could fail to restart on an Inf blow-up. Consider checking for both NaN and Inf (ideally with the same regex used elsewhere).
| manifest: ## Generate training manifest (TRAIN_CONFIG/RUN_DIR/LOG_FILE) | ||
| @if [ -z "$(RUN_DIR)" ] || [ -z "$(LOG_FILE)" ]; then \ | ||
| echo "Usage: make manifest TRAIN_CONFIG=<yaml> RUN_DIR=<run_dir> LOG_FILE=<train_log>"; \ | ||
| exit 1; \ | ||
| fi | ||
| uv run python scripts/generate_training_manifest.py \ | ||
| --config "$(TRAIN_CONFIG)" \ | ||
| --run-dir "$(RUN_DIR)" \ | ||
| --log-file "$(LOG_FILE)" |
There was a problem hiding this comment.
make manifest uses $(TRAIN_CONFIG) but the argument validation only checks RUN_DIR and LOG_FILE. If a user clears TRAIN_CONFIG (or runs with an empty env var), this will fail later with a less actionable argparse error. Consider validating TRAIN_CONFIG in the same check (or defaulting it explicitly for this target).
Summary: fix false-positive nan_count in A100 monitor by counting only metric NaN/Inf values and explicit nan guard events. Validation: bash -n script passed; restarted monitor service on A100 and observed nan_count=0 during active training.