fix: count only real NaN/Inf events in training monitor by ogulcanaydogan · Pull Request #20 · ogulcanaydogan/LowResource-LLM-Forge

ogulcanaydogan · 2026-02-24T07:31:59Z

Summary: fix false-positive nan_count in A100 monitor by counting only metric NaN/Inf values and explicit nan guard events. Validation: bash -n script passed; restarted monitor service on A100 and observed nan_count=0 during active training.

Copilot

Pull request overview

This PR aims to eliminate false-positive NaN/Inf monitoring during long-running training (notably on A100) by adding an explicit NaN/Inf guard callback, refining how monitoring scripts count “real” NaN/Inf events, and introducing supporting operational tooling (systemd units + post-run utilities).

Changes:

Add NaNGuardCallback (and tests) to stop training after consecutive non-finite metric detections.
Wire training callbacks + max_grad_norm into the training orchestrator and config.
Add remote-ops scripts/services: systemd units, watchdog, monitor, manifest generator, and post-training pipeline helpers.

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
tests/test_callbacks.py	Adds unit tests for `NaNGuardCallback` behavior (stop + reset).
src/forge/utils/config.py	Adds `training.max_grad_norm` to the pydantic config model.
src/forge/training/trainer.py	Wires `max_grad_norm` and adds early-stopping + NaN guard callbacks to the trainer.
src/forge/training/callbacks.py	Introduces `NaNGuardCallback` + non-finite detection helper and formalizes callback base class.
scripts/training_watchdog.py	Adds a watchdog that restarts systemd training service on repeated NaN lines / stalls.
scripts/start_a100_training.sh	Adds a systemd-friendly entrypoint for A100 training runs (resume support, cache env).
scripts/set_wandb_key.sh	Helper to safely write `WANDB_API_KEY` to a user env file and restart services.
scripts/post_training_pipeline.sh	Adds a post-run pipeline (manifest → evals → merge → optional serve checks).
scripts/monitor_a100_training.sh	Adds a progress/ETA/NaN counter monitor that writes status files.
scripts/install_training_services.sh	Installs/enables systemd user services and bootstraps env file.
scripts/generate_training_manifest.py	Adds deterministic manifest generator for completed training runs.
docs/TRAINING_GUIDE.md	Links to the new roadmap document.
docs/ROADMAP.md	Adds a structured post-run roadmap and definition-of-done.
deploy/systemd/forge-training.service	Adds training systemd user unit.
deploy/systemd/forge-training-watchdog.service	Adds watchdog systemd user unit.
deploy/systemd/forge-training-monitor.service	Adds monitor systemd user unit.
configs/models/turkcell_7b_a100_v4_recovery.yaml	Adds A100 recovery profile (incl. `max_grad_norm`).
configs/models/turkcell_7b_a100_v3_clean.yaml	Adds A100 “stable” profile (incl. `max_grad_norm`).
configs/models/turkcell_7b_a100_resume_bf16_clean.yaml	Adds A100 resume bf16 profile.
README.md	Adds a short “Post-Completion Roadmap” section linking to `docs/ROADMAP.md`.
Makefile	Adds `manifest`, `post-train`, `set-wandb` targets.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-02-24T07:38:00Z

+#!/usr/bin/env bash
+set -euo pipefail
+
+cd /home/weezboo/projects/LowResource-LLM-Forge


cd /home/weezboo/... hard-codes a user-specific path and will break when the repo lives elsewhere (even though the systemd unit already sets WorkingDirectory). Consider switching to a configurable PROJECT_ROOT (as in other scripts) or using the script directory to resolve paths.

Suggested change

cd /home/weezboo/projects/LowResource-LLM-Forge

PROJECT_ROOT="${PROJECT_ROOT:-$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)}"

cd "$PROJECT_ROOT"

Copilot · 2026-02-24T07:38:01Z

+        # Count only real NaN/Inf metric values and explicit NaN guard events.
+        metric_nan_count="$(grep -a -E -i "'(loss|grad_norm|eval_loss)':[[:space:]]*'?(nan|inf)'?" "$LOG_FILE" | wc -l | tr -d '[:space:]' || true)"
+        guard_nan_count="$(grep -a -E -c "nan_guard_detected|nan_guard_stopping_training" "$LOG_FILE" || true)"
+        metric_nan_count="${metric_nan_count:-0}"
+        guard_nan_count="${guard_nan_count:-0}"
+        nan_count=$((metric_nan_count + guard_nan_count))
+    fi


nan_count will likely double-count NaN/Inf events: metric_nan_count matches any occurrence of '(loss|grad_norm|eval_loss)': ... (nan|inf) anywhere in the log, including inside bad_metrics={...} on nan_guard_detected lines, and then guard_nan_count counts those same guard lines again. To keep the counter accurate, exclude nan_guard_* lines from the metric grep (or tighten the metric-line pattern so it only matches the trainer’s metric dict lines) before adding guard_nan_count.

Copilot · 2026-02-24T07:38:01Z

+        # Count only real NaN/Inf metric values and explicit NaN guard events.
+        metric_nan_count="$(grep -a -E -i "'(loss|grad_norm|eval_loss)':[[:space:]]*'?(nan|inf)'?" "$LOG_FILE" | wc -l | tr -d '[:space:]' || true)"


The metric grep only matches nan|inf but not +inf / -inf, while the Python-side _is_non_finite treats those as non-finite too. If you want monitor parity with the callback, extend the pattern to include signed inf variants.

Suggested change

# Count only real NaN/Inf metric values and explicit NaN guard events.

metric_nan_count="$(grep -a -E -i "'(loss|grad_norm|eval_loss)':[[:space:]]*'?(nan|inf)'?" "$LOG_FILE" | wc -l | tr -d '[:space:]' || true)"

# Count only real NaN/Inf metric values (including +inf/-inf) and explicit NaN guard events.

metric_nan_count="$(grep -a -E -i "'(loss|grad_norm|eval_loss)':[[:space:]]*'?(nan|[+-]?inf)'?" "$LOG_FILE" | wc -l | tr -d '[:space:]' || true)"

Copilot · 2026-02-24T07:38:01Z

+        if current_metric_hash and current_metric_hash != state.last_metric_hash:
+            state.last_metric_hash = current_metric_hash
+            if "nan" in last_metric.lower():
+                state.nan_consecutive += 1
+            else:
+                state.nan_consecutive = 0


The watchdog increments nan_consecutive only when 'nan' in last_metric.lower(). Since the monitor/callback logic treats Inf as non-finite too, this misses inf, +inf, -inf events and could fail to restart on an Inf blow-up. Consider checking for both NaN and Inf (ideally with the same regex used elsewhere).

Copilot · 2026-02-24T07:38:02Z

+manifest: ## Generate training manifest (TRAIN_CONFIG/RUN_DIR/LOG_FILE)
+	@if [ -z "$(RUN_DIR)" ] || [ -z "$(LOG_FILE)" ]; then \
+		echo "Usage: make manifest TRAIN_CONFIG=<yaml> RUN_DIR=<run_dir> LOG_FILE=<train_log>"; \
+		exit 1; \
+	fi
+	uv run python scripts/generate_training_manifest.py \
+		--config "$(TRAIN_CONFIG)" \
+		--run-dir "$(RUN_DIR)" \
+		--log-file "$(LOG_FILE)"


make manifest uses $(TRAIN_CONFIG) but the argument validation only checks RUN_DIR and LOG_FILE. If a user clears TRAIN_CONFIG (or runs with an empty env var), this will fail later with a less actionable argparse error. Consider validating TRAIN_CONFIG in the same check (or defaulting it explicitly for this target).

ogulcanaydogan added 5 commits February 24, 2026 07:17

feat(training): add NaN fail-fast and A100 recovery profiles

d00be64

feat(ops): add systemd services and remote training control scripts

d82fee0

feat(pipeline): add training manifest and post-training workflow

8039592

docs: add post-completion roadmap and training guide links

3767323

fix(monitor): count only real NaN/Inf events in training log

673b1a2

Copilot AI review requested due to automatic review settings February 24, 2026 07:32

Copilot started reviewing on behalf of ogulcanaydogan February 24, 2026 07:32 View session

Copilot AI reviewed Feb 24, 2026

View reviewed changes

ogulcanaydogan added 2 commits February 24, 2026 09:33

fix(monitor): fall back to generic run_training process detection

694ae7b

feat(training): harden A100 runtime and add v6 optimizer-reset recovery

250c8dc

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: count only real NaN/Inf events in training monitor#20

fix: count only real NaN/Inf events in training monitor#20
ogulcanaydogan wants to merge 7 commits into
mainfrom
fix/monitor-real-nan-count

ogulcanaydogan commented Feb 24, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Feb 24, 2026

Uh oh!

Copilot AI Feb 24, 2026

Uh oh!

Copilot AI Feb 24, 2026

Uh oh!

Copilot AI Feb 24, 2026

Uh oh!

Copilot AI Feb 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	cd /home/weezboo/projects/LowResource-LLM-Forge
	PROJECT_ROOT="${PROJECT_ROOT:-$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)}"
	cd "$PROJECT_ROOT"

		# Count only real NaN/Inf metric values and explicit NaN guard events.
		metric_nan_count="$(grep -a -E -i "'(loss\|grad_norm\|eval_loss)':[[:space:]]*'?(nan\|inf)'?" "$LOG_FILE" \| wc -l \| tr -d '[:space:]' \|\| true)"

Conversation

ogulcanaydogan commented Feb 24, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants