[feat] multiplex trainer token export and agg to prime monitor on orch side#2641

Open
Jackmin801 wants to merge 6 commits into main from feat-export-kl-n-entropy
Jackmin801 (Member) commented May 26, 2026

Note

Medium Risk
Changes distributed export layout, cross-rank STABLE coordination, and orchestrator filesystem metrics; low production impact but affects multi-run debugging and monitoring correctness.

Overview
Multi-run token exports now land under each run’s directory (output_dir/<run_id>/token_exports/step_<run_step>/) instead of only the trainer root, with run_id / run_step stamped on micro-batches in the packer and carried through transport and the data loader.
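The run-local layout described above can be sketched as a small path helper. This is only an illustration of the directory scheme quoted in the summary; the function name `token_export_dir` is hypothetical and not taken from the PR.

```python
from pathlib import Path


def token_export_dir(output_dir: str, run_id: str, run_step: int) -> Path:
    """Build output_dir/<run_id>/token_exports/step_<run_step>/.

    Hypothetical helper: only the directory layout comes from the PR summary.
    """
    return Path(output_dir) / run_id / "token_exports" / f"step_{run_step}"


# Two runs sharing one trainer root no longer write into the same directory.
dir_a = token_export_dir("outputs", "run_a", 3)
dir_b = token_export_dir("outputs", "run_b", 3)
```

Keying the directory on `run_step` rather than a trainer-global step would let each run's exports be read without knowing how the runs were interleaved.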

The trainer marks each export step complete by writing a distributed STABLE file after the step (mark_stable at the end of forward/backward), and export records gain export_step and run_id fields. Non-exporting CP ranks still participate in stable marking.
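A minimal sketch of how such a marker could work, assuming the coordination is a collective barrier followed by one rank creating the file. The PR's actual `mark_stable` may differ. The `barrier` argument here stands in for something like a distributed barrier, and the `is_stable` helper is hypothetical.

```python
from pathlib import Path
from typing import Callable


def mark_stable(step_dir: Path, rank: int, barrier: Callable[[], None]) -> None:
    """Assumed protocol: wait for all ranks, then rank 0 writes STABLE.

    Every rank must call this, including ranks that exported nothing,
    because a collective barrier blocks until all ranks reach it.
    """
    barrier()  # all ranks have finished writing their export files
    if rank == 0:
        step_dir.mkdir(parents=True, exist_ok=True)
        (step_dir / "STABLE").touch()
    barrier()  # no rank proceeds before the marker exists


def is_stable(step_dir: Path) -> bool:
    """Readers treat a step directory as complete only once STABLE exists."""
    return (step_dir / "STABLE").exists()
```

Under this assumption, a non-exporting CP rank that skipped the call would leave the other ranks waiting at the barrier, which is one plausible reason those ranks still take part in stable marking.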

The orchestrator reads stable token_exports JSONL and logs aggregated entropy and mismatch_kl (mean/max over loss-masked tokens) to the monitor, one stable step at a time. Docs and unit/integration tests cover run-local paths and metrics.
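The orchestrator-side aggregation might look roughly like the sketch below. The record schema is an assumption: each JSONL line is taken to hold per-token lists named `entropy`, `mismatch_kl`, and `loss_mask`, and the function name `aggregate_step` and the metric keys are hypothetical.

```python
import json
from pathlib import Path
from typing import Optional


def aggregate_step(step_dir: Path) -> Optional[dict]:
    """Return mean/max of entropy and mismatch_kl over loss-masked tokens.

    Returns None if the step has no STABLE marker yet (assumed convention).
    """
    if not (step_dir / "STABLE").exists():
        return None
    values = {"entropy": [], "mismatch_kl": []}
    for path in sorted(step_dir.glob("*.jsonl")):
        with path.open() as f:
            for line in f:
                if not line.strip():
                    continue
                record = json.loads(line)
                mask = record["loss_mask"]  # assumed field name
                for key in values:
                    # Keep only tokens that contribute to the loss.
                    values[key].extend(v for v, m in zip(record[key], mask) if m)
    metrics = {}
    for key, vals in values.items():
        if vals:
            metrics[f"{key}/mean"] = sum(vals) / len(vals)
            metrics[f"{key}/max"] = max(vals)
    return metrics
```

Refusing to read a directory before the marker exists means a partially written JSONL file is never parsed, at the cost of metrics lagging by however long the trainer takes to finish the step.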

Reviewed by Cursor Bugbot for commit d6bb6bb. Bugbot is set up for automated code reviews on this repo.

Jackmin801 marked this pull request as ready for review May 27, 2026 06:13
