add routing confidence metric #2545

Open
faresobeid wants to merge 1 commit into main from routing-confidence
@faresobeid (Contributor) commented May 18, 2026

Adds the metric from the stepfun3.5 flash paper that measures how stable an MoE model's routing is during RL (and, by extension, whether router replay is needed).
[Screenshot, 2026-05-18 at 18:54:43]


Note

Medium Risk
Adds new MoE routing-stat accumulation and changes router forward return signatures, which could affect training correctness/perf if any call sites assume the old outputs. Changes are limited to telemetry/stat buffers and logging paths, not core loss/optimizer logic.

Overview
Adds a new MoE routing confidence metric by accumulating the selected-expert probability mass in MoE routers (TokenChoiceTopKRouter and NemotronHRouter) and tracking it via a new routing_confidence_sum buffer on MoE/LatentMoE.
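As a rough illustration of what "selected-expert probability mass" means, here is a minimal pure-Python sketch. It assumes softmax router scores and a plain top-k selection; the function names `softmax` and `routing_confidence` are hypothetical and are not taken from the PR, whose routers may score experts differently (for example with a sigmoid) and accumulate into the `routing_confidence_sum` buffer rather than returning a value.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def routing_confidence(router_logits, top_k):
    """Mean over tokens of the probability mass on the top_k selected experts.

    Assumption: softmax scores and top-k selection. This is an illustrative
    stand-in, not the PR's implementation.
    """
    confidence_sum = 0.0
    for token_logits in router_logits:
        probs = softmax(token_logits)
        # Probability mass carried by the experts the token is routed to.
        selected = sorted(probs, reverse=True)[:top_k]
        confidence_sum += sum(selected)
    return confidence_sum / len(router_logits)


# One token with two clearly preferred experts, one with no preference at all.
confident = routing_confidence([[8.0, 7.5, -3.0, -4.0]], top_k=2)
uncertain = routing_confidence([[0.0, 0.0, 0.0, 0.0]], top_k=2)
```

Under these assumptions a value near 1 means the selected experts hold almost all of the router's probability mass, while a value near `top_k / num_experts` means the router is close to uniform, so small logit perturbations could change which experts are chosen.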

Extends get_load_balance_stats() to return both max_vio and routing_confidence (and reset both), and wires the new metric into RL and SFT training logs/monitoring with appropriate distributed reduction (MAX for max_vio, mean across ranks for routing_confidence).
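The reduction rule described above can be sketched without a process group. The following pure-Python simulation treats each list entry as one rank's local stats; `reduce_across_ranks` is a hypothetical helper, and the idea that the mean is obtained as a SUM followed by division by the world size is an assumption about how the PR computes it.

```python
def reduce_across_ranks(per_rank_stats):
    """Simulate the cross-rank reduction of MoE routing stats.

    per_rank_stats: one dict per rank, mapping a stat name to a float.
    Assumption: max_vio is reduced with MAX; every other stat is summed
    and then divided by the world size to give a mean across ranks.
    """
    world_size = len(per_rank_stats)
    reduced = {}
    for name in per_rank_stats[0]:
        values = [stats[name] for stats in per_rank_stats]
        if name == "max_vio":
            # The worst load-balance violation on any rank is what matters.
            reduced[name] = max(values)
        else:
            # SUM across ranks, then normalize to a mean.
            reduced[name] = sum(values) / world_size
    return reduced


stats = reduce_across_ranks([
    {"max_vio": 0.10, "routing_confidence": 0.80},
    {"max_vio": 0.30, "routing_confidence": 0.60},
])
```

Keying the reduction on the stat name is what the reviewed line below does; a table mapping each stat name to its reduce op would be one alternative if more stats are added later.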

Reviewed by Cursor Bugbot for commit 57679b1.

if values is None:
    continue
value = values.mean()
reduce_op = dist.ReduceOp.MAX if name == "max_vio" else dist.ReduceOp.SUM
Member

eheheeh not a fan of this

Contributor Author

hmm what part? The if statement?

Member

yeah, don't like it, but also not sure how to do better

2 participants