feat(glm): make next-message role marker trainable after assistant turns#66
Conversation
ApprovabilityVerdict: Needs human review This PR changes training loss mask behavior to make role-marker tokens trainable after assistant turns, which is a significant runtime change affecting model training. Additionally, there is an unresolved high-severity review comment identifying a potential bug where the intended tokens may still be excluded from the loss mask. You can customize Macroscope's approvability policy. Learn more. |
cd35340 to
f32876f
Compare
6772961 to
3bfc257
Compare
… turns GLM-4.5 / GLM-5 templates have no per-turn close token inside the assistant span — the next message's role marker (<|user|> / <|observation|>) doubles as the inference stop signal (see get_stop_token_ids). Before this change, that role marker was emitted with is_sampled=False, so build_training_sample never trained the model to predict it. At inference the natural continuation of </tool_call>\n is another <tool_call>, which is what we observed: GLM-Air SFT step_250 dumped 50-283 parallel tool_calls per turn on SWE-bench Verified (median max-tool-calls 35 for solved rollouts vs 55 for failed; pass@1 25.0%). Fix: - glm45.py / glm5.py: when the previous message is an assistant, emit the next message's role-opening token with is_sampled=True. The token stays attributed to that next message (msg_idx unchanged); only sampled_mask flips. Byte stream identical to apply_chat_template. - base.py: build_training_sample now accepts role_to_mask=None. With a renderer that populates sampled_mask, the default behaviour becomes loss_mask = sampled_mask — the renderer's per-token "model would have sampled this" signal is the truth, and role-level filtering is opt-in rather than required. Callers passing role_to_mask keep the old AND-behaviour; callers without it now train on every sampled token (including the assistant's closing role marker). system role only appears at the start of a GLM conversation, so its opener is never the closer of an assistant turn — no logic needed for it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
3bfc257 to
0dbc5ff
Compare
The bridge synthesizes the next-step prompt by appending new messages (tool responses, follow-up user turns) on top of what the model actually sampled this round (``previous_completion_ids``). Tokens it emits are template scaffolding for the next generation, not model output from this step. When the model fails to sample its stop token and the bridge has to synthesize the boundary via a ``<|user|>`` / ``<|observation|>`` role-opener for the first new message, that opener stays ``is_sampled=False, is_content=False`` — even though :meth:`render` would mark the same token ``True, True`` on the SFT path. The discrepancy is intentional: render's masks describe what the model *should* produce given a complete conversation (SFT target); bridge's masks describe what it *actually* produced this step (RL signal). The RL loss operates on ``previous_completion_ids`` exclusively; bridge tokens belong to the subsequent prompt and must not be counted as sampled output by downstream mask consumers (e.g. per-step credit assignment in trajectory metrics). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…body semantics
The previous commit attributed ``<|user|>`` / ``<|observation|>`` tokens
that close an assistant turn with ``is_content=True`` so the body-only
SFT path (``content_sft_roles``) would also train on them. That bled the
role-marker into ``content_mask_for_roles({"tool"})`` and
``content_token_spans_by_role()``, which downstream consumers expect to
return the actual tool response body only.
Keep ``is_sampled=True`` (the actual training signal that makes the
model learn to emit the stop marker after ``</tool_call>``) but revert
``is_content`` to ``False``: the role-opener is scaffolding, not body.
In the body-only SFT mode the assistant stop-signal training belongs to
the RL path anyway, not to SFT on tool-response bodies.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit cb07cc4. Configure here.
| # sampled_mask alone gates the loss when no role filter is | ||
| # supplied. ``sampled_mask[k]`` is True here (handled by the | ||
| # branch above), so this token is trainable. | ||
| loss_mask.append(True) |
There was a problem hiding this comment.
Stop tokens still excluded from loss
High Severity
The build_training_sample function incorrectly excludes renderer-generated scaffolding tokens (like GLM's <|observation|>) from the loss_mask. Despite being sampled_mask=True, these tokens are masked out when role_to_mask filters them or when content_sft_roles is active, as the latter prioritizes is_content (which is False for scaffolding). This prevents training on these critical tokens.
Reviewed by Cursor Bugbot for commit cb07cc4. Configure here.
There was a problem hiding this comment.
This is the natural consequence of GLM's chat-template design where the per-turn stop signal lives inside the next message's span — there is no dedicated <|im_end|>-style terminator inside the assistant turn itself. We have three knobs and they trade off:
- msg_idx attribution of
<|observation|>/<|user|>after an assistant turn. We keep it attached to the next message (the tool/user message it opens) rather than retro-attributing to the assistant. The byte stream still ends the assistant's turn at that token, but structurally it belongs to the next message — which is what every downstream consumer (message_token_spans, role-based slicing, bridge/extension code) expects. is_sampled=Trueon that opener — added by this PR. Captures the fact that the model would sample that token at inference to close its turn. This is what makes the default training path (role_to_mask=None) train the model to emit the stop.is_content=Falseon that opener — preserved from before. The token is a role marker, not message body, socontent_mask_for_roles({"tool"})andcontent_token_spans_by_role()correctly exclude it.
The two specialized modes you flag fall out of this:
role_to_mask=lambda m: m["role"] == "assistant"— by construction, "only train on assistant-attributed tokens." Stop opener is attributed to the tool message → excluded. Working as specified; if the caller wants the stop trained too, they passrole_to_mask=None(default) and rely on the sampled-only path.content_sft_roles={"tool"}— body-only SFT on tool responses, gated byis_content. Scaffold tokens (including the stop opener) are intentionallyis_content=Falseto keepcontent_token_spans_by_rolehonest. In this mixed RL-on-assistant + SFT-on-tool-body mode, the assistant stop-signal supervision belongs to the RL trajectory, not to body-only SFT.
In short, the GLM chat template can't express "this token belongs to message N but counts as message N-1's output for training" cleanly, and the choice we made here (keep structural attribution, use is_sampled as the per-token "model would sample this" signal) is the cleanest fit with the existing build_training_sample contract. Default-mode training works correctly; the filter-mode misses are explicit user opt-ins, not silent bugs.
The renderers repo has no ``[tool.ruff]`` override for ``line-length``, so CI uses ruff's default of 88. Locally, ruff was picking up the parent prime-rl workspace's ``line-length = 120`` (via parent-dir config discovery), which is why ``ruff format --check`` passed locally but failed CI. Reformatting the three files this PR touches at the repo's actual line length. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mikasenghaas
left a comment
There was a problem hiding this comment.
comment to code ratio is crazy on this one


Summary
GLM-4.5 / GLM-5 chat templates have no per-turn close token inside the assistant span — the next message's role marker (
<|user|>/<|observation|>/<|system|>) doubles as the inference stop signal (seeget_stop_token_ids).Before this change, that role marker was attributed to the next message with
is_sampled=False, so SFT never trained the model to emit it. WithLossMaskConfig.assistant=Trueandtool/user/system=False(the defaults), the only training signal saying "stop after</tool_call>" was thedata.pyEOS-fallback on the very last assistant of the conversation — intermediate turns got nothing. At inference the natural continuation of</tool_call>\nis another<tool_call>, which is exactly what we observe in practice.Repro
GLM-Air SFT step_250 evaluated on SWE-bench Verified (full, 500 examples × 2 rollouts) emitted 50-283 parallel tool_calls per turn in a single assistant turn. 26% of assistant turns had
len(tool_calls) > 1, with median max-tool-calls = 35 for solved rollouts vs 55 for failed. Pass@1 was 25.0%; clear correlation between high-burst rollouts and failure.Fix
When the previous message is an assistant, attribute that single role-opening token to the assistant's
msg_idxwithis_sampled=True. The byte stream is unchanged (token_idsstill matchapply_chat_template); onlymessage_indicesandsampled_maskfor that one token change.build_training_samplethen yieldsloss_mask=Truefor it (because the assistant role is trainable by default), so SFT explicitly trains the model to emit the stop token after</tool_call>/ content.Applied symmetrically in both
glm45.pyandglm5.pyfor:<|observation|>(when next message istool)<|user|>(when next message isuser)<|system|>(when next message issystem)Test plan
test_build_training_sample_ids_matchstill passes (token_ids unchanged)[user, assistant w/ tool_call, tool, user, ...], the<|observation|>and<|user|>tokens that follow the assistant haveloss_mask=True🤖 Generated with Claude Code
Note
Medium Risk
Changes which tokens receive SFT loss for GLM-4.5/5 (mask metadata only, not token_ids) and alters build_training_sample defaults; incorrect caller assumptions could raise ValueError or shift training signal.
Overview
GLM chat templates use the next message’s
<|user|>/<|observation|>as inference stop tokens after assistant tool turns, but those tokens were previously markedis_sampled=False, so typical assistant-only SFT never trained the model to emit them after</tool_call>.In
glm45.pyandglm5.py, when the prior turn isassistant, the following user/tool role opener is emitted withis_sampled=True(stillis_content=Falseand sametoken_ids).bridge_to_next_turnkeeps those openersis_sampled=Falseand documents the intentional SFT vs RL split.In
base.py,build_training_samplenow accepts optionalrole_to_mask=None: if the renderer providessampled_mask, loss followssampled_maskalone so GLM stop markers on the next message can be trained without a role filter. Callers that omitrole_to_maskon renderers withoutsampled_maskgetValueErrorinstead of a wrong mask.Reviewed by Cursor Bugbot for commit b9fae66. Bugbot is set up for automated code reviews on this repo. Configure here.
Note
Mark next-message role tokens as trainable after assistant turns in GLM renderers
<|user|>and<|observation|>role opener tokens that immediately follow an assistant message are now emitted withis_sampled=Trueinstead of alwaysFalse.build_training_samplenow accepts an optionalrole_to_mask(defaulting toNone) and usessampled_maskfrom the renderer when no role filter is supplied, raisingValueErrorif neither is available.Macroscope summarized b9fae66.