Skip to content

feat(glm): make next-message role marker trainable after assistant turns#66

Merged
hallerite merged 4 commits into
mainfrom
feat/glm-per-turn-close-token
May 26, 2026
Merged

feat(glm): make next-message role marker trainable after assistant turns#66
hallerite merged 4 commits into
mainfrom
feat/glm-per-turn-close-token

Conversation

@hallerite
Copy link
Copy Markdown
Member

@hallerite hallerite commented May 26, 2026

Summary

GLM-4.5 / GLM-5 chat templates have no per-turn close token inside the assistant span — the next message's role marker (<|user|> / <|observation|> / <|system|>) doubles as the inference stop signal (see get_stop_token_ids).

Before this change, that role marker was attributed to the next message with is_sampled=False, so SFT never trained the model to emit it. With LossMaskConfig.assistant=True and tool/user/system=False (the defaults), the only training signal saying "stop after </tool_call>" was the data.py EOS-fallback on the very last assistant of the conversation — intermediate turns got nothing. At inference the natural continuation of </tool_call>\n is another <tool_call>, which is exactly what we observe in practice.

Repro

GLM-Air SFT step_250 evaluated on SWE-bench Verified (full, 500 examples × 2 rollouts) emitted 50-283 parallel tool_calls per turn in a single assistant turn. 26% of assistant turns had len(tool_calls) > 1, with median max-tool-calls = 35 for solved rollouts vs 55 for failed. Pass@1 was 25.0%; clear correlation between high-burst rollouts and failure.

Fix

When the previous message is an assistant, attribute that single role-opening token to the assistant's msg_idx with is_sampled=True. The byte stream is unchanged (token_ids still match apply_chat_template); only message_indices and sampled_mask for that one token change. build_training_sample then yields loss_mask=True for it (because the assistant role is trainable by default), so SFT explicitly trains the model to emit the stop token after </tool_call> / content.

Applied symmetrically in both glm45.py and glm5.py for:

  • <|observation|> (when next message is tool)
  • <|user|> (when next message is user)
  • <|system|> (when next message is system)

Test plan

  • test_build_training_sample_ids_match still passes (token_ids unchanged)
  • New unit test: in a conversation [user, assistant w/ tool_call, tool, user, ...], the <|observation|> and <|user|> tokens that follow the assistant have loss_mask=True
  • Re-SFT GLM-Air on INTELLECT-5-SFT-Processed and re-evaluate on SWE-bench Verified; expect multi-tool-call ratio to drop sharply and pass@1 to increase from 25.0%

🤖 Generated with Claude Code


Note

Medium Risk
Changes which tokens receive SFT loss for GLM-4.5/5 (mask metadata only, not token_ids) and alters build_training_sample defaults; incorrect caller assumptions could raise ValueError or shift training signal.

Overview
GLM chat templates use the next message’s <|user|> / <|observation|> as inference stop tokens after assistant tool turns, but those tokens were previously marked is_sampled=False, so typical assistant-only SFT never trained the model to emit them after </tool_call>.

In glm45.py and glm5.py, when the prior turn is assistant, the following user/tool role opener is emitted with is_sampled=True (still is_content=False and same token_ids). bridge_to_next_turn keeps those openers is_sampled=False and documents the intentional SFT vs RL split.

In base.py, build_training_sample now accepts optional role_to_mask=None: if the renderer provides sampled_mask, loss follows sampled_mask alone so GLM stop markers on the next message can be trained without a role filter. Callers that omit role_to_mask on renderers without sampled_mask get ValueError instead of a wrong mask.

Reviewed by Cursor Bugbot for commit b9fae66. Bugbot is set up for automated code reviews on this repo. Configure here.

Note

Mark next-message role tokens as trainable after assistant turns in GLM renderers

  • In glm45.py and glm5.py, the <|user|> and <|observation|> role opener tokens that immediately follow an assistant message are now emitted with is_sampled=True instead of always False.
  • In renderers/base.py, build_training_sample now accepts an optional role_to_mask (defaulting to None) and uses sampled_mask from the renderer when no role filter is supplied, raising ValueError if neither is available.
  • Behavioral Change: loss mask semantics for GLM models now include the next-message role marker token in the trainable region when it follows an assistant turn.

Macroscope summarized b9fae66.

Comment thread renderers/glm45.py Outdated
@macroscopeapp
Copy link
Copy Markdown

macroscopeapp Bot commented May 26, 2026

Approvability

Verdict: Needs human review

This PR changes training loss mask behavior to make role-marker tokens trainable after assistant turns, which is a significant runtime change affecting model training. Additionally, there is an unresolved high-severity review comment identifying a potential bug where the intended tokens may still be excluded from the loss mask.

You can customize Macroscope's approvability policy. Learn more.

@hallerite hallerite force-pushed the feat/glm-per-turn-close-token branch 2 times, most recently from cd35340 to f32876f Compare May 26, 2026 19:39
Comment thread renderers/base.py
Comment thread renderers/base.py
@hallerite hallerite force-pushed the feat/glm-per-turn-close-token branch 2 times, most recently from 6772961 to 3bfc257 Compare May 26, 2026 19:44
… turns

GLM-4.5 / GLM-5 templates have no per-turn close token inside the assistant
span — the next message's role marker (<|user|> / <|observation|>) doubles
as the inference stop signal (see get_stop_token_ids).

Before this change, that role marker was emitted with is_sampled=False, so
build_training_sample never trained the model to predict it. At inference
the natural continuation of </tool_call>\n is another <tool_call>, which
is what we observed: GLM-Air SFT step_250 dumped 50-283 parallel tool_calls
per turn on SWE-bench Verified (median max-tool-calls 35 for solved
rollouts vs 55 for failed; pass@1 25.0%).

Fix:
- glm45.py / glm5.py: when the previous message is an assistant, emit the
  next message's role-opening token with is_sampled=True. The token stays
  attributed to that next message (msg_idx unchanged); only sampled_mask
  flips. Byte stream identical to apply_chat_template.
- base.py: build_training_sample now accepts role_to_mask=None. With a
  renderer that populates sampled_mask, the default behaviour becomes
  loss_mask = sampled_mask — the renderer's per-token "model would have
  sampled this" signal is the truth, and role-level filtering is opt-in
  rather than required. Callers passing role_to_mask keep the old
  AND-behaviour; callers without it now train on every sampled token
  (including the assistant's closing role marker).

system role only appears at the start of a GLM conversation, so its
opener is never the closer of an assistant turn — no logic needed for it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@hallerite hallerite force-pushed the feat/glm-per-turn-close-token branch from 3bfc257 to 0dbc5ff Compare May 26, 2026 19:58
Comment thread renderers/glm45.py Outdated
The bridge synthesizes the next-step prompt by appending new messages
(tool responses, follow-up user turns) on top of what the model actually
sampled this round (``previous_completion_ids``). Tokens it emits are
template scaffolding for the next generation, not model output from this
step. When the model fails to sample its stop token and the bridge has
to synthesize the boundary via a ``<|user|>`` / ``<|observation|>``
role-opener for the first new message, that opener stays
``is_sampled=False, is_content=False`` — even though :meth:`render` would
mark the same token ``True, True`` on the SFT path.

The discrepancy is intentional: render's masks describe what the model
*should* produce given a complete conversation (SFT target); bridge's
masks describe what it *actually* produced this step (RL signal). The
RL loss operates on ``previous_completion_ids`` exclusively; bridge
tokens belong to the subsequent prompt and must not be counted as
sampled output by downstream mask consumers (e.g. per-step credit
assignment in trajectory metrics).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread renderers/glm45.py
…body semantics

The previous commit attributed ``<|user|>`` / ``<|observation|>`` tokens
that close an assistant turn with ``is_content=True`` so the body-only
SFT path (``content_sft_roles``) would also train on them. That bled the
role-marker into ``content_mask_for_roles({"tool"})`` and
``content_token_spans_by_role()``, which downstream consumers expect to
return the actual tool response body only.

Keep ``is_sampled=True`` (the actual training signal that makes the
model learn to emit the stop marker after ``</tool_call>``) but revert
``is_content`` to ``False``: the role-opener is scaffolding, not body.
In the body-only SFT mode the assistant stop-signal training belongs to
the RL path anyway, not to SFT on tool-response bodies.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit cb07cc4. Configure here.

Comment thread renderers/base.py
# sampled_mask alone gates the loss when no role filter is
# supplied. ``sampled_mask[k]`` is True here (handled by the
# branch above), so this token is trainable.
loss_mask.append(True)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stop tokens still excluded from loss

High Severity

The build_training_sample function incorrectly excludes renderer-generated scaffolding tokens (like GLM's <|observation|>) from the loss_mask. Despite being sampled_mask=True, these tokens are masked out when role_to_mask filters them or when content_sft_roles is active, as the latter prioritizes is_content (which is False for scaffolding). This prevents training on these critical tokens.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit cb07cc4. Configure here.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the natural consequence of GLM's chat-template design where the per-turn stop signal lives inside the next message's span — there is no dedicated <|im_end|>-style terminator inside the assistant turn itself. We have three knobs and they trade off:

  1. msg_idx attribution of <|observation|> / <|user|> after an assistant turn. We keep it attached to the next message (the tool/user message it opens) rather than retro-attributing to the assistant. The byte stream still ends the assistant's turn at that token, but structurally it belongs to the next message — which is what every downstream consumer (message_token_spans, role-based slicing, bridge/extension code) expects.
  2. is_sampled=True on that opener — added by this PR. Captures the fact that the model would sample that token at inference to close its turn. This is what makes the default training path (role_to_mask=None) train the model to emit the stop.
  3. is_content=False on that opener — preserved from before. The token is a role marker, not message body, so content_mask_for_roles({"tool"}) and content_token_spans_by_role() correctly exclude it.

The two specialized modes you flag fall out of this:

  • role_to_mask=lambda m: m["role"] == "assistant" — by construction, "only train on assistant-attributed tokens." Stop opener is attributed to the tool message → excluded. Working as specified; if the caller wants the stop trained too, they pass role_to_mask=None (default) and rely on the sampled-only path.
  • content_sft_roles={"tool"} — body-only SFT on tool responses, gated by is_content. Scaffold tokens (including the stop opener) are intentionally is_content=False to keep content_token_spans_by_role honest. In this mixed RL-on-assistant + SFT-on-tool-body mode, the assistant stop-signal supervision belongs to the RL trajectory, not to body-only SFT.

In short, the GLM chat template can't express "this token belongs to message N but counts as message N-1's output for training" cleanly, and the choice we made here (keep structural attribution, use is_sampled as the per-token "model would sample this" signal) is the cleanest fit with the existing build_training_sample contract. Default-mode training works correctly; the filter-mode misses are explicit user opt-ins, not silent bugs.

The renderers repo has no ``[tool.ruff]`` override for ``line-length``, so
CI uses ruff's default of 88. Locally, ruff was picking up the parent
prime-rl workspace's ``line-length = 120`` (via parent-dir config
discovery), which is why ``ruff format --check`` passed locally but
failed CI. Reformatting the three files this PR touches at the repo's
actual line length.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Member

@mikasenghaas mikasenghaas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment to code ratio is crazy on this one

@hallerite hallerite merged commit 1d69232 into main May 26, 2026
11 checks passed
@hallerite hallerite deleted the feat/glm-per-turn-close-token branch May 26, 2026 20:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants