feat(glm): make next-message role marker trainable after assistant turns by hallerite · Pull Request #66 · PrimeIntellect-ai/renderers

hallerite · 2026-05-26T19:00:49Z

Summary

Before this change, that role marker was attributed to the next message with is_sampled=False, so SFT never trained the model to emit it. With LossMaskConfig.assistant=True and tool/user/system=False (the defaults), the only training signal saying "stop after </tool_call>" was the data.py EOS-fallback on the very last assistant of the conversation — intermediate turns got nothing. At inference the natural continuation of </tool_call>\n is another <tool_call>, which is exactly what we observe in practice.

Repro

GLM-Air SFT step_250 evaluated on SWE-bench Verified (full, 500 examples × 2 rollouts) emitted 50-283 parallel tool_calls per turn in a single assistant turn. 26% of assistant turns had len(tool_calls) > 1, with median max-tool-calls = 35 for solved rollouts vs 55 for failed. Pass@1 was 25.0%; clear correlation between high-burst rollouts and failure.

Fix

When the previous message is an assistant, attribute that single role-opening token to the assistant's msg_idx with is_sampled=True. The byte stream is unchanged (token_ids still match apply_chat_template); only message_indices and sampled_mask for that one token change. build_training_sample then yields loss_mask=True for it (because the assistant role is trainable by default), so SFT explicitly trains the model to emit the stop token after </tool_call> / content.

Applied symmetrically in both glm45.py and glm5.py for:

<|observation|> (when next message is tool)
<|user|> (when next message is user)
<|system|> (when next message is system)

Test plan

test_build_training_sample_ids_match still passes (token_ids unchanged)
New unit test: in a conversation [user, assistant w/ tool_call, tool, user, ...], the <|observation|> and <|user|> tokens that follow the assistant have loss_mask=True
Re-SFT GLM-Air on INTELLECT-5-SFT-Processed and re-evaluate on SWE-bench Verified; expect multi-tool-call ratio to drop sharply and pass@1 to increase from 25.0%

🤖 Generated with Claude Code

Note

Medium Risk
Changes which tokens receive SFT loss for GLM-4.5/5 (mask metadata only, not token_ids) and alters build_training_sample defaults; incorrect caller assumptions could raise ValueError or shift training signal.

Overview
GLM chat templates use the next message’s <|user|> / <|observation|> as inference stop tokens after assistant tool turns, but those tokens were previously marked is_sampled=False, so typical assistant-only SFT never trained the model to emit them after </tool_call>.

In glm45.py and glm5.py, when the prior turn is assistant, the following user/tool role opener is emitted with is_sampled=True (still is_content=False and same token_ids). bridge_to_next_turn keeps those openers is_sampled=False and documents the intentional SFT vs RL split.

In base.py, build_training_sample now accepts optional role_to_mask=None: if the renderer provides sampled_mask, loss follows sampled_mask alone so GLM stop markers on the next message can be trained without a role filter. Callers that omit role_to_mask on renderers without sampled_mask get ValueError instead of a wrong mask.

^{Reviewed by Cursor Bugbot for commit b9fae66. Bugbot is set up for automated code reviews on this repo. Configure here.}

Note

Mark next-message role tokens as trainable after assistant turns in GLM renderers

In glm45.py and glm5.py, the <|user|> and <|observation|> role opener tokens that immediately follow an assistant message are now emitted with is_sampled=True instead of always False.
In renderers/base.py, build_training_sample now accepts an optional role_to_mask (defaulting to None) and uses sampled_mask from the renderer when no role filter is supplied, raising ValueError if neither is available.
Behavioral Change: loss mask semantics for GLM models now include the next-message role marker token in the trainable region when it follows an assistant turn.

^{Macroscope summarized b9fae66.}

macroscopeapp · 2026-05-26T19:05:22Z

Approvability

Verdict: Needs human review

This PR changes training loss mask behavior to make role-marker tokens trainable after assistant turns, which is a significant runtime change affecting model training. Additionally, there is an unresolved high-severity review comment identifying a potential bug where the intended tokens may still be excluded from the loss mask.

^{You can customize Macroscope's approvability policy. Learn more.}

… turns GLM-4.5 / GLM-5 templates have no per-turn close token inside the assistant span — the next message's role marker (<|user|> / <|observation|>) doubles as the inference stop signal (see get_stop_token_ids). Before this change, that role marker was emitted with is_sampled=False, so build_training_sample never trained the model to predict it. At inference the natural continuation of </tool_call>\n is another <tool_call>, which is what we observed: GLM-Air SFT step_250 dumped 50-283 parallel tool_calls per turn on SWE-bench Verified (median max-tool-calls 35 for solved rollouts vs 55 for failed; pass@1 25.0%). Fix: - glm45.py / glm5.py: when the previous message is an assistant, emit the next message's role-opening token with is_sampled=True. The token stays attributed to that next message (msg_idx unchanged); only sampled_mask flips. Byte stream identical to apply_chat_template. - base.py: build_training_sample now accepts role_to_mask=None. With a renderer that populates sampled_mask, the default behaviour becomes loss_mask = sampled_mask — the renderer's per-token "model would have sampled this" signal is the truth, and role-level filtering is opt-in rather than required. Callers passing role_to_mask keep the old AND-behaviour; callers without it now train on every sampled token (including the assistant's closing role marker). system role only appears at the start of a GLM conversation, so its opener is never the closer of an assistant turn — no logic needed for it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The bridge synthesizes the next-step prompt by appending new messages (tool responses, follow-up user turns) on top of what the model actually sampled this round (``previous_completion_ids``). Tokens it emits are template scaffolding for the next generation, not model output from this step. When the model fails to sample its stop token and the bridge has to synthesize the boundary via a ``<|user|>`` / ``<|observation|>`` role-opener for the first new message, that opener stays ``is_sampled=False, is_content=False`` — even though :meth:`render` would mark the same token ``True, True`` on the SFT path. The discrepancy is intentional: render's masks describe what the model *should* produce given a complete conversation (SFT target); bridge's masks describe what it *actually* produced this step (RL signal). The RL loss operates on ``previous_completion_ids`` exclusively; bridge tokens belong to the subsequent prompt and must not be counted as sampled output by downstream mask consumers (e.g. per-step credit assignment in trajectory metrics). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…body semantics The previous commit attributed ``<|user|>`` / ``<|observation|>`` tokens that close an assistant turn with ``is_content=True`` so the body-only SFT path (``content_sft_roles``) would also train on them. That bled the role-marker into ``content_mask_for_roles({"tool"})`` and ``content_token_spans_by_role()``, which downstream consumers expect to return the actual tool response body only. Keep ``is_sampled=True`` (the actual training signal that makes the model learn to emit the stop marker after ``</tool_call>``) but revert ``is_content`` to ``False``: the role-opener is scaffolding, not body. In the body-only SFT mode the assistant stop-signal training belongs to the RL path anyway, not to SFT on tool-response bodies. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit cb07cc4. Configure here.}

cursor · 2026-05-26T20:44:21Z

+            # sampled_mask alone gates the loss when no role filter is
+            # supplied. ``sampled_mask[k]`` is True here (handled by the
+            # branch above), so this token is trainable.
+            loss_mask.append(True)


Stop tokens still excluded from loss

High Severity

The build_training_sample function incorrectly excludes renderer-generated scaffolding tokens (like GLM's <|observation|>) from the loss_mask. Despite being sampled_mask=True, these tokens are masked out when role_to_mask filters them or when content_sft_roles is active, as the latter prioritizes is_content (which is False for scaffolding). This prevents training on these critical tokens.

^{Reviewed by Cursor Bugbot for commit cb07cc4. Configure here.}

This is the natural consequence of GLM's chat-template design where the per-turn stop signal lives inside the next message's span — there is no dedicated <|im_end|>-style terminator inside the assistant turn itself. We have three knobs and they trade off:

msg_idx attribution of <|observation|> / <|user|> after an assistant turn. We keep it attached to the next message (the tool/user message it opens) rather than retro-attributing to the assistant. The byte stream still ends the assistant's turn at that token, but structurally it belongs to the next message — which is what every downstream consumer (message_token_spans, role-based slicing, bridge/extension code) expects.

is_sampled=True on that opener — added by this PR. Captures the fact that the model would sample that token at inference to close its turn. This is what makes the default training path (role_to_mask=None) train the model to emit the stop.

is_content=False on that opener — preserved from before. The token is a role marker, not message body, so content_mask_for_roles({"tool"}) and content_token_spans_by_role() correctly exclude it.

The two specialized modes you flag fall out of this:

role_to_mask=lambda m: m["role"] == "assistant" — by construction, "only train on assistant-attributed tokens." Stop opener is attributed to the tool message → excluded. Working as specified; if the caller wants the stop trained too, they pass role_to_mask=None (default) and rely on the sampled-only path.

content_sft_roles={"tool"} — body-only SFT on tool responses, gated by is_content. Scaffold tokens (including the stop opener) are intentionally is_content=False to keep content_token_spans_by_role honest. In this mixed RL-on-assistant + SFT-on-tool-body mode, the assistant stop-signal supervision belongs to the RL trajectory, not to body-only SFT.

In short, the GLM chat template can't express "this token belongs to message N but counts as message N-1's output for training" cleanly, and the choice we made here (keep structural attribution, use is_sampled as the per-token "model would sample this" signal) is the cleanest fit with the existing build_training_sample contract. Default-mode training works correctly; the filter-mode misses are explicit user opt-ins, not silent bugs.

The renderers repo has no ``[tool.ruff]`` override for ``line-length``, so CI uses ruff's default of 88. Locally, ruff was picking up the parent prime-rl workspace's ``line-length = 120`` (via parent-dir config discovery), which is why ``ruff format --check`` passed locally but failed CI. Reformatting the three files this PR touches at the repo's actual line length. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

mikasenghaas

comment to code ratio is crazy on this one