Skip to content

fix(server): parse gemma's call:<verb>{} plain-text tool emissions#323

Closed
easel wants to merge 2 commits into
Luce-Org:mainfrom
easel:fix/server-call-verb-tool-parser
Closed

fix(server): parse gemma's call:<verb>{} plain-text tool emissions#323
easel wants to merge 2 commits into
Luce-Org:mainfrom
easel:fix/server-call-verb-tool-parser

Conversation

@easel
Copy link
Copy Markdown
Collaborator

@easel easel commented May 31, 2026

Summary

Adds a sixth detection pattern to server/src/server/tool_parser.cpp::parse_tool_calls that recognizes the plain-text tool invocations gemma emits in chat-completion content — e.g. call:get_country_info{country: "France"}, call:execute-bead:read-file{path: "..."}, call:default_api:analyze_data{data: [...]}. None of the existing five envelope-shaped patterns (<tool_call>, bare <function=...>, <function=...(args)>, <tool_code>, bare JSON) match this shape, so a non-trivial slice of gemma's tool-using runs surfaces text-only to clients that expect structured tool_use blocks.

Why this matters

The 2026-05-30 gemma full bench (d9ecba6cc105-nvidia-geforce-rtx-3090-ti-gemma-full-2026-05-30-67f4) scored forge 0/30 because every row's iterations[0].output carried these call:<verb>{...} invocations as text content rather than structured Anthropic tool_use blocks. forge's WorkflowRunner expects a ToolCall payload and the row dies with ValidationError.

The downstream wiring is already in place: SseEmitter::accumulate calls parse_tool_calls, a non-empty result flips finish_reason to tool_calls, the Anthropic /v1/messages branch maps that to stop_reason="tool_use" with tool_use content blocks (server/src/server/http_server.cpp:2030-2090), and the OpenAI branch maps to choices[].message.tool_calls. The only gap was the parser.

What's in the new pattern

  • Regex with sentinel anchor. (^|[\s,;:\(\[\{\}\)\]\>])call:([A-Za-z0-9_.:\-]+)\s*\{ — narrative I'll call:foo does match (whitespace before call:) but narrative.call:foo doesn't (no sentinel). } is in the sentinel set so back-to-back invocations call:a{...}call:b{...} both fire.
  • Quote- and escape-aware brace scanner. Tracks {}/[] depth, skips over "..." / '...' / `...` string literals, honours \ escapes. Handles the multi-line nested-array shape from the snapshot.
  • Relaxed-JSON arg parser. Strict json::parse first; on failure, rewrites the buffer to quote bare identifier keys and normalize single/backtick quoted strings to double-quoted, then retries. Malformed args drop only that one invocation — no crash, no false positives, surrounding calls keep working.
  • Verb normalization. call:execute-bead:read-file → ToolCall name read-file (strips everything up to and including the last :).
  • Pattern ordering. Inserted as pattern I can’t reproduce the prefill performance on the RTX 3090 #5, ahead of the bare-JSON sweep (now feat: add DFlash for Windows #6). This defends against call:outer{"name": "inner", "arguments": {}} where the bare-JSON sweep would otherwise pluck out the inner {"name": ...} as a spurious inner ToolCall. The brace span pattern I can’t reproduce the prefill performance on the RTX 3090 #5 records in removals shadows the inner JSON from pattern feat: add DFlash for Windows #6's view via the existing overlaps() check.
  • Tool-allowed filter. Re-uses the existing tool_allowed(tools, verb) path so callers that pass a constrained tool list (forge) only get back tools they declared.

The full implementation plan, including the codex review and adjustments, lives at docs/experiments/server-call-verb-tool-parser-plan.md. Codex flagged the pattern-ordering hazard during review and the plan was revised before any code was written.

Test plan

14 new C++ unit cases in server/test/test_server_unit.cpp covering:

  • Single bare call, back-to-back calls
  • Namespaced verb (execute-bead:read-fileread-file)
  • Mixed snake_case + kebab-case verbs across newlines
  • Tool-allowed filter drops disallowed verbs
  • Mid-prose narrative.call:foo{...} rejected (no sentinel char)
  • Sure, I'll call:foo{...} accepted (whitespace sentinel)
  • Malformed (unterminated brace) args → no crash, no ToolCall
  • Inner {...} and } inside string values doesn't break the scanner
  • Strict-JSON args (already-quoted keys) parse on the fast path
  • Unquoted-key args parse on the relaxed path
  • cleaned_text is properly scrubbed of the matched span
  • Codex-requested: call:outer{"name": "inner", "arguments": {}} produces exactly one ToolCall named outer, not two
  • Multi-line nested-array args mirroring default_api:analyze_data from the snapshot

All 14 pass in a standalone driver. CI will run them via the existing test_server_unit target wired in server/CMakeLists.txt.

Reverse-compat with the client-side fix

The forge client-side workaround _parse_plain_text_tool_calls shipping on feat/lucebox-docker (commit deba2fd) becomes redundant once a server with this fix is deployed. It stays in place as defense-in-depth for older deployed servers — the synthesized ToolCalls match what the server now produces, so the only cost is a few extra cycles in the client. Don't strip the Python fallback in this PR.

Out of scope

  • Docker image rebuild + e2e bench re-run. The operator should rebuild the cuda12 image and confirm forge passes non-trivially without the client-side workaround once this merges.

🤖 Generated with Claude Code

easel and others added 2 commits May 31, 2026 11:34
Captures the diagnosis (gemma forge 0/30 on 2026-05-30), the proposed
sixth detection pattern, the relaxed-JSON arg parser sketch, the
unit-test matrix, and codex's review (which forced reordering the new
pattern to slot Luce-Org#5 ahead of the bare-JSON sweep to avoid interception
of nested name/arguments-shaped args).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a sixth detection pattern to `parse_tool_calls` that recognizes
the plain-text tool invocations gemma emits in chat-completion content
(`call:get_country_info{country: "France"}` /
`call:execute-bead:read-file{path: "..."}` / etc).

The 2026-05-30 gemma full bench scored forge 0/30 because every row's
output carried these `call:<verb>{...}` invocations as text rather
than structured `tool_use` content blocks. None of the existing five
envelope-shaped detectors (`<tool_call>`, `<function=...>`,
`<tool_code>`, bare JSON) match the bare `call:` shape.

The new pattern:
- Anchors on a sentinel character (whitespace, comma, semicolon,
  open/close bracket, etc.) before `call:` so narrative usages like
  `narrative.call:foo` don't match.
- Supports namespaced verbs (`execute-bead:read-file`,
  `default_api:fetch_sales_data`) and strips the namespace before
  using the verb as the ToolCall name.
- Extracts the args block via a quote- and escape-aware balanced-brace
  scanner that tolerates `"`, `'`, and `` ` `` string literals and
  tracks `[]` depth alongside `{}`.
- Parses the args as strict JSON first, then falls back to a relaxed
  rewrite that quotes bare identifier keys and normalizes single/
  backtick quoted strings to double-quoted before retrying. Malformed
  args drop the single invocation without crashing or polluting other
  calls.
- Runs *before* the bare-JSON sweep so that inner args of the form
  `call:outer{"name": "inner", "arguments": {}}` aren't hijacked into
  a spurious `inner` ToolCall by pattern Luce-Org#6.

Downstream the existing wiring takes over: SseEmitter::accumulate
already calls parse_tool_calls; a non-empty ToolCall list flips
finish_reason to `tool_calls`, which the Anthropic /v1/messages
branch maps to `stop_reason="tool_use"` with `tool_use` content
blocks (http_server.cpp:2030-2090) and the OpenAI branch maps to
`choices[].message.tool_calls`.

The forge client-side workaround `_parse_plain_text_tool_calls`
shipping on feat/lucebox-docker (commit deba2fd) becomes redundant
once a server with this fix is deployed. It stays in place as
defense-in-depth for older deployed servers.

Test plan: 14 new C++ unit cases in test_server_unit.cpp covering
single / back-to-back / namespaced / snake- and kebab-case verbs;
tool-allowed filtering; mid-prose rejection vs. whitespace-led
acceptance; malformed args drop; inner `{}` inside string literals;
strict-JSON and relaxed-keys arg parsing; cleaned_text scrubbing;
the codex-requested inner `name`/`arguments` interception case; and
multi-line nested-array args mirroring the snapshot data. All pass
in a standalone driver.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
easel added a commit to easel/lucebox-hub that referenced this pull request May 31, 2026
@easel
Copy link
Copy Markdown
Collaborator Author

easel commented May 31, 2026

Folded into PR #285 — commits 12c50c0 (plan + codex review) and cdb8b9c (implementation) are now part of feat/lucebox-docker (tip 329f611). The C++ fix lands via #285 instead of as a separate PR against main.

@easel easel closed this May 31, 2026
easel added a commit to easel/lucebox-hub that referenced this pull request May 31, 2026
Documents the server-side call:<verb>{} tool parser fix (PR Luce-Org#323) and
the C++17 compatibility fix for starts_with. Benchmarks running.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
easel added a commit to easel/lucebox-hub that referenced this pull request Jun 1, 2026
Smoke-testing the post-PR-Luce-Org#323 image (lucebox-hub:cuda12 @ 8039911) on
sindri's gemma-4-26b revealed a new emission pattern: the model
sometimes outputs ``_call:get_country_info{...}`` with a leading
underscore. This is a SentencePiece / chat-template tokenizer artifact
that became visible after bragi's channel-token routing fix
(commit 4b757d1) — the underscore is residual from the tokenizer's
internal serialization that earlier handling stripped.

Both parsers missed these invocations:

* Server-side (tool_parser.cpp:182): the sentinel character class
  ``[\s,;:\(\[\{\}\)\]\>]`` did not include ``_``. Added.
* Client-side (forge.py:32): ``\bcall:`` requires a word boundary
  before ``call``, but ``_`` is a word char so ``\b`` doesn't fire
  between ``_`` and ``c``. Replaced with explicit lookbehind on the
  same sentinel set (including ``_``).

Net result: ``_call:foo{...}`` now parses to a tool_use the same way
``call:foo{...}`` does. Tradeoff: ``my_call:foo{}`` mid-identifier
would also match, but real model outputs don't emit free-form
``my_call:`` text (tool names come from request tool defs).

Tests: +2 cases in test_forge_grader.py (underscore alone, mixed
back-to-back with both prefixed and bare). 16 → 18 forge_grader
tests, all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant