fix(#2083): inject thinking-disable into title gen API calls for local reasoning models (#2083) #3944

Open
rodboev wants to merge 5 commits into nesquena:master from rodboev:pr/gpu-idle-after-title-gen

Conversation

@rodboev

rodboev commented Jun 10, 2026

Contributor

Thinking Path

  • GPU utilization stays elevated after the user's prompt completes; terminal chat does not exhibit this. The culprit is background title generation calling reasoning-capable models (Qwen3, DeepSeek-R1) through OpenAI-compatible endpoints (LM Studio, llama.cpp) without disabling the thinking pass.
  • generate_title_raw_via_aux already injects extra_body={"reasoning": {"enabled": False}} when calling call_llm, but the generate_title_raw_via_agent OpenAI-compat else: branch had no equivalent injection, so the model runs a full reasoning pass for a one-line title.
  • PR #2107 (fix(streaming): skip budget-doubling title retry for reasoning-only responses (#2083)) shipped a retry-skip for llm_empty_reasoning responses but never addressed the root cause: the reasoning pass itself. Injecting thinking: {type: disabled} and reasoning: {enabled: False} via extra_body before the API call suppresses the reasoning pass at the endpoint level.
  • setdefault preserves any existing per-provider extra_body entries, including the Minimax reasoning_split: True which now adds to the same dict instead of replacing it.
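
The merge behaviour described in the last bullet can be sketched in isolation. This is a minimal illustration of the initial, unconditional version of the change, not the code in api/streaming.py: the helper name `build_title_extra_body` and its `is_minimax` flag are hypothetical, and only the key names come from the description above.

```python
def build_title_extra_body(existing, is_minimax=False):
    """Return a copy of `existing` extra_body with the reasoning-disable keys merged in.

    Hypothetical helper for illustration only.
    """
    extra = dict(existing or {})
    # setdefault writes a key only when it is absent, so any value a
    # provider-specific branch already set is left untouched.
    extra.setdefault("thinking", {"type": "disabled"})
    extra.setdefault("reasoning", {"enabled": False})
    if is_minimax:
        # Added to the same dict instead of replacing it.
        extra["reasoning_split"] = True
    return extra


plain = build_title_extra_body(None)
custom = build_title_extra_body({"thinking": {"type": "enabled"}})
minimax = build_title_extra_body({}, is_minimax=True)
```

Here `custom` keeps its own `thinking` value while still gaining the `reasoning` key, and `minimax` carries all three keys in one dict.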

What Changed

  • api/streaming.py: in generate_title_raw_via_agent, injected extra_body with thinking: {type: disabled} and reasoning: {enabled: False} into api_kwargs before the Minimax block in the OpenAI-compat path; simplified the Minimax block to append reasoning_split to the same dict

Why It Matters

Local reasoning models no longer burn GPU time on a hidden thinking pass for background title generation, eliminating the post-prompt GPU spike that users reported.

Verification

$env:PYTHONUTF8 = '1'; $env:BROWSER = 'echo'
..\hermes-agent\venv\Scripts\python.exe -m pytest tests/test_issue2083_gpu_title_gen.py -v --timeout=60
..\hermes-agent\venv\Scripts\python.exe -m pytest tests/ -v --timeout=60

Risks / Follow-ups

  • The thinking and reasoning keys are OpenAI-compat conventions supported by LM Studio, llama.cpp, and vLLM; endpoints that don't recognize them silently ignore unknown extra_body keys, so this is backward-compatible.
  • If a future provider uses extra_body.thinking for a different purpose, the setdefault guard prevents overwriting their value.

Model Used

Claude Opus 4.6 via Claude Code CLI

@greptile-apps

greptile-apps Bot commented Jun 10, 2026


Greptile Summary

This PR injects thinking: {type: disabled} and reasoning: {enabled: False} into extra_body for the generate_title_raw_via_agent OpenAI-compat path, and refines the aux path to apply the same reasoning disable conditionally rather than unconditionally, so local reasoning models (Qwen3, DeepSeek-R1) no longer run a full thinking pass during background title generation.

  • Agent path (generate_title_raw_via_agent): new hostname + model-capability heuristic (_route_ok + resolve_model_reasoning_efforts / _reasoning_name_candidates) gates both thinking and reasoning disable injection; Minimax also receives reasoning_split: True as before.
  • Aux path (generate_title_raw_via_aux): same heuristic applied; reasoning: {enabled: False} is now conditional (was unconditional) and extra_body is passed as None instead of {} when no keys apply.
  • Tests: new suite covers skip-logic, per-route injection, Minimax compatibility, and LM Studio name-heuristic cases.

Confidence Score: 4/5

The injection logic is backward-compatible (unknown extra_body keys are silently ignored by endpoints), but the _route_ok hostname heuristic misses IPv6 loopback and private-network IPs, so some local deployments won't benefit from the fix.

The core logic — importing real capability-detection helpers and gating injection on a route allowlist — is correct and well-tested for the named scenarios. The _route_ok substring check omits ::1 and RFC-1918 addresses, meaning users who expose LM Studio or llama.cpp on a private LAN IP or IPv6 loopback silently get no reasoning-disable injection. The identical omission exists in both the agent and aux paths. No data loss or security risk, but the stated fix doesn't reach all affected setups.

The _route_ok heuristic in api/streaming.py (lines 2454 and 2578) is the only area needing a closer look — the rest of the change is straightforward.
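
One possible way to close the IPv6-loopback and private-network gap is to parse the host out of the base URL and classify it with the standard library rather than substring-matching the URL. The sketch below is an assumption about how such a check could look; `is_local_base_url` and `LOCAL_HOSTNAMES` are hypothetical names, not helpers from this PR.

```python
import ipaddress
from urllib.parse import urlparse

# Hypothetical: hostnames treated as local without an IP lookup.
LOCAL_HOSTNAMES = {"localhost"}


def is_local_base_url(base_url):
    """Return True when base_url points at loopback, RFC 1918, or unspecified hosts."""
    # urlparse lowercases the hostname and strips the brackets around IPv6 literals.
    # A URL without a scheme (e.g. "localhost:1234") yields no hostname and is rejected.
    host = urlparse(base_url or "").hostname
    if not host:
        return False
    if host in LOCAL_HOSTNAMES or host.endswith(".localhost"):
        return True
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal and not a known local name: treat as remote.
        return False
    return addr.is_loopback or addr.is_private or addr.is_unspecified


checks = {
    url: is_local_base_url(url)
    for url in (
        "http://localhost:1234/v1",
        "http://[::1]:1234/v1",
        "http://192.168.1.20:1234/v1",
        "http://0.0.0.0:8080/v1",
        "https://api.openai.com/v1",
    )
}
```

Matching on the parsed hostname also avoids accepting a remote URL that merely contains the text `localhost` somewhere in its path or subdomain.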

Important Files Changed

| Filename | Overview |
| --- | --- |
| api/streaming.py | Adds thinking/reasoning-disable injection to both the agent and aux title-generation paths using a hostname + model-capability heuristic; inlines the same _route_ok logic twice rather than sharing a helper. |
| tests/test_issue2083_gpu_title_gen.py | New regression test suite covering skip-logic, per-route injection presence/absence, Minimax compatibility, and LM Studio name-heuristic paths; relies on MagicMock agents which bypasses real attribute resolution. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Title generation triggered] --> B{API mode?}
    B -- codex_responses --> C[Codex path - unchanged]
    B -- anthropic --> D[Anthropic path - unchanged]
    B -- OpenAI-compat else --> E[Build api_kwargs]
    E --> F{_is_minimax?}
    F -- yes --> G[set thinking disabled\nset reasoning disabled\nset reasoning_split True]
    F -- no --> H{_caps non-empty\nAND _route_ok?}
    H -- yes --> I[set thinking disabled\nset reasoning disabled]
    H -- no --> J{_name_heuristic\n_route_ok AND model name\nmatches reasoning pattern?}
    J -- yes --> I
    J -- no --> K[No extra_body injection]
    G --> L[api_kwargs extra_body updated]
    I --> L
    K --> M[Call completions.create]
    L --> M
    M --> N{Response has content?}
    N -- yes --> O[Return title]
    N -- no, llm_empty_reasoning --> P[Skip remaining attempts\nfall back to local title]

Reviews (4): Last reviewed commit: "Fall back to model-name heuristic for lo..."

@nesquena-hermes

Collaborator

Thanks @rodboev — the fix correctly targets the right gap (background title generation pegging local reasoning models), and the setdefault approach preserving the MiniMax reasoning_split route is clean. But the Opus review pass on the brick batch caught a regression I want to fix before this ships, so I'm marking it changes-requested rather than absorbing it as-is.

The issue: the injection in generate_title_raw_via_agent adds extra_body thinking:{type:disabled} + reasoning:{enabled:False} unconditionally for every chat-completions provider. But the agent deliberately gates exactly these keys behind _supports_reasoning_extra_body() (run_agent.py), which returns False for strict direct providers — api.mistral.ai is explicitly excluded, and any non-OpenRouter/non-Nous/non-GitHub/non-LM-Studio route returns False — precisely because "some providers/routes reject reasoning with 400s."

Bypassing that gate means on a strict cloud provider every title attempt 400s, the exception is swallowed at debug level, and titles silently degrade to the heuristic fallback. Chat itself is unaffected (not a brick), but it trades the local-GPU fix for a silent title regression for direct-cloud users — and additionally sends the nonstandard thinking key to MiniMax, which previously got only reasoning_split.

Requested fix (small, ~5-10 lines): gate the injection so it only fires when reasoning extra_body is actually supported for the route. The cleanest option is to reuse the agent's own gate, e.g.:

_tg_supports = False
try:
    _tg_supports = bool(agent._supports_reasoning_extra_body())
except Exception:
    _tg_supports = False
if _tg_supports:
    _tg_extra = dict(api_kwargs.get('extra_body') or {})
    _tg_extra.setdefault('thinking', {'type': 'disabled'})
    _tg_extra.setdefault('reasoning', {'enabled': False})
    api_kwargs['extra_body'] = _tg_extra

That keeps the GPU fix for exactly the reasoning-capable local routes #2083 is about (LM Studio Qwen3/DeepSeek-R1 resolve True via the LM Studio probe) while not sending the keys to providers that 400 on them. Alternatively, catch a 400 mentioning unrecognized/extra fields and retry once without the injected keys (self-correcting across providers, also ~10 lines) — either approach is fine.
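
The alternative mentioned above (retry once without the injected keys) could look roughly like the following. It is a sketch under stated assumptions: `BadRequestError` is a stand-in for whatever exception the client library raises on HTTP 400, `create_with_fallback` is a hypothetical wrapper, and for brevity it retries on any 400 rather than inspecting the error message for unrecognized-field wording.

```python
class BadRequestError(Exception):
    """Stand-in for a client library's HTTP 400 exception (hypothetical)."""
    status_code = 400


# Keys this PR injects; reasoning_split is deliberately not in this list.
INJECTED_KEYS = ("thinking", "reasoning")


def create_with_fallback(create, api_kwargs):
    """Call create(**api_kwargs); on a 400, retry once without the injected keys."""
    try:
        return create(**api_kwargs)
    except BadRequestError:
        extra = dict(api_kwargs.get("extra_body") or {})
        if not any(key in extra for key in INJECTED_KEYS):
            raise  # Nothing of ours to strip, so the 400 has another cause.
        for key in INJECTED_KEYS:
            extra.pop(key, None)
        retry_kwargs = dict(api_kwargs)
        if extra:
            retry_kwargs["extra_body"] = extra
        else:
            retry_kwargs.pop("extra_body", None)
        return create(**retry_kwargs)


# Fake strict endpoint that rejects the nonstandard thinking key.
calls = []


def strict_create(**kwargs):
    calls.append(kwargs)
    if "thinking" in (kwargs.get("extra_body") or {}):
        raise BadRequestError("unrecognized field: thinking")
    return "Title"


result = create_with_fallback(
    strict_create,
    {
        "model": "example-model",
        "extra_body": {
            "thinking": {"type": "disabled"},
            "reasoning": {"enabled": False},
            "reasoning_split": True,
        },
    },
)
```

The trade-off against the gate is one wasted request per title on strict providers, in exchange for not having to maintain a route allowlist.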

The MiniMax reasoning_split path should stay exactly as it is. Once gated, please confirm the existing test_issue2083_gpu_title_gen.py still passes and add a case asserting the keys are NOT injected for a strict/direct provider. Happy to re-gate as soon as it's pushed. 🙏

@nesquena-hermes nesquena-hermes added the changes-requested label (Maintainer left detailed feedback requesting changes; PR is waiting on author to address) Jun 10, 2026
…ty check

Non-reasoning providers (Mistral direct, plain OpenAI) would 400 on the
unconditional thinking/reasoning extra_body keys injected during title
generation.  Gate behind resolve_model_reasoning_efforts() and treat
_is_minimax_route() as an independent reasoning signal.
@rodboev

rodboev commented Jun 10, 2026

Contributor Author

The title-gen path was unconditionally injecting thinking and reasoning disable keys into extra_body for all OpenAI-compatible providers. Strict providers like Mistral direct or plain OpenAI that don't recognize these parameters would 400 on title generation and silently fall back to heuristic titles.

Fixed in 8fb4d989:

  1. Both the agent and aux title-gen paths now gate the extra_body reasoning-disable injection behind resolve_model_reasoning_efforts(), which only returns a non-empty list for models known to support reasoning (o-series, GPT-5+, Claude 3.7+, Qwen 3+, DeepSeek V/R-series, etc.).
  2. _is_minimax_route() is treated as an independent reasoning signal so Minimax models still get the full reasoning_split + disable treatment even when the model name alone doesn't trigger the heuristic.
  3. Added a negative test (test_api_kwargs_omits_reasoning_keys_for_non_reasoning_model) confirming that non-reasoning models like mistral-large get no reasoning keys in extra_body.

All 7 issue-specific tests + 41 reasoning/model-resolver tests pass locally.

@nesquena-hermes

Collaborator

Thanks @rodboev — the model-capability gating via resolve_model_reasoning_efforts() is a clean improvement over the unconditional injection, and applying it to both the agent and aux paths is right. I staged this for release and ran the full gate; it fixes the case my CR named (non-reasoning models on strict providers), but the Opus reviewer caught that the gate is route-blind and re-creates the same 400 class for a different slice, so I'm holding one more round.

The remaining issue: the gate is model-capability-based but route-blind

generate_title_raw_via_agent now injects thinking/reasoning extra_body keys whenever resolve_model_reasoning_efforts(model, ...) is non-empty — but that's true for gpt-5.x / o3 on plain OpenAI direct, not just on the local/OpenRouter routes. I verified empirically:

resolve_model_reasoning_efforts("gpt-5.5", provider_id="openai") -> ['minimal','low','medium','high','xhigh','max']   # INJECTS
resolve_model_reasoning_efforts("o3",      provider_id="openai") -> [...]                                              # INJECTS
resolve_model_reasoning_efforts("gpt-4.1", provider_id="openai") -> []                                                # ok, no inject
resolve_model_reasoning_efforts("mistral-large", provider_id="mistral") -> []                                         # ok

So vs master (which sent no extra keys on this path), this now sends thinking + reasoning body params to api.openai.com for OpenAI-direct GPT-5/o-series. The agent codebase deliberately withholds exactly these on direct routes — _supports_reasoning_extra_body() (run_agent.py) returns True only for OpenRouter/Nous/GitHub/LMStudio because "some providers/routes reject reasoning with 400s." Net effect: OpenAI-direct reasoning-model users would 400 on title-gen and silently fall back to heuristic titles — a regression vs master for a common config (bounded: chat turns unaffected, exceptions caught).

Requested fix (one line, both paths)

AND the capability gate with the agent's own route-tolerance gate, so the injection only fires for routes known to accept the keys (which still covers the local LM Studio / llama.cpp / Ollama case #2083 targets):

_route_ok = (
    callable(getattr(agent, "_supports_reasoning_extra_body", None))
    and agent._supports_reasoning_extra_body()
) or _is_local_base_url(_agent_base_url)   # or your preferred local-host check
if (_is_minimax or resolve_model_reasoning_efforts(_agent_model, provider_id=_agent_provider, base_url=_agent_base_url)) and _route_ok:
    _tg_extra.setdefault("thinking", {"type": "disabled"})
    _tg_extra.setdefault("reasoning", {"enabled": False})
    ...

(The aux path has the same shape — gate it the same way. This is essentially the route-gate my original CR pointed at; the capability check is a good addition on top, not a replacement for it.) Please re-confirm the existing #2083 test still passes and add a case asserting NO injection for a reasoning model on an OpenAI-direct route. Happy to take it straight to release once that's in — the rest is solid. 🙏

@rodboev

rodboev commented Jun 10, 2026

Contributor Author

Both title-gen paths now AND the model-capability check with route tolerance:

  1. Agent path: gates behind agent._supports_reasoning_extra_body(), so reasoning-capable models on strict direct providers (OpenAI, Mistral) skip the injection entirely.
  2. Aux path: inline route check against known-safe hosts (OpenRouter, Nous, localhost, LMStudio) since there's no agent object available.
  3. Minimax stays ungated as before — its reasoning_split path is independent.
  4. Added test_api_kwargs_omits_reasoning_keys_for_strict_direct_route confirming no injection for gpt-5.5 on api.openai.com when the route gate returns False. All 8 issue tests + 11 reasoning-effort tests pass.

@nesquena-hermes

Collaborator

Thanks @rodboev — the route-tolerance gating is exactly right and resolves the OpenAI-direct 400 regression cleanly (verified: gpt-5.5/o3 on OpenAI-direct → _supports_reasoning_extra_body() is False, no injection; OpenRouter → injects). One more issue surfaced on the regression gate, and it's an important one because it regresses the original #2083 goal for the headline case — so I'm holding one more round (I shipped #3950 separately; this one just needs this last fix).

The new gate suppresses the disable payload for LM Studio reasoning models — the exact #2083 case

The injection now requires resolve_model_reasoning_efforts(model, provider_id, base_url) to be non-empty AND a tolerant route. But resolve_model_reasoning_efforts() returns [] for LM Studio models, because LM Studio reasoning capability is probed live (on/off options) rather than living in the static effort table. I verified:

resolve_model_reasoning_efforts("qwen3-8b",   provider_id="lmstudio") -> []   # gate SUPPRESSES
resolve_model_reasoning_efforts("deepseek-r1", provider_id="lmstudio") -> []   # gate SUPPRESSES

So on the aux path, @lmstudio:qwen3-8b now sends extra_body=None, where master sent {"reasoning": {"enabled": False}}. Net effect: the reasoning-disable payload is no longer sent for LM Studio Qwen3/DeepSeek-R1 — which is precisely the local-GPU-burn case #2083 is about. The capability gate is doing the opposite of what we want for the route the issue targets.

Requested fix (both aux + agent gates)

When the route is tolerant/local (LM Studio, localhost/127.0.0.1, llama.cpp), fall back to a model-name reasoning heuristic when resolve_model_reasoning_efforts() is empty — so a known reasoning-model name (qwen3*, deepseek-r*/deepseek-v*, *-reasoner, etc.) still gets the disable payload on those routes. Something like:

_caps = resolve_model_reasoning_efforts(model, provider_id=provider, base_url=base_url)
_local_reasoning_name = _route_ok and not _caps and _looks_like_reasoning_model(model)  # name heuristic
if _is_minimax or (_caps and _route_ok) or _local_reasoning_name:
    ...inject...

Apply it symmetrically to both generate_title_raw_via_aux and generate_title_raw_via_agent, and add a test asserting @lmstudio:qwen3-8b (and a DeepSeek-R local) DO get the reasoning-disable payload while OpenAI-direct GPT-5 still does NOT. That closes the loop: strict clouds stay clean (the fix you just made), and the local reasoning models #2083 reported get the payload again.
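
To make the requested decision concrete, here is a self-contained sketch of the three-way gate with a name heuristic. The regex patterns are an assumption derived from the examples listed above (qwen3*, deepseek-r*/deepseek-v*, *-reasoner); `looks_like_reasoning_model` and `should_inject_disable` are hypothetical names, and `caps` stands for whatever resolve_model_reasoning_efforts() returned.

```python
import re

# Assumed patterns, taken from the model names mentioned in the review.
REASONING_NAME_PATTERNS = (
    re.compile(r"qwen3"),
    re.compile(r"deepseek-[rv]\d"),
    re.compile(r"-reasoner$"),
)


def looks_like_reasoning_model(model):
    """Name-only heuristic: True when the model name matches a known reasoning family."""
    name = (model or "").strip().lower()
    return any(pattern.search(name) for pattern in REASONING_NAME_PATTERNS)


def should_inject_disable(model, caps, route_ok, is_minimax=False):
    """Decide whether to send the reasoning-disable payload for a title request."""
    # The name heuristic only applies on tolerant routes with no static capability entry.
    local_reasoning_name = route_ok and not caps and looks_like_reasoning_model(model)
    return bool(is_minimax or (caps and route_ok) or local_reasoning_name)


decisions = {
    "lmstudio qwen3-8b": should_inject_disable("qwen3-8b", [], route_ok=True),
    "lmstudio deepseek-r1": should_inject_disable("deepseek-r1", [], route_ok=True),
    "openai-direct gpt-5.5": should_inject_disable("gpt-5.5", ["low", "high"], route_ok=False),
    "local mistral-large": should_inject_disable("mistral-large", [], route_ok=True),
    "minimax": should_inject_disable("any-model", [], route_ok=False, is_minimax=True),
}
```

With these inputs the two LM Studio entries and the Minimax entry evaluate to True, while the OpenAI-direct and non-reasoning entries evaluate to False, which is the split the requested test is meant to assert.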

Really close now — this is the last edge. Reopen for re-gate when it's in. 🙏

@nesquena-hermes

Collaborator

Checked the commit pushed after my last round (40da744e). Read the new agent + aux gates in api/streaming.py at HEAD and re-read the agent's _supports_reasoning_extra_body() in run_agent.py:4676-4716. CI is now green across all shards (3.11/3.12/3.13). Two things:

The LM Studio regression is fixed — confirmed

The name-heuristic fallback closes the #2083 headline case. For @lmstudio:qwen3-8b, resolve_model_reasoning_efforts() returns [], but _candidate_supports_reasoning("qwen3-8b") now matches the Qwen-3+ branch (api/config.py:2442-2448) and deepseek-r1 matches the DeepSeek R-series branch, so the disable payload fires again on the aux path. That's the right outcome and matches what I asked for.

But the agent path abandoned the canonical gate it had one commit ago

My 22:17 review endorsed agent._supports_reasoning_extra_body() specifically because it already gets the LM Studio case right. 40da744e replaced it with an inline hostname allowlist (api/streaming.py:2578):

_route_ok = any(h in _agent_base_lower for h in ('openrouter', 'nousresearch.com', 'localhost', '127.0.0.1', '0.0.0.0')) or (_agent_provider or '').strip().lower() == 'lmstudio'

The agent's real gate (run_agent.py:4676-4716) is richer than this allowlist in two ways the reimplementation drops:

  1. GitHub Models routes. The agent gate special-cases models.github.ai / api.githubcopilot.com via github_model_reasoning_efforts(self.model) (hermes_cli/models.py:3251). The inline allowlist has no GitHub host, so GitHub-routed reasoning models never get the disable payload. Not a regression vs master (master sent nothing here), but it's a coverage gap the endorsed gate would have covered for free.

  2. LM Studio live capability vs static name match. The agent gate probes LM Studio's published allowed_options (_lmstudio_reasoning_options_cached() → lmstudio_model_reasoning_options, hermes_cli/models.py:2953) and returns any(opt and opt != "off"). The reimplementation trusts the model name instead. For a name that looks reasoning-capable but is served by an off-only LM Studio build, the heuristic sends reasoning:{enabled:False} where the live probe would have said "no real reasoning, skip." Harmless on LM Studio (ignores the key), but it's now a hand-rolled approximation of a probe the agent already owns.

Recommendation: keep the agent gate on the agent path, drop only the over-restrictive conjunction

The original 400-regression I flagged at 21:53 came from the resolve_model_reasoning_efforts(...) AND _route_ok conjunction over-firing on OpenAI-direct GPT-5/o3 — not from _supports_reasoning_extra_body() itself. _supports_reasoning_extra_body() alone already returns the correct answer for every case in this thread: OpenAI-direct → False (no openrouter in base_url, provider≠lmstudio) → no inject; LM Studio qwen3 → live probe ["off","on"] → True → inject; OpenRouter reasoning models → True. So the minimal agent-path gate is just:

_route_ok = (
    callable(getattr(agent, '_supports_reasoning_extra_body', None))
    and agent._supports_reasoning_extra_body()
)
if _is_minimax or _route_ok:
    _tg_extra.setdefault('thinking', {'type': 'disabled'})
    _tg_extra.setdefault('reasoning', {'enabled': False})

That recovers GitHub coverage and the LM Studio live probe without re-introducing the OpenAI-direct 400, and keeps WebUI from drifting against the agent's gate over time. The name-heuristic fallback is still the right call for the aux path (generate_title_raw_via_aux), where there's no agent object to consult — only the agent path has the cleaner option available.

Functionally the PR is correct and shippable as-is; this is a "use the gate you already have" maintainability note, not a blocker.
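
A small sketch of the minimal agent-path gate, exercised against stand-in agents, may help when writing the accompanying tests. The three agent classes and `agent_route_ok` are hypothetical; the only thing taken from the thread is the `callable(getattr(...)) and agent._supports_reasoning_extra_body()` shape. It also illustrates why bare MagicMock agents need care: an unconfigured mock auto-creates the method and returns a truthy mock.

```python
from unittest.mock import MagicMock


def agent_route_ok(agent):
    """Defer to the agent's own gate; agents without the method count as not tolerant."""
    gate = getattr(agent, "_supports_reasoning_extra_body", None)
    return bool(callable(gate) and gate())


class TolerantAgent:
    """Stand-in for an agent on a route that accepts the keys (hypothetical)."""
    def _supports_reasoning_extra_body(self):
        return True


class StrictAgent:
    """Stand-in for an agent on a strict direct provider (hypothetical)."""
    def _supports_reasoning_extra_body(self):
        return False


class LegacyAgent:
    """Stand-in for an agent that predates the gate method (hypothetical)."""


configured_mock = MagicMock()
configured_mock._supports_reasoning_extra_body.return_value = False

results = {
    "tolerant": agent_route_ok(TolerantAgent()),
    "strict": agent_route_ok(StrictAgent()),
    "legacy": agent_route_ok(LegacyAgent()),
    # A bare MagicMock fabricates the attribute and its call result is truthy.
    "bare_mock": agent_route_ok(MagicMock()),
    "configured_mock": agent_route_ok(configured_mock),
}
```

In a negative test for a strict route, setting `return_value = False` explicitly (as with `configured_mock`) is what makes the assertion meaningful; a bare mock would pass the gate.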

