fix(context): broaden streaming stale-compressor guard to any model-window mismatch #4618

Closed
allenliang2022 wants to merge 1 commit into nesquena:master from allenliang2022:fix/streaming-broaden-stale-compressor-guard

Conversation

@allenliang2022

Contributor

Problem

The live-usage SSE snapshot (api/streaming.py, _run_agent_streaming) emits context_length from agent.context_compressor.context_length. The #3256 "default-only guard" only undoes a stale value when it exactly equals the config global cap (model.context_length):

if _cl_u > 0 and _cc_cl_u == _cl_u and _def_u and _sm_u and not _mmcd_u(...):

But a compressor can hold a different model's window, not the config cap. Concretely: a session on claude-opus-4.8 (Copilot: 1M context / 936k prompt) whose compressor was seeded/last-updated with claude-opus-4.5's 168000 window. Since 168000 != model.context_length, the == check is false and the stale 168k passes straight through to every done event.

User-visible symptom

  • Refresh (GET /api/session hydration via _resolve_context_length_for_session_model) → shows the correct 1.0M
  • Send a message → the done SSE re-pushes the compressor's stale 168k → indicator reverts to 168.0k, and "Auto-compress at X" fires far too early (142.8k = 168k×0.85).

i.e. "refresh shows 1M, the moment I send a message it drops to 168k." Hydration and streaming diverge.

Fix

Broaden the guard: always resolve the real per-model window for the agent's current model and surface it whenever it differs from the compressor's cached value (_real_u and _real_u != _cc_cl_u).
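
The difference between the two guards can be sketched with plain integers. This is a simplified illustration, not the code in api/streaming.py: the function names `old_guard_fires` and `broadened_guard_fires` are hypothetical, the old guard is reduced to its equality test (the real condition has further terms), and the 200,000 global cap is an assumed value.

```python
def old_guard_fires(compressor_cl: int, config_cap: int) -> bool:
    """Simplified #3256 check: only an exact match with the global cap counts as stale."""
    return config_cap > 0 and compressor_cl == config_cap


def broadened_guard_fires(compressor_cl: int, real_cl: int) -> bool:
    """Broadened check: any resolved window that differs from the cached one is corrected."""
    return bool(real_cl) and real_cl != compressor_cl


# 168_000 and 1_000_000 come from the description; the 200_000 cap is an assumption.
stale, real, cap = 168_000, 1_000_000, 200_000
print(old_guard_fires(stale, cap))         # False: the stale value passes through
print(broadened_guard_fires(stale, real))  # True: the mismatch is detected
```

When the resolved window equals the cached one, or the lookup returns nothing, the broadened check leaves the compressor value alone.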

Resolution reuses the same helper that hydration uses (routes._context_length_lookup_inputs_for_model + get_model_context_length), so the streaming/SSE path and GET /api/session resolve to the identical value. This honors nested per-model config overrides (model.<provider>.models.<model>.context_length) and custom-provider keys.

Reusing the helper (instead of hand-reading the flat top-level model.context_length, which is None under nested per-model config) is deliberate: a hand-read would fall through to the live catalog and return 936k while hydration returns 1M, creating a new "refresh 1M / send 936k" mismatch.
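
To illustrate why the flat read is not enough, here is a rough sketch of a nested config and two ways of reading it. The dictionary layout follows the key path quoted above, but the `copilot` provider key, the `flat_read` and `nested_read` helpers, and the fallback order are assumptions made for the example; the real resolution lives in the routes helper and get_model_context_length.

```python
from typing import Any, Dict, Optional

# Hypothetical config, assuming the layout model.<provider>.models.<model>.context_length.
cfg: Dict[str, Any] = {
    "model": {
        "context_length": None,  # flat top-level cap is unset under nested config
        "copilot": {"models": {"claude-opus-4.8": {"context_length": 1_000_000}}},
    }
}


def flat_read(cfg: Dict[str, Any]) -> Optional[int]:
    """Hand-read of the flat key only; it cannot see a per-model override."""
    return cfg.get("model", {}).get("context_length")


def nested_read(cfg: Dict[str, Any], provider: str, model: str) -> Optional[int]:
    """Prefer the per-model override, then fall back to the flat key."""
    model_cfg = cfg.get("model", {})
    provider_cfg = model_cfg.get(provider)
    if isinstance(provider_cfg, dict):
        per_model = provider_cfg.get("models", {}).get(model, {})
        if per_model.get("context_length"):
            return per_model["context_length"]
    return model_cfg.get("context_length")


print(flat_read(cfg))                                  # None
print(nested_read(cfg, "copilot", "claude-opus-4.8"))  # 1000000
```

A caller that only does the flat read gets None and has to fall through to some other source, which is how the two paths could end up disagreeing.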

Notes

  • The per-stream _real_ctx_cache from the existing perf commit is preserved — the lookup still runs at most once per stream, not per metering tick (per-tick resolution previously froze non-default-model streams).
  • Backend-only, no frontend changes.
  • Threshold rescaling (threshold_tokens * real/orig) is unchanged and now also benefits the broadened case.
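
The threshold rescaling in the last note can be checked with the numbers from this description. A minimal sketch, assuming integer arithmetic and a hypothetical `rescale_threshold` helper; the actual code may round differently.

```python
def rescale_threshold(orig_threshold: int, orig_window: int, real_window: int) -> int:
    """Scale the auto-compress threshold so it keeps the same fraction of the window."""
    if orig_window <= 0:
        return orig_threshold  # nothing to scale against; leave it untouched
    return orig_threshold * real_window // orig_window


# 85% of the stale 168k window, as in the symptom above.
stale_threshold = 168_000 * 85 // 100
print(stale_threshold)                                         # 142800
print(rescale_threshold(stale_threshold, 168_000, 1_000_000))  # 850000
```

The rescaled threshold stays at 85% of the window, so auto-compress would trigger at 850k rather than 142.8k.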

Verification

Verified end-to-end on a real claude-opus-4.8 / copilot turn (source build, source WebUI): the live indicator went 168.0k → 1.0M and no longer reverts on subsequent messages; the auto-compress marker rescaled to the real window.

…indow mismatch

The live-usage snapshot's nesquena#3256 default-only guard only corrected the
compressor's cached context_length when it exactly equalled the config
global cap (model.context_length). A compressor left holding a *different*
model's window — e.g. a claude-opus-4.8 (1M) session whose compressor was
seeded/last-updated with claude-opus-4.5's 168k — does not satisfy that
== check, so the stale 168k passed straight through to every 'done' event.
Symptom: refresh (GET /api/session hydration) shows the correct 1M, but
sending a message reverts the indicator to 168k, and the auto-compress
marker fires far too early.

Broaden the guard: always resolve the real per-model window for the agent's
CURRENT model and surface it whenever it differs from the compressor's
cached value. Resolution reuses the SAME helper hydration uses
(routes._context_length_lookup_inputs_for_model + get_model_context_length)
so the streaming/SSE path and GET /api/session land on the IDENTICAL value,
honoring nested per-model config overrides
(model.<provider>.models.<model>.context_length) and custom-provider keys.
Reusing the helper (instead of hand-reading the flat top-level
model.context_length, which is None under nested config) avoids a new
'refresh 1M / send-a-message 936k' mismatch.

The per-stream _real_ctx_cache (resolve at most once per stream, not per
metering tick) is preserved. Backend-only, no frontend changes.

Verified end-to-end on a real opus-4.8/copilot turn: indicator 168.0k -> 1.0M
and no longer reverts on subsequent messages.
@greptile-apps

greptile-apps Bot commented Jun 21, 2026

Greptile Summary

This PR broadens the streaming stale-compressor guard in api/streaming.py to fix a divergence between the hydration path (GET /api/session) and the live SSE path. The old guard only corrected the compressor's cached context_length when it exactly equalled the global config cap; values seeded from a different model's window (e.g., 168k from claude-opus-4.5 while the agent is now on a 1M-window model) bypassed the check and surfaced incorrect context windows on every done event.

  • Broadened guard: the new code always resolves the real per-model context window (using the same _context_length_lookup_inputs_for_model + get_model_context_length helpers that hydration already uses), and replaces the compressor's cached value whenever they disagree, eliminating the "refresh shows 1M, send drops to 168k" symptom.
  • Performance preserved: the resolution still runs at most once per stream via _real_ctx_cache, so per-tick cost is unchanged; a TypeError path provides backwards-compatibility for older hermes-agent builds that only accept the 2-arg form of get_model_context_length.

Confidence Score: 4/5

Safe to merge. The fix correctly aligns the streaming SSE path with the hydration path and eliminates the 'refresh shows 1M, send drops to 168k' regression without touching any frontend code.

The broadened guard logic is sound: it resolves the real per-model window once per stream via the same helpers that hydration uses, applies the correction only on a genuine mismatch, and rescales threshold_tokens proportionally. The two findings are both non-blocking quality concerns — an overly broad TypeError catch that could mask future signature mismatches in helper calls, and a redundant re-read of the compressor's context_length that duplicates the earlier _cc_cl_u capture. Neither affects correctness for any known input.

Only api/streaming.py changed. The except TypeError block around lines 5793-5803 is worth a second look to ensure future changes to _context_length_lookup_inputs_for_model's signature don't silently fall back to the legacy 2-arg path.

Important Files Changed

| Filename | Overview |
| --- | --- |
| api/streaming.py | Core streaming logic updated: stale-compressor guard broadened to correct any model-window mismatch, not just exact matches against the global cap. Resolution is cached once per stream for performance. Logic is sound with one minor scope concern in the TypeError handler. |

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant FE as Frontend
    participant SR as streaming.py (_run_agent_streaming)
    participant CC as ContextCompressor
    participant RTE as routes._context_length_lookup_inputs_for_model
    participant MM as agent.model_metadata.get_model_context_length
    participant Cache as _real_ctx_cache

    FE->>SR: Send message → new stream opens
    SR->>Cache: Initialize [None]

    loop Metering tick
        SR->>CC: read context_length (_cc_cl_u)
        alt Cache is None (first tick only)
            SR->>RTE: _cli_u(model, provider, base_url, api_key, cfg)
            RTE-->>SR: _ContextLengthLookupInputs (_lk_u)
            SR->>MM: get_model_context_length(model, base_url, api_key, config_context_length, ...)
            MM-->>SR: _real_u (real per-model window)
            alt "_real_u != _cc_cl_u (mismatch)"
                SR->>Cache: store _real_u
            else values match
                SR->>Cache: store 0 (no correction needed)
            end
        end
        alt "Cache[0] > 0 (correction cached)"
            SR->>SR: "override context_length = cache[0]"
            SR->>SR: rescale threshold_tokens proportionally
        else no correction
            SR->>SR: use compressor's raw context_length
        end
        SR-->>FE: SSE done event with corrected context_length
    end

Comments Outside Diff (1)

  1. api/streaming.py, line 5813 (link)

    P2 Redundant re-read of context_length from the compressor

    _orig_cc_cl is assigned getattr(_cc, 'context_length', 0) or 0 here, which reads the same attribute that was already captured as _cc_cl_u at line 5723. Within a single synchronous metering tick both reads return the same object, so this is only a minor redundancy — but naming them differently (_cc_cl_u vs _orig_cc_cl) makes the code harder to reason about and could cause confusion if the compressor becomes mutable from another thread between the two reads. Using _cc_cl_u directly (already in scope) here and in the ratio _orig_thresh * _real_ctx_cache[0] / _orig_cc_cl would make the intent clearer.

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Reviews (1): Last reviewed commit: "fix(context): broaden streaming stale-co..." | Re-trigger Greptile

Comment thread api/streaming.py
Comment on lines +5793 to +5803
except TypeError:
# Older hermes-agent: legacy 2-arg form.
try:
from agent.model_metadata import get_model_context_length as _g_u
_real_u = _g_u(
_sm_u,
getattr(_agent, 'base_url', '') or '',
config_context_length=None,
provider=getattr(_agent, 'provider', '') or '',
) or 0
if _real_u:
from agent.model_metadata import get_model_context_length as _g2_u
_real_u = _g2_u(_sm_u, _base_u) or 0
if _real_u and _real_u != _cc_cl_u:
_resolved_real = _real_u
except Exception:
pass
except Exception:
pass

P2 TypeError catch scope is wider than the legacy-compat intent

The except TypeError block is designed to catch get_model_context_length being called with keyword arguments it doesn't recognise on an older hermes-agent build. However, the guarded try block also calls _cli_u(...) (_context_length_lookup_inputs_for_model) and _gc_u() (get_config). A TypeError raised by either of those — e.g. if _context_length_lookup_inputs_for_model's signature changes and the call site isn't updated — would be silently rerouted to the legacy path, producing a less-accurate 2-arg lookup without any visible signal of the root cause. Consider wrapping just the _g_u(...) call in its own try/except TypeError so that errors from the routing helper are caught by the outer except Exception instead.
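
One possible shape for the narrower handler is sketched below. It is not the code under review: `resolve_window`, `build_inputs`, and `lookup` are hypothetical stand-ins for the streaming block, the routing helper, and get_model_context_length, and the legacy signature is assumed to take only (model, base_url).

```python
from typing import Callable, Optional


def resolve_window(
    build_inputs: Callable[[], dict],
    lookup: Callable[..., Optional[int]],
    model: str,
) -> int:
    """Resolve a window; only the lookup call itself may fall back on TypeError."""
    try:
        # A TypeError raised here is NOT treated as "legacy build"; it reaches
        # the outer handler instead of silently taking the 2-arg path.
        inputs = build_inputs()
        try:
            return lookup(model, inputs["base_url"], api_key=inputs["api_key"]) or 0
        except TypeError:
            # Older signature, assumed to accept only (model, base_url).
            return lookup(model, inputs["base_url"]) or 0
    except Exception:
        return 0  # real code would likely log before giving up


def legacy_lookup(model: str, base_url: str) -> int:
    return 168_000


def broken_inputs() -> dict:
    raise TypeError("helper signature changed")


print(resolve_window(lambda: {"base_url": "", "api_key": ""}, legacy_lookup, "m"))  # 168000
print(resolve_window(broken_inputs, legacy_lookup, "m"))                            # 0
```

A TypeError raised inside the body of the lookup, rather than by its signature, would still trigger the fallback, so this narrows the scope without eliminating the ambiguity entirely.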

@nesquena-hermes

Collaborator

Pulled the branch (3beb18e9) and read the broadened guard against origin/master plus the hydration path it's trying to align with. The fix is correct and the parity argument in the description holds up under inspection.

The divergence the PR closes is real

On master the guard only corrected the exact-equal case:

if _cl_u > 0 and _cc_cl_u == _cl_u and _def_u and _sm_u and not _mmcd_u(...):

A compressor seeded with a different model's window (the 168k claude-opus-4.5 value on a 1M opus-4.8 session) is != model.context_length, so it fell straight through to every done event. Meanwhile GET /api/session hydration resolves the correct window via _resolve_context_length_for_session_model (routes.py:5132). That's exactly the "refresh shows 1M, send-a-message drops to 168k" split.

Parity with hydration is exact, which is the important part

The new streaming resolution (streaming.py:5768-5793) and the hydration resolver (routes.py:5153-5167) now call through identical inputs:

_lk_u = _cli_u(_sm_u, _prov_u, base_url=_base_u, api_key=_key_u, cfg=...)
_real_u = _g_u(_sm_u, _lk_u.base_url, api_key=_lk_u.api_key,
               config_context_length=_lk_u.config_context_length,
               provider=_lk_u.provider or _prov_u or '',
               custom_providers=_lk_u.custom_providers) or 0

Same helper (_context_length_lookup_inputs_for_model), same config_context_length/provider/custom_providers forwarding, same TypeError legacy 2-arg fallback. So the SSE path and GET /api/session land on the identical value — the description's claim that hand-reading flat model.context_length would reintroduce a 936k-vs-1M mismatch is right, because that field is None under nested per-model config. Reusing the helper is the correct call. I confirmed the agent-side signature matches: get_model_context_length(model, base_url, api_key, config_context_length, provider, custom_providers) in agent/model_metadata.py:1613.

Per-stream caching is intact

_real_ctx_cache = [None] is initialized once per _run_agent_streaming invocation (streaming.py:5646), and the resolution only runs when _real_ctx_cache[0] is None, so the config read + metadata lookup happens at most once per stream — the perf regression that froze non-default-model streams (per-tick resolution) stays fixed.
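
The once-per-stream pattern can be illustrated with a small closure. This is a sketch under assumptions, not the streaming code: `make_tick` and `resolve` are hypothetical names, and storing 0 for "resolved, nothing to fix" follows the sequence diagram in the bot summary.

```python
from typing import Callable, List, Optional


def make_tick(resolve: Callable[[], int], compressor_cl: int) -> Callable[[], int]:
    """Return a per-tick function that resolves the real window at most once."""
    real_ctx_cache: List[Optional[int]] = [None]  # None means "not resolved yet"

    def tick() -> int:
        if real_ctx_cache[0] is None:
            real = resolve() or 0
            # Store the correction, or 0 to record "resolved, nothing to fix".
            real_ctx_cache[0] = real if real and real != compressor_cl else 0
        return real_ctx_cache[0] or compressor_cl

    return tick


lookups = []


def slow_resolve() -> int:
    lookups.append(1)  # stands in for the config read plus metadata lookup
    return 1_000_000


tick = make_tick(slow_resolve, compressor_cl=168_000)
print([tick(), tick(), tick()])  # [1000000, 1000000, 1000000]
print(len(lookups))              # 1
```

Because the cache is a one-element list created per call to `make_tick`, each stream gets its own slot and later ticks never repeat the lookup.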

Two non-blocking notes for the maintainer

  1. Behavioral asymmetry vs hydration (intentional, worth knowing): the hydration resolver returns the resolved window unconditionally and replaces the persisted snapshot. The streaming guard applies it only when _real_u != _cc_cl_u (the _resolved_real=0 else-path leaves the compressor value untouched). That's correct — equal means nothing to fix — just flagging that the two paths "agree on the value" but differ on "when they overwrite."

  2. First-tick lookup shares the network-probe path. get_model_context_length's resolution order can hit the active-endpoint /models probe or the Anthropic /v1/models API for a model whose window isn't in the persistent cache (step 1) or config (step 0). Capped once-per-stream this is fine, and in practice the same session almost always resolved this model during hydration already (populating the persistent cache), so the first metering tick is a cache hit. Not a new cost class — hydration pays the same — but if a cold custom-endpoint model ever lands here, the first tick could block briefly on the probe.

Verification

Backend-only, single file, no test added. The description says it was verified end-to-end on a live opus-4.8/copilot turn (168k → 1.0M, no revert, threshold rescaled). Given the source-snapshot pattern for this area, a small unit test asserting "compressor holds window A, agent.model resolves window B != A → _usage['context_length'] surfaces B and threshold_tokens rescales by B/A" would lock in the broadened branch and the threshold-rescale math (_orig_thresh * real / _orig_cc_cl at :5817). Looks merge-ready otherwise.
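
Such a test might look roughly like the following. It is a behavioural sketch only: `build_usage` is a hypothetical stand-in for the snapshot logic (the real code is inline in _run_agent_streaming), the compressor is faked with SimpleNamespace, and integer rescaling is assumed.

```python
from types import SimpleNamespace


def build_usage(compressor, real_window: int) -> dict:
    """Hypothetical stand-in: surface the real window and rescale the threshold."""
    window = compressor.context_length
    threshold = compressor.threshold_tokens
    if real_window and real_window != window:
        if window > 0:
            threshold = threshold * real_window // window  # rescale by B/A
        window = real_window
    return {"context_length": window, "threshold_tokens": threshold}


def test_broadened_guard_surfaces_real_window():
    # Compressor holds window A (168k); the agent's model resolves to B (1M).
    cc = SimpleNamespace(context_length=168_000, threshold_tokens=142_800)
    usage = build_usage(cc, real_window=1_000_000)
    assert usage["context_length"] == 1_000_000
    assert usage["threshold_tokens"] == 850_000


def test_matching_window_is_left_alone():
    cc = SimpleNamespace(context_length=1_000_000, threshold_tokens=850_000)
    usage = build_usage(cc, real_window=1_000_000)
    assert usage == {"context_length": 1_000_000, "threshold_tokens": 850_000}
```

Both functions follow pytest naming, so they would be collected if dropped into a test module alongside a real import of the logic under test.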

@nesquena-hermes nesquena-hermes added the size:M Medium PR (≤10 files, ≤250 LOC) label Jun 21, 2026
nesquena-hermes pushed a commit that referenced this pull request Jun 21, 2026
nesquena-hermes added a commit that referenced this pull request Jun 21, 2026
…l switch (#4618) (#4628)

* stage #4618 (allenliang2022): broaden streaming stale-compressor guard to any model-window mismatch + #4618 regression tests + CHANGELOG

Broadens #3256's default-only live-usage guard: the streaming SSE snapshot now
always resolves the real per-model window via the same helper GET /api/session
hydration uses (_context_length_lookup_inputs_for_model + get_model_context_length)
and corrects whenever it differs from the compressor's cached value, with a
TypeError fallback to the legacy 2-arg form. Fixes 'refresh shows 1M, send reverts
to stale 168k + early auto-compress' on model-switched sessions. Per-stream cache
preserved (one lookup/stream). Code byte-identical to PR head 3beb18e.

Adds 4 source-structure regression tests (RED-proven on master).

Co-authored-by: allenliang2022 <allenliang2022@users.noreply.github.com>

* fix #4618 gate findings: profile-scoped config (Codex cross-profile) + #4248 256k acceptance gate (Opus downward-clobber)

Codex SHIP-WITH-FIXES: live-snapshot used ambient get_config() which in the
detached streaming worker resolves the process-global/default profile (#3294) ->
for a non-default profile pinning a different per-model context_length it would
surface the WRONG profile's window. Now resolves via get_config_for_profile_home
on the session's own profile home (mirrors the worker's _cfg resolution).

Opus SHIP-WITH-FIXES: broadened guard aligned resolution w/ hydration but not its
#4248 acceptance gate -> a transient low-confidence 256k metadata probe could
clobber a LARGER cached window mid-stream. Now reuses the exact hydration helper
_should_accept_session_context_length_refresh on both modern + legacy paths.

+ regression tests for both. Co-authored-by: allenliang2022 <allenliang2022@users.noreply.github.com>

* fix #4618 Codex re-gate findings: broaden save + SSE-done stale-compressor guards too

Codex re-gate found the broadened live-snapshot guard fixed metering but the two
SIBLING paths still used the old default-only exact-cap test:
  - api/streaming.py final session-save: persisted stale other-model window (168k)
    to s.context_length -> wrong window on reload.
  - api/streaming.py terminal done SSE: emitted stale window -> indicator REVERTS
    on stream end (messages.js overwrites S.lastUsage) = the exact 'send reverts to
    168k' symptom.
Both now resolve the real per-model window via the same hydration helper and honor
the #4248 acceptance gate (no 256k downward-clobber), with legacy 2-arg fallback.
This is the root-cause completion across all 3 paths (live/save/SSE-done).

+ 2 regression tests. Co-authored-by: allenliang2022 <allenliang2022@users.noreply.github.com>

* docs(#4618): note model_changed-omission rationale (Opus) + broaden CHANGELOG to all-3-paths

* Release v0.51.561 — Release TT (context-window indicator stays correct after model switch; #4618)

---------

Co-authored-by: nesquena-hermes <agent@nesquena-hermes>
Co-authored-by: allenliang2022 <allenliang2022@users.noreply.github.com>
@nesquena-hermes

Collaborator

Shipped in v0.51.561 🚀 — thank you @allenliang2022.

Your fix broadening the stale-compressor guard landed, and on review we extended the same correction to the two sibling paths that surfaced the window (the final session save and the terminal done SSE payload) so the indicator can't revert at stream-end or on reload either. The correction resolves the real per-model window through the same helper GET /api/session hydration uses, honors the #4248 256k-clobber protection, and reads the session's own profile config (not the ambient default) in the detached streaming worker.

Review trail: Codex regression gate SAFE TO SHIP, Opus advisor SAFE to ship, full suite 9965 passed, plus 10 new regression tests pinning the broadening + acceptance gate + profile-scoped config across all three paths.

Du7chManiac pushed a commit to TheCouchCoder-com/hermes-webui that referenced this pull request Jun 22, 2026
Release v0.51.561 — context-window indicator stays correct after model switch (nesquena#4618)

# Conflicts:
#	CHANGELOG.md