fix(context): broaden streaming stale-compressor guard to any model-window mismatch#4618
Conversation
…indow mismatch The live-usage snapshot's nesquena#3256 default-only guard only corrected the compressor's cached context_length when it exactly equalled the config global cap (model.context_length). A compressor left holding a *different* model's window — e.g. a claude-opus-4.8 (1M) session whose compressor was seeded/last-updated with claude-opus-4.5's 168k — does not satisfy that == check, so the stale 168k passed straight through to every 'done' event. Symptom: refresh (GET /api/session hydration) shows the correct 1M, but sending a message reverts the indicator to 168k, and the auto-compress marker fires far too early. Broaden the guard: always resolve the real per-model window for the agent's CURRENT model and surface it whenever it differs from the compressor's cached value. Resolution reuses the SAME helper hydration uses (routes._context_length_lookup_inputs_for_model + get_model_context_length) so the streaming/SSE path and GET /api/session land on the IDENTICAL value, honoring nested per-model config overrides (model.<provider>.models.<model>.context_length) and custom-provider keys. Reusing the helper (instead of hand-reading the flat top-level model.context_length, which is None under nested config) avoids a new 'refresh 1M / send-a-message 936k' mismatch. The per-stream _real_ctx_cache (resolve at most once per stream, not per metering tick) is preserved. Backend-only, no frontend changes. Verified end-to-end on a real opus-4.8/copilot turn: indicator 168.0k -> 1.0M and no longer reverts on subsequent messages.
|
| Filename | Overview |
|---|---|
| api/streaming.py | Core streaming logic updated: stale-compressor guard broadened to correct any model-window mismatch, not just exact matches against the global cap. Resolution is cached once per stream for performance. Logic is sound with one minor scope concern in the TypeError handler. |
Sequence Diagram
%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
participant FE as Frontend
participant SR as streaming.py (_run_agent_streaming)
participant CC as ContextCompressor
participant RTE as routes._context_length_lookup_inputs_for_model
participant MM as agent.model_metadata.get_model_context_length
participant Cache as _real_ctx_cache
FE->>SR: Send message → new stream opens
SR->>Cache: Initialize [None]
loop Metering tick
SR->>CC: read context_length (_cc_cl_u)
alt Cache is None (first tick only)
SR->>RTE: _cli_u(model, provider, base_url, api_key, cfg)
RTE-->>SR: _ContextLengthLookupInputs (_lk_u)
SR->>MM: get_model_context_length(model, base_url, api_key, config_context_length, ...)
MM-->>SR: _real_u (real per-model window)
alt "_real_u != _cc_cl_u (mismatch)"
SR->>Cache: store _real_u
else values match
SR->>Cache: store 0 (no correction needed)
end
end
alt "Cache[0] > 0 (correction cached)"
SR->>SR: "override context_length = cache[0]"
SR->>SR: rescale threshold_tokens proportionally
else no correction
SR->>SR: use compressor's raw context_length
end
SR-->>FE: SSE done event with corrected context_length
end
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
participant FE as Frontend
participant SR as streaming.py (_run_agent_streaming)
participant CC as ContextCompressor
participant RTE as routes._context_length_lookup_inputs_for_model
participant MM as agent.model_metadata.get_model_context_length
participant Cache as _real_ctx_cache
FE->>SR: Send message → new stream opens
SR->>Cache: Initialize [None]
loop Metering tick
SR->>CC: read context_length (_cc_cl_u)
alt Cache is None (first tick only)
SR->>RTE: _cli_u(model, provider, base_url, api_key, cfg)
RTE-->>SR: _ContextLengthLookupInputs (_lk_u)
SR->>MM: get_model_context_length(model, base_url, api_key, config_context_length, ...)
MM-->>SR: _real_u (real per-model window)
alt "_real_u != _cc_cl_u (mismatch)"
SR->>Cache: store _real_u
else values match
SR->>Cache: store 0 (no correction needed)
end
end
alt "Cache[0] > 0 (correction cached)"
SR->>SR: "override context_length = cache[0]"
SR->>SR: rescale threshold_tokens proportionally
else no correction
SR->>SR: use compressor's raw context_length
end
SR-->>FE: SSE done event with corrected context_length
end
Comments Outside Diff (1)
-
api/streaming.py, line 5813 (link)Redundant re-read of
context_lengthfrom the compressor_orig_cc_clis assignedgetattr(_cc, 'context_length', 0) or 0here, which reads the same attribute that was already captured as_cc_cl_uat line 5723. Within a single synchronous metering tick both reads return the same object, so this is only a minor redundancy — but naming them differently (_cc_cl_uvs_orig_cc_cl) makes the code harder to reason about and could cause confusion if the compressor becomes mutable from another thread between the two reads. Using_cc_cl_udirectly (already in scope) here and in the ratio_orig_thresh * _real_ctx_cache[0] / _orig_cc_clwould make the intent clearer.Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Reviews (1): Last reviewed commit: "fix(context): broaden streaming stale-co..." | Re-trigger Greptile
| except TypeError: | ||
| # Older hermes-agent: legacy 2-arg form. | ||
| try: | ||
| from agent.model_metadata import get_model_context_length as _g_u | ||
| _real_u = _g_u( | ||
| _sm_u, | ||
| getattr(_agent, 'base_url', '') or '', | ||
| config_context_length=None, | ||
| provider=getattr(_agent, 'provider', '') or '', | ||
| ) or 0 | ||
| if _real_u: | ||
| from agent.model_metadata import get_model_context_length as _g2_u | ||
| _real_u = _g2_u(_sm_u, _base_u) or 0 | ||
| if _real_u and _real_u != _cc_cl_u: | ||
| _resolved_real = _real_u | ||
| except Exception: | ||
| pass | ||
| except Exception: | ||
| pass |
There was a problem hiding this comment.
TypeError catch scope is wider than the legacy-compat intent
The except TypeError block is designed to catch get_model_context_length being called with keyword arguments it doesn't recognise on an older hermes-agent build. However, the guarded try block also calls _cli_u(...) (_context_length_lookup_inputs_for_model) and _gc_u() (get_config). A TypeError raised by either of those — e.g. if _context_length_lookup_inputs_for_model's signature changes and the call site isn't updated — would be silently rerouted to the legacy path, producing a less-accurate 2-arg lookup without any visible signal of the root cause. Consider wrapping just the _g_u(...) call in its own try/except TypeError so that errors from the routing helper are caught by the outer except Exception instead.
|
Pulled the branch ( The divergence the PR closes is realOn master the guard only corrected the exact-equal case: if _cl_u > 0 and _cc_cl_u == _cl_u and _def_u and _sm_u and not _mmcd_u(...):A compressor seeded with a different model's window (the 168k claude-opus-4.5 value on a 1M opus-4.8 session) is Parity with hydration is exact, which is the important partThe new streaming resolution ( _lk_u = _cli_u(_sm_u, _prov_u, base_url=_base_u, api_key=_key_u, cfg=...)
_real_u = _g_u(_sm_u, _lk_u.base_url, api_key=_lk_u.api_key,
config_context_length=_lk_u.config_context_length,
provider=_lk_u.provider or _prov_u or '',
custom_providers=_lk_u.custom_providers) or 0Same helper ( Per-stream caching is intact
Two non-blocking notes for the maintainer
VerificationBackend-only, single file, no test added — the description says it was verified end-to-end on a live opus-4.8/copilot turn (168k → 1.0M, no revert, threshold rescaled). Given the source-snapshot pattern for this area, a small unit asserting "compressor holds window A, agent.model resolves window B != A → |
…t after model switch; #4618)
…l switch (#4618) (#4628) * stage #4618 (allenliang2022): broaden streaming stale-compressor guard to any model-window mismatch + #4618 regression tests + CHANGELOG Broadens #3256's default-only live-usage guard: the streaming SSE snapshot now always resolves the real per-model window via the same helper GET /api/session hydration uses (_context_length_lookup_inputs_for_model + get_model_context_length) and corrects whenever it differs from the compressor's cached value, with a TypeError fallback to the legacy 2-arg form. Fixes 'refresh shows 1M, send reverts to stale 168k + early auto-compress' on model-switched sessions. Per-stream cache preserved (one lookup/stream). Code byte-identical to PR head 3beb18e. Adds 4 source-structure regression tests (RED-proven on master). Co-authored-by: allenliang2022 <allenliang2022@users.noreply.github.com> * fix #4618 gate findings: profile-scoped config (Codex cross-profile) + #4248 256k acceptance gate (Opus downward-clobber) Codex SHIP-WITH-FIXES: live-snapshot used ambient get_config() which in the detached streaming worker resolves the process-global/default profile (#3294) -> for a non-default profile pinning a different per-model context_length it would surface the WRONG profile's window. Now resolves via get_config_for_profile_home on the session's own profile home (mirrors the worker's _cfg resolution). Opus SHIP-WITH-FIXES: broadened guard aligned resolution w/ hydration but not its #4248 acceptance gate -> a transient low-confidence 256k metadata probe could clobber a LARGER cached window mid-stream. Now reuses the exact hydration helper _should_accept_session_context_length_refresh on both modern + legacy paths. + regression tests for both. Co-authored-by: allenliang2022 <allenliang2022@users.noreply.github.com> * fix #4618 Codex re-gate findings: broaden save + SSE-done stale-compressor guards too Codex re-gate found the broadened live-snapshot guard fixed metering but the two SIBLING paths still used the old default-only exact-cap test: - api/streaming.py final session-save: persisted stale other-model window (168k) to s.context_length -> wrong window on reload. - api/streaming.py terminal SSE: emitted stale window -> indicator REVERTS on stream end (messages.js overwrites S.lastUsage) = the exact 'send reverts to 168k' symptom. Both now resolve the real per-model window via the same hydration helper and honor the #4248 acceptance gate (no 256k downward-clobber), with legacy 2-arg fallback. This is the root-cause completion across all 3 paths (live/save/SSE-done). + 2 regression tests. Co-authored-by: allenliang2022 <allenliang2022@users.noreply.github.com> * docs(#4618): note model_changed-omission rationale (Opus) + broaden CHANGELOG to all-3-paths * Release v0.51.561 — Release TT (context-window indicator stays correct after model switch; #4618) --------- Co-authored-by: nesquena-hermes <agent@nesquena-hermes> Co-authored-by: allenliang2022 <allenliang2022@users.noreply.github.com>
|
Shipped in v0.51.561 🚀 — thank you @allenliang2022. Your fix broadening the stale-compressor guard landed, and on review we extended the same correction to the two sibling paths that surfaced the window (the final session save and the terminal Review trail: Codex regression gate SAFE TO SHIP, Opus advisor SAFE to ship, full suite 9965 passed, plus 10 new regression tests pinning the broadening + acceptance gate + profile-scoped config across all three paths. |
Release v0.51.561 — context-window indicator stays correct after model switch (nesquena#4618) # Conflicts: # CHANGELOG.md
Problem
The live-usage SSE snapshot (
api/streaming.py,_run_agent_streaming) emitscontext_lengthfromagent.context_compressor.context_length. The #3256 "default-only guard" only undoes a stale value when it exactly equals the config global cap (model.context_length):But a compressor can hold a different model's window, not the config cap. Concretely: a session on
claude-opus-4.8(Copilot: 1M context / 936k prompt) whose compressor was seeded/last-updated withclaude-opus-4.5's 168000 window. Since168000 != model.context_length, the==check is false and the stale 168k passes straight through to everydoneevent.User-visible symptom
GET /api/sessionhydration via_resolve_context_length_for_session_model) → shows the correct 1.0M ✅doneSSE re-pushes the compressor's stale 168k → indicator reverts to 168.0k, and "Auto-compress at X" fires far too early (142.8k = 168k×0.85).i.e. "refresh shows 1M, the moment I send a message it drops to 168k." Hydration and streaming diverge.
Fix
Broaden the guard: always resolve the real per-model window for the agent's current model and surface it whenever it differs from the compressor's cached value (
_real_u and _real_u != _cc_cl_u).Resolution reuses the same helper hydration uses —
routes._context_length_lookup_inputs_for_model+get_model_context_length— so the streaming/SSE path andGET /api/sessionresolve to the identical value. This honors nested per-model config overrides (model.<provider>.models.<model>.context_length) and custom-provider keys.Reusing the helper (instead of hand-reading the flat top-level
model.context_length, which isNoneunder nested per-model config) is deliberate: a hand-read would fall through to the live catalog and return 936k while hydration returns 1M, creating a new "refresh 1M / send 936k" mismatch.Notes
_real_ctx_cachefrom the existing perf commit is preserved — the lookup still runs at most once per stream, not per metering tick (per-tick resolution previously froze non-default-model streams).threshold_tokens * real/orig) is unchanged and now also benefits the broadened case.Verification
Verified end-to-end on a real
claude-opus-4.8/ copilot turn (source build, source WebUI): the live indicator went 168.0k → 1.0M and no longer reverts on subsequent messages; the auto-compress marker rescaled to the real window.