fix(orchestrator): tools invisible to discovery — never memoize a partial tool listing#361
Merged
mgoldsborough merged 3 commits intoJun 2, 2026
Merged
Conversation
The cross-workspace tool-list cache memoized whatever the per-workspace
lister returned — including a partial listing produced when a source was
skipped mid-enumeration because it wasn't ready yet (cold start, subprocess
restart, pending auth). That partial/empty snapshot then stuck — served to
nb__search and nb__manage_tools — until some unrelated invalidation (the
workspace.json FS watch, a membership change) happened to fire. Net effect: a
bundle's tools stayed invisible to discovery for minutes even though the
bundle reported healthy, because the cached union predated its readiness.
Make the cache correct by construction: a listing carries a `complete` flag
(false iff a source was skipped), and NEITHER cache layer memoizes an
incomplete one.
- WorkspaceToolLister returns { tools, complete }; the production lister sets
complete=false whenever it skips a not-ready source.
- getWorkspaceListing memoizes only a complete listing; a partial one is
returned to the caller but left uncached, so the next ask re-lists.
- getUnionForIdentity memoizes the per-identity union only when every
contributing workspace listing was complete; concurrent first-askers still
share one in-flight fan-out.
Mirrors the existing identity-tools "don't memoize empty" guard
(getIdentityDescriptors) — same principle, now applied to workspace tools.
Self-heals on the next discovery call once sources are up; no dependence on
an invalidation event firing. Adds a regression test.
7b4af28 to
d6b710e
Compare
QA caught a regression introduced by this PR's own refactor: moving cache memoization from synchronous to post-await reopened the sticky-stale window. If invalidateWorkspace(wsA) fires while a union compute is in flight, it clears wsA's subscribedIdentities and deletes the not-yet-memoized union (a no-op), then the compute resolves complete=true and memoizes a union whose wsA slice predates the change — with the identity no longer subscribed to wsA's watcher. Future wsA changes never invalidate it: "stale until an unrelated invalidation," the exact failure this PR exists to kill, one window over. Fix: generation/epoch guards so memoization is refused when an invalidation landed during the compute. - WorkspaceWatchEntry.generation: bumped in invalidateWorkspace (which now also clears listingInFlight); getWorkspaceListing captures it before the lister and only memoizes if unchanged. - Per-identity epoch (identityEpochs): bumped in invalidateWorkspace (per subscribed identity) and invalidateIdentity; getUnionForIdentity captures it before compute and only memoizes if unchanged — leaving the union uncached so the next ask rebuilds AND re-subscribes. Restores the pre-PR self-heal (synchronous set used to delete the pending entry on invalidation) without the concurrent-wrong-complete race the in-flight maps fix. Also: drop the now-dead getWorkspaceTools wrapper (zero callers post-refactor) and tighten the union cache-hit comment (now both complete AND fresh). Adds a regression test (verified red without the guard): an invalidation mid-compute -> next ask re-lists, and the identity is still invalidated by a subsequent change (proving re-subscription).
Reviewer flagged it as the lone monotonic map. Document the deliberate choice: growth is bounded (distinct identities per process, one small int each) and a guarded reap would trade a correctness-sensitive conditional for kilobytes — a plain delete resets to 0 and can false-match an in-flight compute's captured epoch, memoizing a result an invalidation should reject.
mgoldsborough
added a commit
that referenced
this pull request
Jun 2, 2026
…ing results (#363) * feat(engine): trip the loop supervisor on repeated non-advancing results When a tool keeps reporting no progress while the model varies its input — the canonical case being nb__search called with a fresh query each turn, each returning a distinct "No tools matched <query>" — the supervisor's (toolName, isError, content[, input]) fingerprint never repeats, so it never trips and the run burns iterations re-searching for a capability that isn't there. Input-aware success fingerprinting makes this structural: a varied input reads as progress. Add a third "stuck" axis. A tool may mark a result non-advancing; the supervisor collapses consecutive non-advancing results from one tool to a single canonical fingerprint, tripping after N regardless of how input or content varied. The tool is then disabled for the run and the model is nudged to surface the gap instead of searching on. The signal rides in MCP `_meta` (the blessed channel for metadata-about-a- result) under a reverse-DNS key (ai.nimblebrain/non-advancing) — not in structuredContent (the tool's data) or a bespoke top-level field (dropped at the boundary). To make it reachable: - ToolResult gains a `_meta` slot mirroring CallToolResult._meta. - The in-process tool boundary forwards result `_meta` onto the wire; McpSource reads result-level `_meta` back on both the inline and task paths. `_meta` is a loose object per the MCP schema, so arbitrary reverse-DNS keys survive the round-trip — verified by a test through a real in-process MCP transport. - nb__search's no-match branch sets the key. Any tool, in-process or bundle, opts in the same way. Read by key, not folded into the hash, so the input-aware success path is untouched (varied-input real work still never trips). Composes with the input-aware success fingerprint (#324): that fixed false trips on progress; this fixes missed trips on no-progress. Follow-up to #361, which removed the cache bug that triggered the original incident. * refactor(engine): clarify non-advancing trip wording + _meta forwarding doc Review polish on PR #363: - Supervisor synth directive now says "made no progress N times in a row" instead of "returned the same result N times" — accurate across all three trip modes (the non-advancing path varies result text, so "same result" was imprecise for it). - Document where _meta forwarding lives: the two serialization boundaries (defineInProcessApp, McpSource); a direct ToolSource carries _meta natively. Answers "does delegate/upjack need a forwarding one-liner?" in the contract — no, they don't own a boundary, so the channel is already general. --------- Co-authored-by: Mathew Goldsborough <1759329+mgoldsborough@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
A healthy bundle's tools could become invisible to
nb__search/nb__manage_toolsfor minutes, even though the bundle reported healthy.Root cause
The cross-workspace tool-list cache memoized whatever the per-workspace lister returned — including a partial listing produced when a source was skipped mid-enumeration because it wasn't ready yet (cold start, subprocess restart, pending auth). That partial/empty snapshot then stuck — served to discovery — until some unrelated invalidation (the
workspace.jsonFS watch, a membership change) happened to fire. The cached union predated the source's readiness, and nothing re-listed it.Fix part 1 — never memoize a partial listing (
d6b710e)Make the cache correct by construction: a listing carries a
completeflag (false iff a source was skipped), and neither cache layer memoizes an incomplete one.WorkspaceToolListerreturns{ tools, complete }; the production lister setscomplete=falsewhenever it skips a not-ready source.getWorkspaceListingmemoizes only a complete listing; a partial one is returned to the caller but left uncached, so the next ask re-lists.getUnionForIdentitymemoizes the per-identity union only when every contributing workspace listing was complete; concurrent first-askers share one in-flight fan-out.This mirrors the existing identity-tools "don't memoize empty" guard (
getIdentityDescriptors). It self-heals on the next discovery call once sources are up, with no dependence on an invalidation event firing.Fix part 2 — don't lose an invalidation that lands mid-compute (
f64fd14)Part 1 introduced a subtle window of its own, caught in review. Pre-PR, the union promise was registered in
identityUnionsbefore awaiting and deleted on invalidation — so aninvalidateWorkspacefiring mid-compute self-healed (the pending entry was dropped; the next ask recomputed and re-subscribed). Part 1 changed the pattern to "setidentityUnionsonly after resolution, and only if complete" — which reopened a clobber window: an invalidation fires mid-compute → clears the workspace'ssubscribedIdentitiesand deletes the not-yet-memoized union (a no-op) → the compute resolves and writes a stale union with a now-broken watcher subscription. That's the exact "stale until an unrelated event" failure this PR exists to kill, one window over.Part 2 closes it with generation/epoch guards so memoization is refused when an invalidation landed during the compute (the in-flight maps stay — they handle correct
completesignaling to concurrent callers; the epoch is orthogonal):WorkspaceWatchEntry.generation, bumped ininvalidateWorkspace(which now also clearslistingInFlight).getWorkspaceListingcaptures it before the lister and memoizes only if unchanged.identityEpochscounter, bumped ininvalidateWorkspace(per subscribed identity) andinvalidateIdentity.getUnionForIdentitycaptures it before the compute and memoizes only if unchanged — so a superseded union is left uncached and the next ask rebuilds and re-subscribes.This restores the pre-PR self-heal without reintroducing the concurrent-wrong-
completerace the in-flight maps fix.Tests
d6b710e): a partial listing re-lists until complete, then memoizes — regression guard for the invisible-tools outage.f64fd14): aninvalidateWorkspacefired while a union compute is in flight → the next ask re-lists (no stale memo) and the identity is still invalidated by a subsequent change (proving re-subscription). Verified red without the epoch guard.Existing aggregator + watch + registry-invalidation suites updated for the new lister shape and pass unchanged in behavior. Full
verifygreen locally: static checks, 3300 unit + 666 integration tests, 0 failures.Scope note: the loop-brake (stopping an agent that re-searches for a tool that genuinely isn't there) was split out of this PR. It's defense-in-depth that this fix's trigger no longer needs urgently, and it's being built properly on MCP
_metarather than astructuredContentmarker. Tracked separately.