Skip to content

fix(orchestrator): tools invisible to discovery — never memoize a partial tool listing#361

Merged
mgoldsborough merged 3 commits into
mainfrom
fix/tool-discovery-stale-union-and-loop-brake
Jun 2, 2026
Merged

fix(orchestrator): tools invisible to discovery — never memoize a partial tool listing#361
mgoldsborough merged 3 commits into
mainfrom
fix/tool-discovery-stale-union-and-loop-brake

Conversation

@mgoldsborough
Copy link
Copy Markdown
Contributor

@mgoldsborough mgoldsborough commented Jun 2, 2026

A healthy bundle's tools could become invisible to nb__search / nb__manage_tools for minutes, even though the bundle reported healthy.

Root cause

The cross-workspace tool-list cache memoized whatever the per-workspace lister returned — including a partial listing produced when a source was skipped mid-enumeration because it wasn't ready yet (cold start, subprocess restart, pending auth). That partial/empty snapshot then stuck — served to discovery — until some unrelated invalidation (the workspace.json FS watch, a membership change) happened to fire. The cached union predated the source's readiness, and nothing re-listed it.

Fix part 1 — never memoize a partial listing (d6b710e)

Make the cache correct by construction: a listing carries a complete flag (false iff a source was skipped), and neither cache layer memoizes an incomplete one.

  • WorkspaceToolLister returns { tools, complete }; the production lister sets complete=false whenever it skips a not-ready source.
  • getWorkspaceListing memoizes only a complete listing; a partial one is returned to the caller but left uncached, so the next ask re-lists.
  • getUnionForIdentity memoizes the per-identity union only when every contributing workspace listing was complete; concurrent first-askers share one in-flight fan-out.

This mirrors the existing identity-tools "don't memoize empty" guard (getIdentityDescriptors). It self-heals on the next discovery call once sources are up, with no dependence on an invalidation event firing.

Fix part 2 — don't lose an invalidation that lands mid-compute (f64fd14)

Part 1 introduced a subtle window of its own, caught in review. Pre-PR, the union promise was registered in identityUnions before awaiting and deleted on invalidation — so an invalidateWorkspace firing mid-compute self-healed (the pending entry was dropped; the next ask recomputed and re-subscribed). Part 1 changed the pattern to "set identityUnions only after resolution, and only if complete" — which reopened a clobber window: an invalidation fires mid-compute → clears the workspace's subscribedIdentities and deletes the not-yet-memoized union (a no-op) → the compute resolves and writes a stale union with a now-broken watcher subscription. That's the exact "stale until an unrelated event" failure this PR exists to kill, one window over.

Part 2 closes it with generation/epoch guards so memoization is refused when an invalidation landed during the compute (the in-flight maps stay — they handle correct complete signaling to concurrent callers; the epoch is orthogonal):

  • Per-workspace: WorkspaceWatchEntry.generation, bumped in invalidateWorkspace (which now also clears listingInFlight). getWorkspaceListing captures it before the lister and memoizes only if unchanged.
  • Per-identity: an identityEpochs counter, bumped in invalidateWorkspace (per subscribed identity) and invalidateIdentity. getUnionForIdentity captures it before the compute and memoizes only if unchanged — so a superseded union is left uncached and the next ask rebuilds and re-subscribes.

This restores the pre-PR self-heal without reintroducing the concurrent-wrong-complete race the in-flight maps fix.

Tests

  • Cold-start partial (d6b710e): a partial listing re-lists until complete, then memoizes — regression guard for the invisible-tools outage.
  • Mid-compute invalidation (f64fd14): an invalidateWorkspace fired while a union compute is in flight → the next ask re-lists (no stale memo) and the identity is still invalidated by a subsequent change (proving re-subscription). Verified red without the epoch guard.

Existing aggregator + watch + registry-invalidation suites updated for the new lister shape and pass unchanged in behavior. Full verify green locally: static checks, 3300 unit + 666 integration tests, 0 failures.


Scope note: the loop-brake (stopping an agent that re-searches for a tool that genuinely isn't there) was split out of this PR. It's defense-in-depth that this fix's trigger no longer needs urgently, and it's being built properly on MCP _meta rather than a structuredContent marker. Tracked separately.

The cross-workspace tool-list cache memoized whatever the per-workspace
lister returned — including a partial listing produced when a source was
skipped mid-enumeration because it wasn't ready yet (cold start, subprocess
restart, pending auth). That partial/empty snapshot then stuck — served to
nb__search and nb__manage_tools — until some unrelated invalidation (the
workspace.json FS watch, a membership change) happened to fire. Net effect: a
bundle's tools stayed invisible to discovery for minutes even though the
bundle reported healthy, because the cached union predated its readiness.

Make the cache correct by construction: a listing carries a `complete` flag
(false iff a source was skipped), and NEITHER cache layer memoizes an
incomplete one.

- WorkspaceToolLister returns { tools, complete }; the production lister sets
  complete=false whenever it skips a not-ready source.
- getWorkspaceListing memoizes only a complete listing; a partial one is
  returned to the caller but left uncached, so the next ask re-lists.
- getUnionForIdentity memoizes the per-identity union only when every
  contributing workspace listing was complete; concurrent first-askers still
  share one in-flight fan-out.

Mirrors the existing identity-tools "don't memoize empty" guard
(getIdentityDescriptors) — same principle, now applied to workspace tools.
Self-heals on the next discovery call once sources are up; no dependence on
an invalidation event firing. Adds a regression test.
@mgoldsborough mgoldsborough force-pushed the fix/tool-discovery-stale-union-and-loop-brake branch from 7b4af28 to d6b710e Compare June 2, 2026 06:08
@mgoldsborough mgoldsborough changed the title fix: tools invisible to discovery (stale tool-list cache) + unbounded discovery loop fix(orchestrator): tools invisible to discovery — never memoize a partial tool listing Jun 2, 2026
QA caught a regression introduced by this PR's own refactor: moving cache
memoization from synchronous to post-await reopened the sticky-stale window.
If invalidateWorkspace(wsA) fires while a union compute is in flight, it
clears wsA's subscribedIdentities and deletes the not-yet-memoized union (a
no-op), then the compute resolves complete=true and memoizes a union whose wsA
slice predates the change — with the identity no longer subscribed to wsA's
watcher. Future wsA changes never invalidate it: "stale until an unrelated
invalidation," the exact failure this PR exists to kill, one window over.

Fix: generation/epoch guards so memoization is refused when an invalidation
landed during the compute.

- WorkspaceWatchEntry.generation: bumped in invalidateWorkspace (which now
  also clears listingInFlight); getWorkspaceListing captures it before the
  lister and only memoizes if unchanged.
- Per-identity epoch (identityEpochs): bumped in invalidateWorkspace (per
  subscribed identity) and invalidateIdentity; getUnionForIdentity captures it
  before compute and only memoizes if unchanged — leaving the union uncached
  so the next ask rebuilds AND re-subscribes.

Restores the pre-PR self-heal (synchronous set used to delete the pending
entry on invalidation) without the concurrent-wrong-complete race the
in-flight maps fix.

Also: drop the now-dead getWorkspaceTools wrapper (zero callers post-refactor)
and tighten the union cache-hit comment (now both complete AND fresh).

Adds a regression test (verified red without the guard): an invalidation
mid-compute -> next ask re-lists, and the identity is still invalidated by a
subsequent change (proving re-subscription).
Reviewer flagged it as the lone monotonic map. Document the deliberate
choice: growth is bounded (distinct identities per process, one small int
each) and a guarded reap would trade a correctness-sensitive conditional for
kilobytes — a plain delete resets to 0 and can false-match an in-flight
compute's captured epoch, memoizing a result an invalidation should reject.
@mgoldsborough mgoldsborough added the qa-reviewed QA review completed with no critical issues label Jun 2, 2026
@mgoldsborough mgoldsborough merged commit 72231d8 into main Jun 2, 2026
5 checks passed
@mgoldsborough mgoldsborough deleted the fix/tool-discovery-stale-union-and-loop-brake branch June 2, 2026 07:11
mgoldsborough added a commit that referenced this pull request Jun 2, 2026
…ing results (#363)

* feat(engine): trip the loop supervisor on repeated non-advancing results

When a tool keeps reporting no progress while the model varies its input — the
canonical case being nb__search called with a fresh query each turn, each
returning a distinct "No tools matched <query>" — the supervisor's
(toolName, isError, content[, input]) fingerprint never repeats, so it never
trips and the run burns iterations re-searching for a capability that isn't
there. Input-aware success fingerprinting makes this structural: a varied
input reads as progress.

Add a third "stuck" axis. A tool may mark a result non-advancing; the
supervisor collapses consecutive non-advancing results from one tool to a
single canonical fingerprint, tripping after N regardless of how input or
content varied. The tool is then disabled for the run and the model is nudged
to surface the gap instead of searching on.

The signal rides in MCP `_meta` (the blessed channel for metadata-about-a-
result) under a reverse-DNS key (ai.nimblebrain/non-advancing) — not in
structuredContent (the tool's data) or a bespoke top-level field (dropped at
the boundary). To make it reachable:

- ToolResult gains a `_meta` slot mirroring CallToolResult._meta.
- The in-process tool boundary forwards result `_meta` onto the wire; McpSource
  reads result-level `_meta` back on both the inline and task paths. `_meta` is
  a loose object per the MCP schema, so arbitrary reverse-DNS keys survive the
  round-trip — verified by a test through a real in-process MCP transport.
- nb__search's no-match branch sets the key. Any tool, in-process or bundle,
  opts in the same way.

Read by key, not folded into the hash, so the input-aware success path is
untouched (varied-input real work still never trips).

Composes with the input-aware success fingerprint (#324): that fixed false
trips on progress; this fixes missed trips on no-progress. Follow-up to #361,
which removed the cache bug that triggered the original incident.

* refactor(engine): clarify non-advancing trip wording + _meta forwarding doc

Review polish on PR #363:
- Supervisor synth directive now says "made no progress N times in a row"
  instead of "returned the same result N times" — accurate across all three
  trip modes (the non-advancing path varies result text, so "same result"
  was imprecise for it).
- Document where _meta forwarding lives: the two serialization boundaries
  (defineInProcessApp, McpSource); a direct ToolSource carries _meta natively.
  Answers "does delegate/upjack need a forwarding one-liner?" in the contract
  — no, they don't own a boundary, so the channel is already general.

---------

Co-authored-by: Mathew Goldsborough <1759329+mgoldsborough@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

qa-reviewed QA review completed with no critical issues

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant