fix(orchestrator): tools invisible to discovery — never memoize a partial tool listing by mgoldsborough · Pull Request #361 · NimbleBrainInc/nimblebrain

mgoldsborough · 2026-06-02T04:14:04Z

A healthy bundle's tools could become invisible to nb__search / nb__manage_tools for minutes, even though the bundle reported healthy.

Root cause

The cross-workspace tool-list cache memoized whatever the per-workspace lister returned — including a partial listing produced when a source was skipped mid-enumeration because it wasn't ready yet (cold start, subprocess restart, pending auth). That partial/empty snapshot then stuck — served to discovery — until some unrelated invalidation (the workspace.json FS watch, a membership change) happened to fire. The cached union predated the source's readiness, and nothing re-listed it.

Fix part 1 — never memoize a partial listing (`d6b710e`)

Make the cache correct by construction: a listing carries a complete flag (false iff a source was skipped), and neither cache layer memoizes an incomplete one.

WorkspaceToolLister returns { tools, complete }; the production lister sets complete=false whenever it skips a not-ready source.
getWorkspaceListing memoizes only a complete listing; a partial one is returned to the caller but left uncached, so the next ask re-lists.
getUnionForIdentity memoizes the per-identity union only when every contributing workspace listing was complete; concurrent first-askers share one in-flight fan-out.

This mirrors the existing identity-tools "don't memoize empty" guard (getIdentityDescriptors). It self-heals on the next discovery call once sources are up, with no dependence on an invalidation event firing.

Fix part 2 — don't lose an invalidation that lands mid-compute (`f64fd14`)

Part 1 introduced a subtle window of its own, caught in review. Pre-PR, the union promise was registered in identityUnions before awaiting and deleted on invalidation — so an invalidateWorkspace firing mid-compute self-healed (the pending entry was dropped; the next ask recomputed and re-subscribed). Part 1 changed the pattern to "set identityUnions only after resolution, and only if complete" — which reopened a clobber window: an invalidation fires mid-compute → clears the workspace's subscribedIdentities and deletes the not-yet-memoized union (a no-op) → the compute resolves and writes a stale union with a now-broken watcher subscription. That's the exact "stale until an unrelated event" failure this PR exists to kill, one window over.

Part 2 closes it with generation/epoch guards so memoization is refused when an invalidation landed during the compute (the in-flight maps stay — they handle correct complete signaling to concurrent callers; the epoch is orthogonal):

Per-workspace: WorkspaceWatchEntry.generation, bumped in invalidateWorkspace (which now also clears listingInFlight). getWorkspaceListing captures it before the lister and memoizes only if unchanged.
Per-identity: an identityEpochs counter, bumped in invalidateWorkspace (per subscribed identity) and invalidateIdentity. getUnionForIdentity captures it before the compute and memoizes only if unchanged — so a superseded union is left uncached and the next ask rebuilds and re-subscribes.

This restores the pre-PR self-heal without reintroducing the concurrent-wrong-complete race the in-flight maps fix.

Tests

Cold-start partial (d6b710e): a partial listing re-lists until complete, then memoizes — regression guard for the invisible-tools outage.
Mid-compute invalidation (f64fd14): an invalidateWorkspace fired while a union compute is in flight → the next ask re-lists (no stale memo) and the identity is still invalidated by a subsequent change (proving re-subscription). Verified red without the epoch guard.

Existing aggregator + watch + registry-invalidation suites updated for the new lister shape and pass unchanged in behavior. Full verify green locally: static checks, 3300 unit + 666 integration tests, 0 failures.

Scope note: the loop-brake (stopping an agent that re-searches for a tool that genuinely isn't there) was split out of this PR. It's defense-in-depth that this fix's trigger no longer needs urgently, and it's being built properly on MCP _meta rather than a structuredContent marker. Tracked separately.

The cross-workspace tool-list cache memoized whatever the per-workspace lister returned — including a partial listing produced when a source was skipped mid-enumeration because it wasn't ready yet (cold start, subprocess restart, pending auth). That partial/empty snapshot then stuck — served to nb__search and nb__manage_tools — until some unrelated invalidation (the workspace.json FS watch, a membership change) happened to fire. Net effect: a bundle's tools stayed invisible to discovery for minutes even though the bundle reported healthy, because the cached union predated its readiness. Make the cache correct by construction: a listing carries a `complete` flag (false iff a source was skipped), and NEITHER cache layer memoizes an incomplete one. - WorkspaceToolLister returns { tools, complete }; the production lister sets complete=false whenever it skips a not-ready source. - getWorkspaceListing memoizes only a complete listing; a partial one is returned to the caller but left uncached, so the next ask re-lists. - getUnionForIdentity memoizes the per-identity union only when every contributing workspace listing was complete; concurrent first-askers still share one in-flight fan-out. Mirrors the existing identity-tools "don't memoize empty" guard (getIdentityDescriptors) — same principle, now applied to workspace tools. Self-heals on the next discovery call once sources are up; no dependence on an invalidation event firing. Adds a regression test.

QA caught a regression introduced by this PR's own refactor: moving cache memoization from synchronous to post-await reopened the sticky-stale window. If invalidateWorkspace(wsA) fires while a union compute is in flight, it clears wsA's subscribedIdentities and deletes the not-yet-memoized union (a no-op), then the compute resolves complete=true and memoizes a union whose wsA slice predates the change — with the identity no longer subscribed to wsA's watcher. Future wsA changes never invalidate it: "stale until an unrelated invalidation," the exact failure this PR exists to kill, one window over. Fix: generation/epoch guards so memoization is refused when an invalidation landed during the compute. - WorkspaceWatchEntry.generation: bumped in invalidateWorkspace (which now also clears listingInFlight); getWorkspaceListing captures it before the lister and only memoizes if unchanged. - Per-identity epoch (identityEpochs): bumped in invalidateWorkspace (per subscribed identity) and invalidateIdentity; getUnionForIdentity captures it before compute and only memoizes if unchanged — leaving the union uncached so the next ask rebuilds AND re-subscribes. Restores the pre-PR self-heal (synchronous set used to delete the pending entry on invalidation) without the concurrent-wrong-complete race the in-flight maps fix. Also: drop the now-dead getWorkspaceTools wrapper (zero callers post-refactor) and tighten the union cache-hit comment (now both complete AND fresh). Adds a regression test (verified red without the guard): an invalidation mid-compute -> next ask re-lists, and the identity is still invalidated by a subsequent change (proving re-subscription).

Reviewer flagged it as the lone monotonic map. Document the deliberate choice: growth is bounded (distinct identities per process, one small int each) and a guarded reap would trade a correctness-sensitive conditional for kilobytes — a plain delete resets to 0 and can false-match an in-flight compute's captured epoch, memoizing a result an invalidation should reject.

…ing results (#363) * feat(engine): trip the loop supervisor on repeated non-advancing results When a tool keeps reporting no progress while the model varies its input — the canonical case being nb__search called with a fresh query each turn, each returning a distinct "No tools matched <query>" — the supervisor's (toolName, isError, content[, input]) fingerprint never repeats, so it never trips and the run burns iterations re-searching for a capability that isn't there. Input-aware success fingerprinting makes this structural: a varied input reads as progress. Add a third "stuck" axis. A tool may mark a result non-advancing; the supervisor collapses consecutive non-advancing results from one tool to a single canonical fingerprint, tripping after N regardless of how input or content varied. The tool is then disabled for the run and the model is nudged to surface the gap instead of searching on. The signal rides in MCP `_meta` (the blessed channel for metadata-about-a- result) under a reverse-DNS key (ai.nimblebrain/non-advancing) — not in structuredContent (the tool's data) or a bespoke top-level field (dropped at the boundary). To make it reachable: - ToolResult gains a `_meta` slot mirroring CallToolResult._meta. - The in-process tool boundary forwards result `_meta` onto the wire; McpSource reads result-level `_meta` back on both the inline and task paths. `_meta` is a loose object per the MCP schema, so arbitrary reverse-DNS keys survive the round-trip — verified by a test through a real in-process MCP transport. - nb__search's no-match branch sets the key. Any tool, in-process or bundle, opts in the same way. Read by key, not folded into the hash, so the input-aware success path is untouched (varied-input real work still never trips). Composes with the input-aware success fingerprint (#324): that fixed false trips on progress; this fixes missed trips on no-progress. Follow-up to #361, which removed the cache bug that triggered the original incident. * refactor(engine): clarify non-advancing trip wording + _meta forwarding doc Review polish on PR #363: - Supervisor synth directive now says "made no progress N times in a row" instead of "returned the same result N times" — accurate across all three trip modes (the non-advancing path varies result text, so "same result" was imprecise for it). - Document where _meta forwarding lives: the two serialization boundaries (defineInProcessApp, McpSource); a direct ToolSource carries _meta natively. Answers "does delegate/upjack need a forwarding one-liner?" in the contract — no, they don't own a boundary, so the channel is already general. --------- Co-authored-by: Mathew Goldsborough <1759329+mgoldsborough@users.noreply.github.com>

mgoldsborough force-pushed the fix/tool-discovery-stale-union-and-loop-brake branch from 7b4af28 to d6b710e Compare June 2, 2026 06:08

mgoldsborough changed the title ~~fix: tools invisible to discovery (stale tool-list cache) + unbounded discovery loop~~ fix(orchestrator): tools invisible to discovery — never memoize a partial tool listing Jun 2, 2026

mgoldsborough added 2 commits June 1, 2026 20:25

mgoldsborough added the qa-reviewed QA review completed with no critical issues label Jun 2, 2026

mgoldsborough merged commit 72231d8 into main Jun 2, 2026
5 checks passed

mgoldsborough deleted the fix/tool-discovery-stale-union-and-loop-brake branch June 2, 2026 07:11

mgoldsborough mentioned this pull request Jun 2, 2026

feat(engine): loop-brake — trip the supervisor on repeated non-advancing results #363

Merged

mgoldsborough mentioned this pull request Jun 3, 2026

Add Composio connector batch: Slack, Google Calendar, GitHub, QuickBooks, Box, WhatsApp, LinkedIn #373

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(orchestrator): tools invisible to discovery — never memoize a partial tool listing#361

fix(orchestrator): tools invisible to discovery — never memoize a partial tool listing#361
mgoldsborough merged 3 commits into
mainfrom
fix/tool-discovery-stale-union-and-loop-brake

mgoldsborough commented Jun 2, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

mgoldsborough commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Root cause

Fix part 1 — never memoize a partial listing (d6b710e)

Fix part 2 — don't lose an invalidation that lands mid-compute (f64fd14)

Tests

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

mgoldsborough commented Jun 2, 2026 •

edited

Loading

Fix part 1 — never memoize a partial listing (`d6b710e`)

Fix part 2 — don't lose an invalidation that lands mid-compute (`f64fd14`)