Skip to content

Latest commit

Β 

History

History
308 lines (260 loc) Β· 39.9 KB

File metadata and controls

308 lines (260 loc) Β· 39.9 KB

Nanobot Backlog

Single source of truth for all actionable work. ROADMAP.md = vision. This file = what to do next.


Ways of Working

The codebase has coherence debt. Features are designed but not wired end-to-end. Trio mode exists in docs and config flags but never completed a real session. Compaction exists but crashes when it matters most. Context budgets exist but aren't enforced per role. The pattern: design β†’ partial implement β†’ patch holes β†’ next feature.

New rule: every blocking item must have a verification step. "Done" means: ran a real session with local models, checked metrics, checked logs, confirmed the feature actually fired. Not "code compiles and tests pass."

When working with Claude Code on these items:

  1. One item at a time. Don't let it "also fix" adjacent things.
  2. Start with a failing test or reproduction. Show the broken state first.
  3. Verify with metrics.jsonl + nanobot.log after every change.
  4. Read the existing code before writing. Half these bugs are "feature exists but isn't called."
  5. B3-B11 are green. I7/LCM E2E verified. L1 concept router 80% accurate (2x orchestrator). QW1-QW4 shipped. Blockers remaining: B5 (experiments), B12 (config debt β€” proposal needed). Next priority: B12 proposal β†’ I9 (tiered routing) β†’ I8 (SearXNG).

Anti-patterns (from 2026-02-23 review):

  1. Stop adding router parsing fallbacks β€” 4 parsers mask model quality. Use constrained decoding instead.
  2. Stop adding config fields without removing others β€” 90+ fields = untestable. Feature-gate experiments, don't add runtime toggles.
  3. Stop building Phase 2-3 code during Phase 0 β€” write designs in docs, not code in src/.
  4. Stop using #![allow(dead_code)] at module level β€” apply per-item only.
  5. Stop mixing concerns in TurnContext β€” decompose 28-field struct into sub-structs with explicit dependencies.

Phase 0: Foundation (current)

πŸ”΄ Blocking β€” do first

  • B11: Heartbeat as foundational liveness service ⚑ β€” Priority. The current heartbeat is a glorified cron β€” shell commands + optional HEARTBEAT.md check. It needs to become the central liveness and health service that all modes (text/voice), channels (CLI/Telegram/WhatsApp), and configurations (local trio/cloud) depend on. Current state (verified 2026-02-21): Runs one hardcoded command (qmd update -c sessions), LLM callback always None (never wired), #![allow(dead_code)] on the module. Zero awareness of providers, endpoints, channels, or self-health. Health probes needed:

    Probe What Frequency Action on failure
    Provider health Ping each configured provider (/v1/models or equivalent) Every tick Mark provider unavailable, trigger fallback chain
    LCM compactor Ping compactionEndpoint.url + verify model loaded Every tick Set lcm.available = false, skip compaction gracefully, warn user
    Trio models Verify router/specialist models loaded in LM Studio Every tick (local only) Degrade to inline mode, log which role is missing
    Search backend Ping SearXNG / check Brave API key validity Every 5 ticks Disable web_search tool, surface in /status
    Channel liveness Telegram/WhatsApp/Email connection state Per-channel interval Reconnect with backoff, surface in /status
    Self-health Session size, memory pressure, disk space Every tick Trigger compaction, warn user, auto-rotate session
    Critical gap β€” LCM compaction has no pre-flight health check: When qwen3-0.6b is unreachable, Level 1+2 silently fail β†’ Level 3 deterministic fallback fires (works, but lossy). No warning, no status surfaced. If the LLM hangs (not down, just slow), the compaction spawn blocks indefinitely and in_flight never resets β€” blocking ALL future compaction for the session.
    Design principles:
    1. Channel-agnostic β€” same health loop whether CLI, Telegram, or voice mode
    2. Mode-agnostic β€” works for local trio, cloud, and hybrid configurations
    3. Graceful degradation β€” failures disable features, never crash. Compactor down β†’ LCM pauses. Router gone β†’ deterministic routing. Provider down β†’ queue + retry.
    4. Observable β€” /status REPL command shows all probe states. Status injectable into context (connects to N6).
    5. No LLM required β€” health probes are HTTP pings and process checks, not agent tasks. Existing Layer 2 (HEARTBEAT.md β†’ agent) stays as optional add-on. Implementation sketch:
    • HealthRegistry β€” register probes at startup based on config (if lcm.enabled, register LCM probe; if trio, register trio probes; etc.)
    • Each probe: name, check() β†’ Result<(), String>, interval, on_failure callback
    • HeartbeatService runs the registry on each tick, stores probe states in SystemState
    • /status command reads probe states
    • Provider/router/compaction code checks probe state before attempting calls
    • Timeout guard on compaction spawn (default 30s) β€” kill task, reset in_flight, log error Compounds with: N6 (status injection), I8 (SearXNG health), I4 (multi-provider fallback), I7 (LCM compactor availability), I9 (tiered routing needs health-aware escalation). Ref: src/heartbeat/service.rs, src/agent/agent_loop.rs (compaction spawn), src/agent/system_state.rs
  • B3: Update default local trio β€” βœ… Trio configured: Main gemma-3n-e4b-it (server.rs:18), Router nvidia_orchestrator-8b, Specialist ministral-3-8b-instruct-2512 (both in config.json trio section + B10 auto-detect as fallback). TrioConfig::default() has empty strings but runtime always populated via explicit config or auto-detect.

  • B4: Multi-model config schema β€” βœ… Obsolete as scoped. TrioConfig already provides per-role model/port/ctx_tokens/temperature/endpoint for router and specialist. LM Studio JIT-loads models on demand β€” no need for nanobot to spawn separate llama-server instances. The local.rlm slot became the specialist role; local.memory uses memory.model config.

  • B5: RLM model evaluation β€” Systematic experiments to find best RLM model per VRAM tier. Critical for "3 impossible things". See experiment plan below. Routing benchmarks started in experiments/lcm-routing/ (orchestrator_bench.py, test_bench.py).

  • B8: Trio mode activation & role-scoped context ⚑ β€” βœ… All 5 steps complete. Metrics + circuit breaker (commit 0f80ad9). Auto-activation + auto-detect as B10 (commit 3774742). E2E verified: local session β†’ delegation_mode=Trio in log β†’ Main emits natural language β†’ Router preflight intercepts β†’ Specialist executes tool. Ref: src/agent/router.rs

βœ… Codebase Health (Phases 1-4 complete, 2026-03-04)

  • Phase 1: parking_lot::Mutex (18 files), LazyLock regexes (20+ sites)
  • Phase 2: Credential redaction (custom Debug impls), removed dead needs_user_continuation
  • Phase 3: ToolError retryability + execute_with_retry() with exponential backoff
  • Phase 4: File splits β€” tool_runner.rs tests extracted, cli.rs β†’ src/cli/ module (6 files), agent_loop.rs β†’ 3 files via #[path], commands.rs β†’ 7 files via #[path]. All 4117 tests pass.
  • Phase 5: Shell injection adversarial tests (15 evasion patterns), LM Studio --jinja probe, MLX in-process provider wiring β€” MlxProvider connected to all 4 entry points (REPL, gateway, voice, rebuild). Config: inferenceEngine: "mlx". Perplexity gate auto-enabled when MLX active. 4 E2E tests (chat, perplexity, train, closed-loop). 4157 tests pass.

🟑 Important β€” do soon

  • [~] B12: Configuration debt β€” eliminate hardcoded magic values (Approach E implemented: Phases 1-3 complete, Phase 4 deferred)

    • βœ… Phase 1: ModelCapabilities registry (commit 22107ad) β€” Eliminated 7+ model name sniffing sites across tool_runner.rs, compaction.rs, agent_core.rs, thread_repair.rs, subagent.rs. Capability flags (supports_tool_calling, supports_thinking, max_reliable_output, etc.) replace model.contains("nanbeige") string matches. Config-overridable via modelCapabilities map. File: src/agent/model_capabilities.rs (348 lines, 24 tests).
    • βœ… Phase 2-3: Module-local configs (commit 22107ad) β€” CircuitBreaker, Subagent, Session, Compaction, and Hygiene tuning moved from hardcoded const values to schema-backed structs with #[serde(default)]. File: src/config/schema.rs (+215 lines). Result: ~50 hardcoded values are now configurable without touching module source. Existing config.json files unchanged (all new fields use #[serde(default)]).
    • ❌ Phase 4: Deferred β€” Lower priority; items below are I/O-coupled domain knowledge where config-driven tuning adds less value than Phases 1-3:
      • HeartbeatConfig (health probe intervals and failure thresholds)
      • PipelineTuning (step iteration limits for I9 tiered routing)
      • ProvenanceConfig (audit log extension fields)
      • Rationale: Phases 1-3 addressed 83% of the abstraction debt (model sniffing + numeric knobs). Phase 4 items can be added incrementally as those features mature.
    • Status as of: 2026-02-21, commit 22107ad
    • Design: docs/plans/b12-config-debt-elimination.md (implemented per Approach E)
    • Compounds with: B11 (heartbeat needs config-driven probes β€” Phase 4), I9 (tiered routing needs configurable thresholds β€” Phase 4), N1 (hardware auto-detection feeds profile selection).
    • Ref: src/agent/model_capabilities.rs, src/config/schema.rs
  • I0: Trio pipeline actions β€” Router can only emit ONE action per turn. Multi-step tasks (research + synthesize) fail because the router picks one tool and stops. Need pipeline-as-first-class router output + shared scratchpad between trio roles. Superseded by I9 for the routing layer β€” I0 remains relevant for the execution/scratchpad side. Ref: thoughts/shared/plans/2026-02-20-trio-pipeline-architecture.md

  • I9: Tiered routing with orchestrator escalation ⚑ β€” Priority. The L1 concept router (all-MiniLM-L6-v2, centroid classification) proved 80% accurate at 5ms/0 VRAM vs orchestrator-8b's 43% at 637ms/6GB. But rigid template matching (L2) caps at ~7 predefined multi-step patterns. Real workflows need 10-100+ steps with dynamic re-planning, conditional branching, and error recovery β€” templates can't express this. Architecture: Three-tier routing with orchestrator escalation:

    Tier Engine Latency When Traffic %
    T1: Concept Router Embedding centroid (CPU) ~5ms Unambiguous single-action queries ~70%
    T2: Template Expander Embedding β†’ template match ~10ms Known multi-step patterns (L2 templates) ~15%
    T3: Orchestrator Reasoning LLM (nemotron-orchestrator-8b or nanbeige) ~600ms Complex/novel/failing workflows, low-confidence T1 ~15%
    Escalation triggers (T1β†’T3): (a) Cosine similarity margin <0.4 (low confidence). (b) No template match for detected multi-step intent. (c) Step failure mid-execution (re-plan). (d) User query references prior context (pragmatic ambiguity).
    T3 orchestrator responsibilities: Dynamic step decomposition (not limited to templates). Mid-workflow re-planning when steps fail or return unexpected results. Conditional branching (if build fails β†’ fix errors β†’ retry). State tracking across 10-100+ steps via scratchpad. Budget/token cost monitoring.
    Key insight from L1 experiments: The 6 concept router failures were all pragmatic (hedging, vagueness, context-dependent) β€” exactly the cases where an LLM reasoning model adds value. The concept router handles the ~70% easy cases at zero cost; the orchestrator handles the ~15% hard cases where reasoning matters. This is cheaper than running the orchestrator on 100% of traffic.
    Implementation path: 1) Wire concept router into router_preflight() as fast path. 2) Add confidence threshold β€” below 0.4, escalate to LLM orchestrator. 3) Add step executor that consumes Vec<RouterDecision> from either T2 templates or T3 orchestrator. 4) Add shared scratchpad for multi-step state. 5) Add re-planning hook: when a step fails, send context + failure to T3 for new plan.
    Compounds with: I0 (pipeline execution), L1/L2 experiments (experiments/lcm-routing/), B5 (model evaluation).
    Ref: experiments/lcm-routing/results/L1_analysis.md, experiments/lcm-routing/multi_step_templates.json
  • I1: Local role/protocol crashes β€” Fix system role crash, alternation crash, orphan tool messages. Thread repair pipeline exists but needs hardening. Ref: docs/plans/local-trio-strategy-2026-02-18.md, docs/plans/local-model-reliability-tdd.md QW1 (2026-02-23): Tool guard per-tool limits shipped β€” read tools get limit 2 (tightened from 5, 2026-02-28), write tools keep default 3. Multi-file read blockage resolved.

  • I2: Non-blocking compaction β€” βœ… Absorbed into I7 (matryoshka compaction). Per-cluster parallel summarization replaces the three-tier approach.

  • I3: Context Gate β€” Replace dumb char-limit truncation with ContentGate: pass raw / structural briefing / drill-down. Zero agent-facing API changes. Partial progress: admit_with_specialist() in context_gate.rs (commit 3580c38) provides the structural briefing path via specialist LLM. Ref: docs/plans/context-gate.md, docs/plans/context-protocol.md

  • I4: Multi-provider refactor β€” Break up SwappableCore god struct, extensible provider registry, fallback chains. Ref: docs/plans/multi-provider-refactor.md, docs/plans/nanobot_architecture_review.md

  • I5: Dynamic model router β€” Prompt complexity scoring (light/standard/heavy), auto-downgrade simple messages to cheaper models. Ref: docs/plans/dynamic-model-router.md

  • I6: Context Hygiene Hooks β€” βœ… Implemented as anti_drift.rs (851 lines, 25 tests). PreCompletion: pollution scoring, turn eviction, repetitive-attempt collapse, format anchor re-injection. PostCompletion: thinking tag stripping, babble collapse. Ref: commit 56dedce, src/agent/anti_drift.rs

  • I8: SearXNG search backend β€” Replace Brave Search (API key required, rate-limited) with SearXNG (free, local, unlimited). 3 touchpoints: 1) schema.rs: add provider: String ("brave"|"searxng", default "searxng") + searxng_url: String (default "http://localhost:8888") to WebSearchConfig. 2) web.rs: add SearXNG path in execute() β€” GET {url}/search?q={query}&format=json, parse results[].title/url/content. No API key needed. 3) registry.rs+tool_wiring.rs: extend ToolConfig to carry search_provider+searxng_url. Fallback: if SearXNG unreachable and Brave key set, use Brave. If neither, helpful error: "Run docker run -d -p 8888:8080 searxng/searxng or set a Brave API key." Onboard integration: cmd_onboard() prints Docker one-liner. Optional nanobot onboard --search auto-pulls+starts+configures SearXNG. SearXNG container tested working 2026-02-20: docker run -d -p 8888:8080 --name searxng searxng/searxng + enable JSON format in settings.yml. QW2 (2026-02-23): Recovery hints added to web_search errors (status-specific: 401/422/429/5xx).

  • I7: Lossless Context Management (LCM) (supersedes matryoshka design) β€” βœ… DAG-based lossless compaction per Ehrlich & Blackman (2026). Immutable store (session JSONL) + Summary DAG with pointers to originals + active context assembly. Implemented: src/agent/lcm.rs (~1100 lines, 17 tests): SummaryDag, LcmEngine (ingest/compact/expand), three-level escalation (preserve_details β†’ bullet_points β†’ deterministic truncate), dual-threshold control loop (Ο„_soft 50% / Ο„_hard 85%). LcmSchemaConfig in config schema. Wired into agent_loop.rs. lcm_expand tool registered when LCM enabled. E2E verified (2026-02-21): Real E2E test against nemotron-nano-12b on LM Studio: 12 messages through process_direct β†’ compaction triggered at Ο„_soft β†’ Level 2 summary created β†’ DAG node with lossless source IDs β†’ expand() retrieves originals. 6 invariants checked (store lossless, active shrinks, DAG populated, source IDs resolve, Summary entries present, expand works). Benchmark across 4 models: qwen3-0.6b best compressor (83.2% compression, 3.4s), nemotron-nano-12b fastest (81.4%, 2.8s). Bigger models (gemma-3n-e4b 54.6%, qwen3-1.7b 72.8%) produce more verbose summaries β€” worse for compaction. Remaining: Performance profiling under sustained load. Verify lcm_expand actually invoked by LLM during conversation. Persist DAG across session rotations. Compounds with: I6 (anti-drift cleans within summaries). B9 (pre-flight truncation as safety net). I3 (ContentGate decides raw vs summary). Ref: src/agent/lcm.rs, src/config/schema.rs:1219

  • [~] I10: /clear and /new REPL commands β€” Manual context reset for local models with small context windows (4K-8K). (Gateway /clear shipped 2026-02-23 β€” clears working memory + history via gateway_commands.rs.) Essential for trio mode where the Main model accumulates full conversation history while Router and Specialist are already stateless (ephemeral per-turn message arrays). /clear β€” reset working context, keep session:

    1. If LCM enabled: compact all ctx.messages into a single summary node. Model starts "fresh" but can lcm_expand to retrieve originals.
    2. If LCM disabled: truncate messages, carry forward last 2 as context seed.
    3. Reset working memory section of CONTEXT.md.
    4. Emit --- context cleared (N messages compacted) --- in REPL.
    5. Session JSONL continues β€” no data lost from the audit trail. /new [name] β€” fresh session entirely:
    6. Start a new session file (new JSONL, clean slate).
    7. Optional name parameter, otherwise auto-generate.
    8. Existing session stays on disk, accessible via /sessions. Scope: Only Main model context needs reset β€” Router and Specialist are already ephemeral (verified in router.rs: tool_messages/router_messages built fresh each turn, specialist_messages built fresh each dispatch). No per-role action needed. Compounds with: I7/LCM (compaction on clear), B11 (health status after clear), N6 (status injection reset). Ref: src/agent/router.rs (trio context isolation), src/agent/agent_loop.rs (LCM wiring), src/session/manager.rs (session lifecycle)
  • I11: Extract iteration state machine from agent_loop.rs β€” agent_loop.rs is ~4,455 lines. The 5-phase state machine (Preparing/PreCall/Calling/Processing/Executing) is untestable in isolation. Create src/agent/iteration.rs with IterationPhase, IterationOutcome, StepResult enums and 5 step_* free functions taking (&TurnContext, &AgentLoopShared). Gateway runner and process_message() stay in agent_loop.rs. Ref: REVIEW-2026-02-23.md Β§1.2

  • I12: Feature-gate dead Phase 2-3 code β€” lora_bridge.rs (737 lines), step_voter.rs, confidence_gate.rs, knowledge_store.rs (681 lines) are dead code. Move behind #[cfg(feature = "experimental")]. Define Phase 0 completion criteria: trio >95% accuracy, zero tool failures, session recall works, /local hot-swap 100%. Ref: REVIEW-2026-02-23.md Β§2.3

  • I13: Integration test harness β€” Zero integration tests. Create harness running process_message with mock LLM provider. Test: trio end-to-end (router β†’ specialist β†’ main), context overflow β†’ compaction β†’ persistence, tool failure β†’ recovery hint. Ref: REVIEW-2026-02-23.md Β§2.4

  • I14: Trio decision chain live visibility β€” Trio mode has ~3s silence (router 900ms + specialist 800ms + main 1200ms). Emit status through text_delta_tx in dim text: [router] analyzing..., [specialist] processing..., [main] synthesizing.... Ref: REVIEW-2026-02-23.md Β§3.1, src/agent/router.rs, agent_loop.rs

🟒 Nice to have β€” Phase 0

  • N1: Auto hardware detection β€” Detect VRAM/RAM/CPU, auto-assign tier (Potato/Sweet/Power/Beast), select quant level. nanobot doctor command.
  • N2: nanobot setup β€” Interactive first-run: detect hardware, download models, generate optimal config.
  • N3: Streaming rewrite β€” Incremental markdown renderer, line-by-line syntax highlighting, no full-response rerender. Ref: docs/plans/streaming-rewrite.md
  • N4: Full-duplex REPL β€” ESC+ESC instant cancel, backtick injection prompt, priority message channel. Ref: docs/plans/full-duplex-repl.md
  • [~] N5: Thinking toggle β€” /think command works in REPL + gateway channels. Remaining: Ctrl+T toggle for REPL. Ref: docs/plans/thinking-toggle.md
  • N6: Status injection β€” Auto-inject background worker status into context each turn. (Spacebot idea)
  • N8: Narration stress test β€” Validate narration compliance across local models. Ref: docs/plans/narration-stress-test.md

Phase 1: Million-Token Context (next)

  • P1.1: File-backed volumes β€” MappedVolume struct with mmap + line-offset index
  • P1.2: Chunk index β€” 4K-char chunks, simhash signatures, ctx_search
  • P1.3: Semantic index β€” Optional e5-small embeddings, vector similarity
  • P1.4: Proof β€” Needle-in-haystack at 1M tokens, 95%+ recall, <60s

Phase 2: Million-Step Processes (later)

  • P2.0: Calibration run β€” Measure per-step p on 1K-10K steps using winning RLM model from E3
  • P2.1: MAKER voting β€” first_to_ahead_by_k, red-flagging, output token cap
  • P2.2: MAD decomposition β€” Atomic step definitions per domain
  • P2.3: Process tree β€” Persistent execution tree, checkpoint/resume
  • P2.4: RLM completion β€” ctx_summarize, recursive depth, smart short-circuit. Ref: docs/plans/rlm-completion-proposal.md, docs/plans/adaptive_rlm_design.md
  • P2.5: Swarm architecture β€” Workers spawn Workers, budget propagation. Ref: docs/plans/swarm-architecture.md
  • P2.6: Event log pipeline β€” Append-only JSONL, pipeline runner. Ref: docs/plans/event-log-pipeline.md
  • P2.7: Proof β€” Towers of Hanoi 20 disks, 1M+ steps, zero errors, local only

Phase 3: Self-Evolving Agent (future)

  • P3.1: Trace logger β€” Structured JSONL per process
  • P3.2: Skill crystallization β€” Auto-create skills from repeated successes
  • P3.3: Budget calibration β€” Per-task-type stats in SQLite
  • P3.4: LoRA distillation β€” Export traces β†’ Zero pipeline β†’ hot-swap LoRA

Experiment Plan: Local Trio Evaluation

Reduce assumptions one at a time. No coding until we know what works.

New Default Trio (RTX 3090, 24GB VRAM)

Role Model Size Why
Main gemma-3n-e4b-it ~4B effective Fast, good chat, small footprint
Orchestrator nvidia_orchestrator-8b 8B 10/10 routing accuracy (proven in experiments/)
Specialist ministral-3-8b-instruct-2512 8B Strong tool-calling, instruction following

What We Know

  • Nemotron Orchestrator: 10/10 routing (vs NanBeige 6/10). Purpose-built. Proven.
  • NanBeige 3B: Good with <think>\n</think>\n\n prefill, but weak as router.
  • Main + Orchestrator work well together in practice.
  • Sequential self-routing would add latency vs parallel separation. Keep roles split.
  • Router single-action bottleneck: request_strict_router_decision() returns ONE RouterDecision. Multi-step tasks (fetch 2 URLs + synthesize) cannot be expressed. The router picks one tool and the pipeline stalls. See I9 for solution.
  • Deterministic fallback too narrow: Fixed in B3.1. router_fallback.rs now has 9 patterns (research+URL, plain URL, HN, latest news, read, write, edit, list, search, exec) + default ask_user. All guarded by has_tool().
  • L1 Concept router validated (2026-02-21): all-MiniLM-L6-v2 centroid classification: 24/30 (80%) vs orchestrator-8b 13/30 (43%). 5ms vs 637ms. 0 VRAM vs 6GB. 100% on non-ambiguous queries. 5/5 multilingual. Failures are all pragmatic/vague β€” exactly the cases where LLM reasoning adds value. Data: experiments/lcm-routing/.
  • L2 Multi-step templates built: 7 templates (research_and_summarize, read_and_analyze, fetch_and_compare, search_and_update, check_and_report, plan_and_implement, verify_and_fix). Max 4 steps. Limitation: rigid patterns can't scale to 10-100+ step workflows or handle failures/branching. Orchestrator model needed for dynamic planning. See I9.
  • Specialist has no tools: dispatch_specialist() sends a single-shot chat β€” no tool access. Can synthesize given context but cannot fetch/execute. Update (commit 3580c38): Specialist now also used for content gate admission via admit_with_specialist() β€” generates structural briefings for the context gate.
  • Trio never tested end-to-end: Resolved. B8 Done entry confirms E2E verification: delegation_mode=Trio in log, Mainβ†’Routerβ†’Specialist flow completed real tasks through LM Studio. Further hardened by trio E2E test runner (commit acbc738) with failure classification and adaptive retries.
  • 2026-02-21 diagnostic: Trio mode didn't activate. NanBeige ran as Inline main with full tool schemas. 21 metrics entries show tool_calls_requested: 1, tool_calls_executed: 0 β€” model generated tool calls (proving it had tool schemas) that were blocked as duplicates. Compaction crashed twice (n_keep 12620 >= n_ctx 8192). Fixed: B8 (metrics + circuit breaker) and B9 (tool guard replay + compaction overflow) shipped. Death spiral no longer occurs. Remaining: wire trio activation so NanBeige runs in Trio mode, not Inline.
  • System prompt is ~15-20K tokens even before conversation starts. Opus first call: prompt_tokens: 21705. A 3B model with 8K context has zero room. Even with 32K context, 15K of prompt leaves only 17K for conversation β€” and most of that prompt is AGENTS.md/SOUL.md/TOOLS.md that small models can't follow anyway.
  • Metrics broken for local models β€” Fixed in B8 (commit 0f80ad9). Token counts now captured from llama.cpp usage field.

Experiments Needed (one assumption at a time)

E1: Role Evaluation Matrix

Test each candidate model in each role independently.

Model As Main As Orchestrator As Specialist As RLM
gemma-3n-e4b-it ? ? ? ?
nvidia_orchestrator-8b βœ… 10/10 routing βœ… proven ? ?
ministral-3-8b-instruct-2512 ? ? ? ?
nanbeige4.1-3b ? 6/10 ? ?

Test bench per role:

  • Main: 10 conversation tasks (chat quality, coherence, narration compliance)
  • Orchestrator: 10 routing cases (existing test suite from experiments/)
  • Specialist: 10 tool-calling tasks (file ops, exec, multi-step)
  • RLM: 5 delegation loops (multi-step file edit, research, build cycle)

E2: VRAM Profile Testing

Critical for "3 impossible things" β€” must work across hardware tiers.

Tier VRAM Trio Budget Candidate Combos
Potato 4-6 GB ~4B total 1 model does all?
Sweet 8-12 GB ~12B total 2 small models
Power 16-24 GB ~24B total Full trio (current target)
Beast 48+ GB Unlimited Bigger specialists

E3: RLM Model Shootout

The key unknown. Test candidates on delegation loop benchmarks:

  • Multi-step file edit (read β†’ plan β†’ edit β†’ verify)
  • Web research synthesis (search β†’ fetch β†’ summarize)
  • Build cycle (edit β†’ compile β†’ fix errors β†’ retry)

Metrics: completion rate, token cost, latency, error recovery.

E4: Integration Test

Once E1-E3 identify winners, run full nanobot session with the new trio. Compare against current setup on real tasks.

Experiment Order

  1. E1 first β€” know what each model can do in each role
  2. E3 next β€” find the RLM (biggest unknown)
  3. E2 then β€” scale findings across VRAM tiers
  4. E4 last β€” validate the winning combo end-to-end

Spacebot Ideas (parking lot)

Captured from spacebot. Ideas only, no code.

Idea Status Mapped to
Non-blocking compaction βœ… Absorbed into I7 (matryoshka) Phase 0
Status injection Backlog N6 Phase 0
Message coalescing Backlog N7 Phase 0
Branch concept (context-fork) Not started Phase 2 (related to swarm)
Prompt complexity routing Backlog I5 Phase 0
Memory bulletin (Cortex) Not started Phase 3 (related to memory)

Done βœ…

  • Continuous Voice Mode (Realtime Module) β€” VAD-based hands-free voice pipeline with streaming LLMβ†’TTS output and barge-in support. InputMode enum (Continuous/PushToTalk) in session.rs. Pure next_state() state machine (VoiceAgentState: Listening/Processing/Speaking) with VoiceAction dispatch. LlmProcessor trait abstracts LLM calls (production AgentLoopProcessor wraps process_direct_streaming(), MockLlmProcessor/SlowMockLlmProcessor for tests). run_voice_event_loop() β€” event-driven loop: receives RealtimeEvent, runs state machine, spawns LLM tasks with CancellationToken for barge-in. drive_llm_to_tts() bridges LLM deltas through SentenceAccumulator to TTS. start_continuous_capture() β€” AudioCapture bridge with try_send() backpressure (drop-oldest for real-time). CLI --mode flag (continuous/ptt). 27 new tests (13 voice_agent state machine + LLM + barge-in, 13 session config + capture, 1 CLI parsing), all green. 5 files changed, +875 lines. (2026-03-03, src/realtime/voice_agent.rs, src/realtime/session.rs, src/cli.rs)
  • B11: Heartbeat as foundational liveness service β€” HealthRegistry with pluggable HealthProbe trait, config-driven probe registration via build_registry(). First probe: LcmCompactionProbe (GET /health, 5s timeout, 60s interval, 3-failure degradation threshold). Critical fix: 30s timeout guard on both compaction spawns β€” in_flight always resets even on timeout/hang. Pre-flight check skips LCM compaction when endpoint degraded. Wired into HeartbeatService (Layer 0), AgentLoop, CLI, REPL. /status shows probe health with color indicators. 25+ tests (health module + TrioEndpointProbe + timeout guard), 1526 total green. (2026-02-21, commits 3bb1161, 6c71866, 1454240, src/heartbeat/health.rs, src/agent/agent_loop.rs)
  • I7: Lossless Context Management (LCM) β€” DAG-based lossless compaction. LcmEngine with three-level escalation (LLM preserve_details β†’ bullet_points β†’ deterministic truncate). Dual-threshold control loop (Ο„_soft/Ο„_hard). lcm_expand tool for lossless retrieval. E2E verified against 4 local models: qwen3-0.6b best compressor (83.2%, 3.4s), nemotron-nano-12b fastest (81.4%, 2.8s). 17 tests (4 mock E2E + 1 real E2E + 1 benchmark + 4 config + 9 unit). 1526 total green. (2026-02-21, src/agent/lcm.rs, commits 0697bd4, 9893d91, bde583f, 72b94c8)
  • B3: Update default local trio β€” Trio configured: Main gemma-3n-e4b-it, Router nvidia_orchestrator-8b, Specialist ministral-3-8b-instruct-2512. Explicit config + B10 auto-detect. (2026-02-21)
  • B3.1: Smarter deterministic fallback β€” router_fallback.rs: 9 deterministic patterns + default ask_user (was 2). Patterns: research+URLβ†’spawn researcher, plain URL/HNβ†’web_fetch, latest newsβ†’spawn, read/show+pathβ†’read_file, write/create+pathβ†’write_file, edit/modify+pathβ†’edit_file, list/lsβ†’list_dir, run/execute/cargoβ†’exec, searchβ†’web_search. All guarded by has_tool(). 19 tests. (2026-02-21, src/agent/router_fallback.rs)
  • B4: Multi-model config schema β€” Closed as obsolete. TrioConfig provides per-role model/port/endpoint. LM Studio JIT-loads models; no separate server spawning needed. (2026-02-21)
  • B8: Trio mode activation & role-scoped context β€” All 5 steps complete. Metrics + circuit breaker (commit 0f80ad9). Auto-activation + auto-detect as B10 (commit 3774742). E2E verified: delegation_mode=Trio in log, Main emits natural language, Router preflight intercepts, Specialist executes tool. (2026-02-21, src/agent/router.rs)
  • Session indexer + REPL /sessions command β€” Bridge between raw JSONL sessions (230 files, 116MB) and searchable SESSION_.md memory files. session_indexer.rs: pure extract_session_content() + index_sessions() orchestrator (extracts user+assistant messages, skips tool results, caps at 50 messages, truncates to 500 chars each). REPL: /sessions command with list/export/purge/archive/index subcommands (/ss alias). CLI: nanobot sessions index. Fixed process::exit(1) in sessions_cmd.rs for REPL safety. Updated recall tool description. E2E verified: 149 sessions indexed (6β†’155 SESSION_.md), idempotent re-run, grep finds content. 17 new tests, 1395 total green. (2026-02-21, src/agent/session_indexer.rs)
  • B10: Auto-detect trio models from LM Studio β€” pick_trio_models() scans available LMS models at startup for "orchestrator"/"router" (router) and "function-calling"/"instruct"/"ministral" (specialist) patterns. Only fills empty config slots β€” explicit config always wins. Fuzzy main-model exclusion handles org prefixes and unresolved GGUF hints. Wired into REPL startup before auto-activation. 13 tests including e2e flow and real LMS model list. (2026-02-21, commit 3774742)
  • B9: Compaction safety guard + tool guard death spiral β€” Tool guard replays cached results instead of injecting error messages small models can't parse. Compaction respects summarizer model's actual context window via compaction_model_context_size config + pre-flight truncation (0.7 safety margin). Circuit breaker threshold 3β†’2. E2E verified against NanBeige on LM Studio. (2026-02-21, commit 0f7f365)
  • B8: Metrics accuracy + tool loop circuit breaker β€” Fixed local model metrics capture (prompt_tokens, completion_tokens, elapsed_ms). Added circuit breaker for consecutive all-blocked tool call rounds. (2026-02-21, commit 0f80ad9)
  • B7: Provider retry with backon β€” Replaced 3 hand-rolled retry loops with backon crate. Shared is_retryable_provider_error() predicate. Added retry to streaming path. (2026-02-21, commit 640bdc9)
  • B6: SLM provider observability β€” 8 silent failure paths now logged. #[instrument] spans on chat()/chat_stream(). Promoted llm_call_failed to warn!. (2026-02-21, commit 0b6bc5f)
  • Fix: Audit log hash chain race condition β€” record() had a TOCTOU bug: seq allocation + prev_hash read were not serialized under the file lock. Two concurrent executors (tool_runner + inline) both read seq 940 and wrote seq 941 with the same prev_hash, forking the chain at entry 942. Fix: acquire file lock first, re-read authoritative seq + prev_hash from file under lock, then compute hash and write. 12/12 audit tests pass. (2026-02-21, commit 835cf6d, src/agent/audit.rs)
  • B1: 132 compiler warnings β†’ 0 warnings (2026-02-20)
  • B2: 2 test failures β†’ 1429 pass, 0 fail (2026-02-21)
  • Fix: Subprocess stdin steal β€” .stdin(Stdio::null()) on all 4 spawn sites in shell.rs + worker_tools.rs (2026-02-20)
  • Fix: Esc-mashing freezes REPL β€” drain_stdin() after cancel (2026-02-20, commit 57ec883)
  • Fix stale comment in ensure_compaction_model (2026-02-17)
  • Raise tool result truncation threshold (2026-02-17)
  • Document multi-session CONTEXT.md race (2026-02-17)
  • Input box disappears during streaming (2026-02-17)
  • Agent interruption too slow (2026-02-17)
  • Subagent improvements (wait, output files, budget, compaction) (2026-02-18)
  • Tool runner infinite loop fix (2026-02-18)
  • Specialist content gate, daily notes reader, auto-LCM for local β€” admit_with_specialist() in context_gate.rs: specialist LLM generates structural briefing for content gate admission. read_recent_daily_notes() in memory.rs: reads last N daily notes for context. Auto-LCM: cli.rs enables LCM automatically for local-mode sessions. TOOLS.md/IDENTITY.md workspace template auto-creation. (2026-02-21, commit 3580c38)
  • Trio resilience: top_p, health probes, circuit breaker gating β€” top_p parameter added to LLMProvider trait + all provider implementations (OpenAI-compat, Anthropic). TrioEndpointProbe health probes for router/specialist endpoints (4 new health tests). Circuit breaker gating on router_preflight() and dispatch_specialist() β€” skips trio roles when endpoint degraded. HealthRegistry injected into TurnContext for runtime health checks. /status improvements. (2026-02-21, commit 1454240)
  • Trio E2E test runner with failure classification β€” scripts/test_trio_e2e.sh: automated E2E test harness for trio mode. Failure classification (INFRA_DOWN / JIT_LOADING / TIMEOUT / MODEL_QUALITY / BUG) with adaptive retry delays. Preflight canary validates endpoint liveness before test runs. Summary reports with pass/fail counts. Auto-repair protocol for transient failures. (2026-02-21, commit acbc738)
  • QW1: Per-tool guard limits β€” Replaced blanket max_same_tool_call_per_turn: 1 with per-tool-category limits. Read-only tools (read_file, list_dir, recall, read_skill, web_search, web_fetch) get limit 2 (originally 5, tightened 2026-02-28 to block redundant identical reads β€” cache replay handles blocked calls). All other tools use default 3 (raised from 1). ToolGuard now has tool_limits: HashMap<String, u32> populated from READ_TOOLS constant. 7 tests (added test_read_tool_different_args_unlimited), 1679+ total green. (2026-02-23, updated 2026-02-28, src/agent/tool_guard.rs, src/config/schema.rs)
  • QW2: Recovery hints on tool errors β€” Added actionable Hint: suffixes to error messages in filesystem, web, and shell tools. Filesystem: file-not-found β†’ "use list_dir", permission-denied β†’ "check permissions", old_text-not-found β†’ "use read_file for exact text". Web: Brave 401/403 β†’ "check API key", 422 β†’ "subscription may be inactive", 429 β†’ "rate limited, wait", 5xx β†’ "service error". Shell: timeout β†’ "try simpler command", exec error β†’ "verify command in PATH". 9 new tests. (2026-02-23, src/agent/tools/filesystem.rs, src/agent/tools/web.rs, src/agent/tools/shell.rs)
  • Strategic review β€” Comprehensive review saved as REVIEW-2026-02-23.md. 5 architectural recommendations, 5 strategic recommendations, 5 UX recommendations, 5 quick wins, 5 anti-patterns. Recommended 4-week execution order. (2026-02-23)
  • QW3: Gateway slash commands + trio auto-detection β€” 10 slash commands (/start, /help, /status, /clear, /agents, /kill, /think, /long, /context, /memory) intercepted before LLM processing in gateway mode via gateway_commands.rs. Anti-coalesce guard prevents commands from being batched with adjacent messages. Telegram bot-mention suffix stripping (/status@my_bot β†’ /status). Trio auto-detection added to cmd_gateway() (was REPL-only). Fixed is_local hardcoded to false in gateway. (2026-02-23, src/agent/gateway_commands.rs, src/agent/agent_loop.rs, src/cli.rs)
  • QW4: Exec tool working directory + tool call dedup β€” Three-layer fix for exec tool defaulting to workspace instead of process cwd: (1) tool_wiring.rs sets exec_working_dir to process cwd, (2) registry.rs fallback chain: config β†’ process cwd β†’ workspace, (3) agent_loop.rs injects working_dir into exec calls when LLM omits it. Three-layer tool call dedup for local models emitting duplicate calls: (1) batch dedup in step_execute_tools(), (2) initial batch dedup in run_tool_loop(), (3) cross-iteration dedup via seen_calls in analyze_via_scratch_pad(). System prompt improved: "Rules" section with "ALWAYS use tools, NEVER guess" + "File Operations" section with pwd-first guidance. /m command fixed for remote LM Studio (queries API instead of scanning ~/models/). (2026-02-23, src/agent/agent_loop.rs, src/agent/tool_wiring.rs, src/agent/tools/registry.rs, src/agent/context.rs, src/repl/commands.rs)
  • N7: Message coalescing β€” Already implemented: 400ms coalesce window in agent_loop.rs:run() batches rapid messages from same session (Telegram/WhatsApp). Anti-coalesce guard added for slash commands. (Pre-existing, noted 2026-02-23)