Nanobot Backlog

Single source of truth for all actionable work. ROADMAP.md = vision. This file = what to do next.

Ways of Working

The codebase has coherence debt. Features are designed but not wired end-to-end. Trio mode exists in docs and config flags but never completed a real session. Compaction exists but crashes when it matters most. Context budgets exist but aren't enforced per role. The pattern: design → partial implement → patch holes → next feature.

New rule: every blocking item must have a verification step. "Done" means: ran a real session with local models, checked metrics, checked logs, confirmed the feature actually fired. Not "code compiles and tests pass."

When working with Claude Code on these items:

One item at a time. Don't let it "also fix" adjacent things.

Start with a failing test or reproduction. Show the broken state first.

Verify with metrics.jsonl + nanobot.log after every change.

Read the existing code before writing. Half these bugs are "feature exists but isn't called."

B3-B11 are green. I7/LCM E2E verified. L1 concept router 80% accurate (2x orchestrator). QW1-QW4 shipped. Blockers remaining: B5 (experiments), B12 (config debt — proposal needed). Next priority: B12 proposal → I9 (tiered routing) → I8 (SearXNG).

Anti-patterns (from 2026-02-23 review):

Stop adding router parsing fallbacks — 4 parsers mask model quality. Use constrained decoding instead.

Stop adding config fields without removing others — 90+ fields = untestable. Feature-gate experiments, don't add runtime toggles.

Stop building Phase 2-3 code during Phase 0 — write designs in docs, not code in src/.

Stop using #![allow(dead_code)] at module level — apply per-item only.

Stop mixing concerns in TurnContext — decompose 28-field struct into sub-structs with explicit dependencies.

Phase 0: Foundation (current)

🔴 Blocking — do first

B11: Heartbeat as foundational liveness service ⚡ — Priority. The current heartbeat is a glorified cron — shell commands + optional HEARTBEAT.md check. It needs to become the central liveness and health service that all modes (text/voice), channels (CLI/Telegram/WhatsApp), and configurations (local trio/cloud) depend on. Current state (verified 2026-02-21): Runs one hardcoded command (qmd update -c sessions), LLM callback always None (never wired), #![allow(dead_code)] on the module. Zero awareness of providers, endpoints, channels, or self-health. Health probes needed:

Probe	What	Frequency	Action on failure
Provider health	Ping each configured provider (`/v1/models` or equivalent)	Every tick	Mark provider unavailable, trigger fallback chain
LCM compactor	Ping `compactionEndpoint.url` + verify model loaded	Every tick	Set `lcm.available = false`, skip compaction gracefully, warn user
Trio models	Verify router/specialist models loaded in LM Studio	Every tick (local only)	Degrade to inline mode, log which role is missing
Search backend	Ping SearXNG / check Brave API key validity	Every 5 ticks	Disable `web_search` tool, surface in `/status`
Channel liveness	Telegram/WhatsApp/Email connection state	Per-channel interval	Reconnect with backoff, surface in `/status`
Self-health	Session size, memory pressure, disk space	Every tick	Trigger compaction, warn user, auto-rotate session
Critical gap — LCM compaction has no pre-flight health check: When qwen3-0.6b is unreachable, Level 1+2 silently fail → Level 3 deterministic fallback fires (works, but lossy). No warning, no status surfaced. If the LLM hangs (not down, just slow), the compaction spawn blocks indefinitely and `in_flight` never resets — blocking ALL future compaction for the session.
Design principles:

Channel-agnostic — same health loop whether CLI, Telegram, or voice mode
Mode-agnostic — works for local trio, cloud, and hybrid configurations
Graceful degradation — failures disable features, never crash. Compactor down → LCM pauses. Router gone → deterministic routing. Provider down → queue + retry.
Observable — /status REPL command shows all probe states. Status injectable into context (connects to N6).
No LLM required — health probes are HTTP pings and process checks, not agent tasks. Existing Layer 2 (HEARTBEAT.md → agent) stays as optional add-on. Implementation sketch:

HealthRegistry — register probes at startup based on config (if lcm.enabled, register LCM probe; if trio, register trio probes; etc.)
Each probe: name, check() → Result<(), String>, interval, on_failure callback
HeartbeatService runs the registry on each tick, stores probe states in SystemState
/status command reads probe states
Provider/router/compaction code checks probe state before attempting calls
Timeout guard on compaction spawn (default 30s) — kill task, reset in_flight, log error Compounds with: N6 (status injection), I8 (SearXNG health), I4 (multi-provider fallback), I7 (LCM compactor availability), I9 (tiered routing needs health-aware escalation). Ref: src/heartbeat/service.rs, src/agent/agent_loop.rs (compaction spawn), src/agent/system_state.rs

B3: Update default local trio — ✅ Trio configured: Main gemma-3n-e4b-it (server.rs:18), Router nvidia_orchestrator-8b, Specialist ministral-3-8b-instruct-2512 (both in config.json trio section + B10 auto-detect as fallback). TrioConfig::default() has empty strings but runtime always populated via explicit config or auto-detect.
B4: Multi-model config schema — ✅ Obsolete as scoped. TrioConfig already provides per-role model/port/ctx_tokens/temperature/endpoint for router and specialist. LM Studio JIT-loads models on demand — no need for nanobot to spawn separate llama-server instances. The local.rlm slot became the specialist role; local.memory uses memory.model config.
B5: RLM model evaluation — Systematic experiments to find best RLM model per VRAM tier. Critical for "3 impossible things". See experiment plan below. Routing benchmarks started in experiments/lcm-routing/ (orchestrator_bench.py, test_bench.py).
B8: Trio mode activation & role-scoped context ⚡ — ✅ All 5 steps complete. Metrics + circuit breaker (commit 0f80ad9). Auto-activation + auto-detect as B10 (commit 3774742). E2E verified: local session → delegation_mode=Trio in log → Main emits natural language → Router preflight intercepts → Specialist executes tool. Ref: src/agent/router.rs

✅ Codebase Health (Phases 1-4 complete, 2026-03-04)

Phase 1: parking_lot::Mutex (18 files), LazyLock regexes (20+ sites)
Phase 2: Credential redaction (custom Debug impls), removed dead needs_user_continuation
Phase 3: ToolError retryability + execute_with_retry() with exponential backoff
Phase 4: File splits — tool_runner.rs tests extracted, cli.rs → src/cli/ module (6 files), agent_loop.rs → 3 files via #[path], commands.rs → 7 files via #[path]. All 4117 tests pass.
Phase 5: Shell injection adversarial tests (15 evasion patterns), LM Studio --jinja probe, MLX in-process provider wiring — MlxProvider connected to all 4 entry points (REPL, gateway, voice, rebuild). Config: inferenceEngine: "mlx". Perplexity gate auto-enabled when MLX active. 4 E2E tests (chat, perplexity, train, closed-loop). 4157 tests pass.

🟡 Important — do soon

[~] B12: Configuration debt — eliminate hardcoded magic values (Approach E implemented: Phases 1-3 complete, Phase 4 deferred)
- ✅ Phase 1: ModelCapabilities registry (commit 22107ad) — Eliminated 7+ model name sniffing sites across tool_runner.rs, compaction.rs, agent_core.rs, thread_repair.rs, subagent.rs. Capability flags (supports_tool_calling, supports_thinking, max_reliable_output, etc.) replace model.contains("nanbeige") string matches. Config-overridable via modelCapabilities map. File: src/agent/model_capabilities.rs (348 lines, 24 tests).
- ✅ Phase 2-3: Module-local configs (commit 22107ad) — CircuitBreaker, Subagent, Session, Compaction, and Hygiene tuning moved from hardcoded const values to schema-backed structs with #[serde(default)]. File: src/config/schema.rs (+215 lines). Result: ~50 hardcoded values are now configurable without touching module source. Existing config.json files unchanged (all new fields use #[serde(default)]).
- ❌ Phase 4: Deferred — Lower priority; items below are I/O-coupled domain knowledge where config-driven tuning adds less value than Phases 1-3:
  - HeartbeatConfig (health probe intervals and failure thresholds)
  - PipelineTuning (step iteration limits for I9 tiered routing)
  - ProvenanceConfig (audit log extension fields)
  - Rationale: Phases 1-3 addressed 83% of the abstraction debt (model sniffing + numeric knobs). Phase 4 items can be added incrementally as those features mature.
- Status as of: 2026-02-21, commit 22107ad
- Design: docs/plans/b12-config-debt-elimination.md (implemented per Approach E)
- Compounds with: B11 (heartbeat needs config-driven probes — Phase 4), I9 (tiered routing needs configurable thresholds — Phase 4), N1 (hardware auto-detection feeds profile selection).
- Ref: src/agent/model_capabilities.rs, src/config/schema.rs
I0: Trio pipeline actions — Router can only emit ONE action per turn. Multi-step tasks (research + synthesize) fail because the router picks one tool and stops. Need pipeline-as-first-class router output + shared scratchpad between trio roles. Superseded by I9 for the routing layer — I0 remains relevant for the execution/scratchpad side. Ref: thoughts/shared/plans/2026-02-20-trio-pipeline-architecture.md

I9: Tiered routing with orchestrator escalation ⚡ — Priority. The L1 concept router (all-MiniLM-L6-v2, centroid classification) proved 80% accurate at 5ms/0 VRAM vs orchestrator-8b's 43% at 637ms/6GB. But rigid template matching (L2) caps at ~7 predefined multi-step patterns. Real workflows need 10-100+ steps with dynamic re-planning, conditional branching, and error recovery — templates can't express this. Architecture: Three-tier routing with orchestrator escalation:

Tier	Engine	Latency	When	Traffic %
T1: Concept Router	Embedding centroid (CPU)	~5ms	Unambiguous single-action queries	~70%
T2: Template Expander	Embedding → template match	~10ms	Known multi-step patterns (L2 templates)	~15%
T3: Orchestrator	Reasoning LLM (nemotron-orchestrator-8b or nanbeige)	~600ms	Complex/novel/failing workflows, low-confidence T1	~15%
Escalation triggers (T1→T3): (a) Cosine similarity margin <0.4 (low confidence). (b) No template match for detected multi-step intent. (c) Step failure mid-execution (re-plan). (d) User query references prior context (pragmatic ambiguity).
T3 orchestrator responsibilities: Dynamic step decomposition (not limited to templates). Mid-workflow re-planning when steps fail or return unexpected results. Conditional branching (if build fails → fix errors → retry). State tracking across 10-100+ steps via scratchpad. Budget/token cost monitoring.
Key insight from L1 experiments: The 6 concept router failures were all pragmatic (hedging, vagueness, context-dependent) — exactly the cases where an LLM reasoning model adds value. The concept router handles the ~70% easy cases at zero cost; the orchestrator handles the ~15% hard cases where reasoning matters. This is cheaper than running the orchestrator on 100% of traffic.
Implementation path: 1) Wire concept router into `router_preflight()` as fast path. 2) Add confidence threshold — below 0.4, escalate to LLM orchestrator. 3) Add step executor that consumes `Vec<RouterDecision>` from either T2 templates or T3 orchestrator. 4) Add shared scratchpad for multi-step state. 5) Add re-planning hook: when a step fails, send context + failure to T3 for new plan.
Compounds with: I0 (pipeline execution), L1/L2 experiments (`experiments/lcm-routing/`), B5 (model evaluation).
Ref: `experiments/lcm-routing/results/L1_analysis.md`, `experiments/lcm-routing/multi_step_templates.json`

I1: Local role/protocol crashes — Fix system role crash, alternation crash, orphan tool messages. Thread repair pipeline exists but needs hardening. Ref: docs/plans/local-trio-strategy-2026-02-18.md, docs/plans/local-model-reliability-tdd.md QW1 (2026-02-23): Tool guard per-tool limits shipped — read tools get limit 2 (tightened from 5, 2026-02-28), write tools keep default 3. Multi-file read blockage resolved.
I2: Non-blocking compaction — ✅ Absorbed into I7 (matryoshka compaction). Per-cluster parallel summarization replaces the three-tier approach.
I3: Context Gate — Replace dumb char-limit truncation with ContentGate: pass raw / structural briefing / drill-down. Zero agent-facing API changes. Partial progress: admit_with_specialist() in context_gate.rs (commit 3580c38) provides the structural briefing path via specialist LLM. Ref: docs/plans/context-gate.md, docs/plans/context-protocol.md
I4: Multi-provider refactor — Break up SwappableCore god struct, extensible provider registry, fallback chains. Ref: docs/plans/multi-provider-refactor.md, docs/plans/nanobot_architecture_review.md
I5: Dynamic model router — Prompt complexity scoring (light/standard/heavy), auto-downgrade simple messages to cheaper models. Ref: docs/plans/dynamic-model-router.md
I6: Context Hygiene Hooks — ✅ Implemented as anti_drift.rs (851 lines, 25 tests). PreCompletion: pollution scoring, turn eviction, repetitive-attempt collapse, format anchor re-injection. PostCompletion: thinking tag stripping, babble collapse. Ref: commit 56dedce, src/agent/anti_drift.rs
I8: SearXNG search backend — Replace Brave Search (API key required, rate-limited) with SearXNG (free, local, unlimited). 3 touchpoints: 1) schema.rs: add provider: String ("brave"|"searxng", default "searxng") + searxng_url: String (default "http://localhost:8888") to WebSearchConfig. 2) web.rs: add SearXNG path in execute() — GET {url}/search?q={query}&format=json, parse results[].title/url/content. No API key needed. 3) registry.rs+tool_wiring.rs: extend ToolConfig to carry search_provider+searxng_url. Fallback: if SearXNG unreachable and Brave key set, use Brave. If neither, helpful error: "Run docker run -d -p 8888:8080 searxng/searxng or set a Brave API key." Onboard integration: cmd_onboard() prints Docker one-liner. Optional nanobot onboard --search auto-pulls+starts+configures SearXNG. SearXNG container tested working 2026-02-20: docker run -d -p 8888:8080 --name searxng searxng/searxng + enable JSON format in settings.yml. QW2 (2026-02-23): Recovery hints added to web_search errors (status-specific: 401/422/429/5xx).
I7: Lossless Context Management (LCM) (supersedes matryoshka design) — ✅ DAG-based lossless compaction per Ehrlich & Blackman (2026). Immutable store (session JSONL) + Summary DAG with pointers to originals + active context assembly. Implemented: src/agent/lcm.rs (~1100 lines, 17 tests): SummaryDag, LcmEngine (ingest/compact/expand), three-level escalation (preserve_details → bullet_points → deterministic truncate), dual-threshold control loop (τ_soft 50% / τ_hard 85%). LcmSchemaConfig in config schema. Wired into agent_loop.rs. lcm_expand tool registered when LCM enabled. E2E verified (2026-02-21): Real E2E test against nemotron-nano-12b on LM Studio: 12 messages through process_direct → compaction triggered at τ_soft → Level 2 summary created → DAG node with lossless source IDs → expand() retrieves originals. 6 invariants checked (store lossless, active shrinks, DAG populated, source IDs resolve, Summary entries present, expand works). Benchmark across 4 models: qwen3-0.6b best compressor (83.2% compression, 3.4s), nemotron-nano-12b fastest (81.4%, 2.8s). Bigger models (gemma-3n-e4b 54.6%, qwen3-1.7b 72.8%) produce more verbose summaries — worse for compaction. Remaining: Performance profiling under sustained load. Verify lcm_expand actually invoked by LLM during conversation. Persist DAG across session rotations. Compounds with: I6 (anti-drift cleans within summaries). B9 (pre-flight truncation as safety net). I3 (ContentGate decides raw vs summary). Ref: src/agent/lcm.rs, src/config/schema.rs:1219
[~] I10: /clear and /new REPL commands — Manual context reset for local models with small context windows (4K-8K). (Gateway /clear shipped 2026-02-23 — clears working memory + history via gateway_commands.rs.) Essential for trio mode where the Main model accumulates full conversation history while Router and Specialist are already stateless (ephemeral per-turn message arrays). /clear — reset working context, keep session:
1. If LCM enabled: compact all ctx.messages into a single summary node. Model starts "fresh" but can lcm_expand to retrieve originals.
2. If LCM disabled: truncate messages, carry forward last 2 as context seed.
3. Reset working memory section of CONTEXT.md.
4. Emit --- context cleared (N messages compacted) --- in REPL.
5. Session JSONL continues — no data lost from the audit trail. /new [name] — fresh session entirely:
6. Start a new session file (new JSONL, clean slate).
7. Optional name parameter, otherwise auto-generate.
8. Existing session stays on disk, accessible via /sessions. Scope: Only Main model context needs reset — Router and Specialist are already ephemeral (verified in router.rs: tool_messages/router_messages built fresh each turn, specialist_messages built fresh each dispatch). No per-role action needed. Compounds with: I7/LCM (compaction on clear), B11 (health status after clear), N6 (status injection reset). Ref: src/agent/router.rs (trio context isolation), src/agent/agent_loop.rs (LCM wiring), src/session/manager.rs (session lifecycle)
I11: Extract iteration state machine from agent_loop.rs — agent_loop.rs is ~4,455 lines. The 5-phase state machine (Preparing/PreCall/Calling/Processing/Executing) is untestable in isolation. Create src/agent/iteration.rs with IterationPhase, IterationOutcome, StepResult enums and 5 step_* free functions taking (&TurnContext, &AgentLoopShared). Gateway runner and process_message() stay in agent_loop.rs. Ref: REVIEW-2026-02-23.md §1.2
I12: Feature-gate dead Phase 2-3 code — lora_bridge.rs (737 lines), step_voter.rs, confidence_gate.rs, knowledge_store.rs (681 lines) are dead code. Move behind #[cfg(feature = "experimental")]. Define Phase 0 completion criteria: trio >95% accuracy, zero tool failures, session recall works, /local hot-swap 100%. Ref: REVIEW-2026-02-23.md §2.3
I13: Integration test harness — Zero integration tests. Create harness running process_message with mock LLM provider. Test: trio end-to-end (router → specialist → main), context overflow → compaction → persistence, tool failure → recovery hint. Ref: REVIEW-2026-02-23.md §2.4
I14: Trio decision chain live visibility — Trio mode has ~3s silence (router 900ms + specialist 800ms + main 1200ms). Emit status through text_delta_tx in dim text: [router] analyzing..., [specialist] processing..., [main] synthesizing.... Ref: REVIEW-2026-02-23.md §3.1, src/agent/router.rs, agent_loop.rs

🟢 Nice to have — Phase 0

N1: Auto hardware detection — Detect VRAM/RAM/CPU, auto-assign tier (Potato/Sweet/Power/Beast), select quant level. nanobot doctor command.
N2: nanobot setup — Interactive first-run: detect hardware, download models, generate optimal config.
N3: Streaming rewrite — Incremental markdown renderer, line-by-line syntax highlighting, no full-response rerender. Ref: docs/plans/streaming-rewrite.md
N4: Full-duplex REPL — ESC+ESC instant cancel, backtick injection prompt, priority message channel. Ref: docs/plans/full-duplex-repl.md
[~] N5: Thinking toggle — /think command works in REPL + gateway channels. Remaining: Ctrl+T toggle for REPL. Ref: docs/plans/thinking-toggle.md
N6: Status injection — Auto-inject background worker status into context each turn. (Spacebot idea)
N8: Narration stress test — Validate narration compliance across local models. Ref: docs/plans/narration-stress-test.md

Phase 1: Million-Token Context (next)

P1.1: File-backed volumes — MappedVolume struct with mmap + line-offset index
P1.2: Chunk index — 4K-char chunks, simhash signatures, ctx_search
P1.3: Semantic index — Optional e5-small embeddings, vector similarity
P1.4: Proof — Needle-in-haystack at 1M tokens, 95%+ recall, <60s

Phase 2: Million-Step Processes (later)

P2.0: Calibration run — Measure per-step p on 1K-10K steps using winning RLM model from E3
P2.1: MAKER voting — first_to_ahead_by_k, red-flagging, output token cap
P2.2: MAD decomposition — Atomic step definitions per domain
P2.3: Process tree — Persistent execution tree, checkpoint/resume
P2.4: RLM completion — ctx_summarize, recursive depth, smart short-circuit. Ref: docs/plans/rlm-completion-proposal.md, docs/plans/adaptive_rlm_design.md
P2.5: Swarm architecture — Workers spawn Workers, budget propagation. Ref: docs/plans/swarm-architecture.md
P2.6: Event log pipeline — Append-only JSONL, pipeline runner. Ref: docs/plans/event-log-pipeline.md
P2.7: Proof — Towers of Hanoi 20 disks, 1M+ steps, zero errors, local only

Phase 3: Self-Evolving Agent (future)

P3.1: Trace logger — Structured JSONL per process
P3.2: Skill crystallization — Auto-create skills from repeated successes
P3.3: Budget calibration — Per-task-type stats in SQLite
P3.4: LoRA distillation — Export traces → Zero pipeline → hot-swap LoRA

Experiment Plan: Local Trio Evaluation

Reduce assumptions one at a time. No coding until we know what works.

New Default Trio (RTX 3090, 24GB VRAM)

Role	Model	Size	Why
Main	gemma-3n-e4b-it	~4B effective	Fast, good chat, small footprint
Orchestrator	nvidia_orchestrator-8b	8B	10/10 routing accuracy (proven in experiments/)
Specialist	ministral-3-8b-instruct-2512	8B	Strong tool-calling, instruction following

What We Know

Nemotron Orchestrator: 10/10 routing (vs NanBeige 6/10). Purpose-built. Proven.
NanBeige 3B: Good with <think>\n</think>\n\n prefill, but weak as router.
Main + Orchestrator work well together in practice.
Sequential self-routing would add latency vs parallel separation. Keep roles split.
Router single-action bottleneck: request_strict_router_decision() returns ONE RouterDecision. Multi-step tasks (fetch 2 URLs + synthesize) cannot be expressed. The router picks one tool and the pipeline stalls. See I9 for solution.
~~Deterministic fallback too narrow~~: Fixed in B3.1. router_fallback.rs now has 9 patterns (research+URL, plain URL, HN, latest news, read, write, edit, list, search, exec) + default ask_user. All guarded by has_tool().
L1 Concept router validated (2026-02-21): all-MiniLM-L6-v2 centroid classification: 24/30 (80%) vs orchestrator-8b 13/30 (43%). 5ms vs 637ms. 0 VRAM vs 6GB. 100% on non-ambiguous queries. 5/5 multilingual. Failures are all pragmatic/vague — exactly the cases where LLM reasoning adds value. Data: experiments/lcm-routing/.
L2 Multi-step templates built: 7 templates (research_and_summarize, read_and_analyze, fetch_and_compare, search_and_update, check_and_report, plan_and_implement, verify_and_fix). Max 4 steps. Limitation: rigid patterns can't scale to 10-100+ step workflows or handle failures/branching. Orchestrator model needed for dynamic planning. See I9.
Specialist has no tools: dispatch_specialist() sends a single-shot chat — no tool access. Can synthesize given context but cannot fetch/execute. Update (commit 3580c38): Specialist now also used for content gate admission via admit_with_specialist() — generates structural briefings for the context gate.
~~Trio never tested end-to-end~~: Resolved. B8 Done entry confirms E2E verification: delegation_mode=Trio in log, Main→Router→Specialist flow completed real tasks through LM Studio. Further hardened by trio E2E test runner (commit acbc738) with failure classification and adaptive retries.
2026-02-21 diagnostic: Trio mode didn't activate. NanBeige ran as Inline main with full tool schemas. 21 metrics entries show tool_calls_requested: 1, tool_calls_executed: 0 — model generated tool calls (proving it had tool schemas) that were blocked as duplicates. Compaction crashed twice (n_keep 12620 >= n_ctx 8192). Fixed: B8 (metrics + circuit breaker) and B9 (tool guard replay + compaction overflow) shipped. Death spiral no longer occurs. Remaining: wire trio activation so NanBeige runs in Trio mode, not Inline.
System prompt is ~15-20K tokens even before conversation starts. Opus first call: prompt_tokens: 21705. A 3B model with 8K context has zero room. Even with 32K context, 15K of prompt leaves only 17K for conversation — and most of that prompt is AGENTS.md/SOUL.md/TOOLS.md that small models can't follow anyway.
~~Metrics broken for local models~~ — Fixed in B8 (commit 0f80ad9). Token counts now captured from llama.cpp usage field.

Experiments Needed (one assumption at a time)

E1: Role Evaluation Matrix

Test each candidate model in each role independently.

Model	As Main	As Orchestrator	As Specialist	As RLM
gemma-3n-e4b-it	?	?	?	?
nvidia_orchestrator-8b	✅ 10/10 routing	✅ proven	?	?
ministral-3-8b-instruct-2512	?	?	?	?
nanbeige4.1-3b	?	6/10	?	?

Test bench per role:

Main: 10 conversation tasks (chat quality, coherence, narration compliance)
Orchestrator: 10 routing cases (existing test suite from experiments/)
Specialist: 10 tool-calling tasks (file ops, exec, multi-step)
RLM: 5 delegation loops (multi-step file edit, research, build cycle)

E2: VRAM Profile Testing

Critical for "3 impossible things" — must work across hardware tiers.

Tier	VRAM	Trio Budget	Candidate Combos
Potato	4-6 GB	~4B total	1 model does all?
Sweet	8-12 GB	~12B total	2 small models
Power	16-24 GB	~24B total	Full trio (current target)
Beast	48+ GB	Unlimited	Bigger specialists

E3: RLM Model Shootout

The key unknown. Test candidates on delegation loop benchmarks:

Multi-step file edit (read → plan → edit → verify)
Web research synthesis (search → fetch → summarize)
Build cycle (edit → compile → fix errors → retry)

Metrics: completion rate, token cost, latency, error recovery.

E4: Integration Test

Once E1-E3 identify winners, run full nanobot session with the new trio. Compare against current setup on real tasks.

Experiment Order

E1 first — know what each model can do in each role
E3 next — find the RLM (biggest unknown)
E2 then — scale findings across VRAM tiers
E4 last — validate the winning combo end-to-end

Spacebot Ideas (parking lot)

Captured from spacebot. Ideas only, no code.

Idea	Status	Mapped to
Non-blocking compaction	✅ Absorbed into I7 (matryoshka)	Phase 0
Status injection	Backlog N6	Phase 0
Message coalescing	Backlog N7	Phase 0
Branch concept (context-fork)	Not started	Phase 2 (related to swarm)
Prompt complexity routing	Backlog I5	Phase 0
Memory bulletin (Cortex)	Not started	Phase 3 (related to memory)

Done ✅

~~Continuous Voice Mode (Realtime Module)~~ — VAD-based hands-free voice pipeline with streaming LLM→TTS output and barge-in support. InputMode enum (Continuous/PushToTalk) in session.rs. Pure next_state() state machine (VoiceAgentState: Listening/Processing/Speaking) with VoiceAction dispatch. LlmProcessor trait abstracts LLM calls (production AgentLoopProcessor wraps process_direct_streaming(), MockLlmProcessor/SlowMockLlmProcessor for tests). run_voice_event_loop() — event-driven loop: receives RealtimeEvent, runs state machine, spawns LLM tasks with CancellationToken for barge-in. drive_llm_to_tts() bridges LLM deltas through SentenceAccumulator to TTS. start_continuous_capture() — AudioCapture bridge with try_send() backpressure (drop-oldest for real-time). CLI --mode flag (continuous/ptt). 27 new tests (13 voice_agent state machine + LLM + barge-in, 13 session config + capture, 1 CLI parsing), all green. 5 files changed, +875 lines. (2026-03-03, src/realtime/voice_agent.rs, src/realtime/session.rs, src/cli.rs)
~~B11: Heartbeat as foundational liveness service~~ — HealthRegistry with pluggable HealthProbe trait, config-driven probe registration via build_registry(). First probe: LcmCompactionProbe (GET /health, 5s timeout, 60s interval, 3-failure degradation threshold). Critical fix: 30s timeout guard on both compaction spawns — in_flight always resets even on timeout/hang. Pre-flight check skips LCM compaction when endpoint degraded. Wired into HeartbeatService (Layer 0), AgentLoop, CLI, REPL. /status shows probe health with color indicators. 25+ tests (health module + TrioEndpointProbe + timeout guard), 1526 total green. (2026-02-21, commits 3bb1161, 6c71866, 1454240, src/heartbeat/health.rs, src/agent/agent_loop.rs)
~~I7: Lossless Context Management (LCM)~~ — DAG-based lossless compaction. LcmEngine with three-level escalation (LLM preserve_details → bullet_points → deterministic truncate). Dual-threshold control loop (τ_soft/τ_hard). lcm_expand tool for lossless retrieval. E2E verified against 4 local models: qwen3-0.6b best compressor (83.2%, 3.4s), nemotron-nano-12b fastest (81.4%, 2.8s). 17 tests (4 mock E2E + 1 real E2E + 1 benchmark + 4 config + 9 unit). 1526 total green. (2026-02-21, src/agent/lcm.rs, commits 0697bd4, 9893d91, bde583f, 72b94c8)
~~B3: Update default local trio~~ — Trio configured: Main gemma-3n-e4b-it, Router nvidia_orchestrator-8b, Specialist ministral-3-8b-instruct-2512. Explicit config + B10 auto-detect. (2026-02-21)
~~B3.1: Smarter deterministic fallback~~ — router_fallback.rs: 9 deterministic patterns + default ask_user (was 2). Patterns: research+URL→spawn researcher, plain URL/HN→web_fetch, latest news→spawn, read/show+path→read_file, write/create+path→write_file, edit/modify+path→edit_file, list/ls→list_dir, run/execute/cargo→exec, search→web_search. All guarded by has_tool(). 19 tests. (2026-02-21, src/agent/router_fallback.rs)
~~B4: Multi-model config schema~~ — Closed as obsolete. TrioConfig provides per-role model/port/endpoint. LM Studio JIT-loads models; no separate server spawning needed. (2026-02-21)
~~B8: Trio mode activation & role-scoped context~~ — All 5 steps complete. Metrics + circuit breaker (commit 0f80ad9). Auto-activation + auto-detect as B10 (commit 3774742). E2E verified: delegation_mode=Trio in log, Main emits natural language, Router preflight intercepts, Specialist executes tool. (2026-02-21, src/agent/router.rs)
~~Session indexer + REPL /sessions command~~ — Bridge between raw JSONL sessions (230 files, 116MB) and searchable SESSION_.md memory files. session_indexer.rs: pure extract_session_content() + index_sessions() orchestrator (extracts user+assistant messages, skips tool results, caps at 50 messages, truncates to 500 chars each). REPL: /sessions command with list/export/purge/archive/index subcommands (/ss alias). CLI: nanobot sessions index. Fixed process::exit(1) in sessions_cmd.rs for REPL safety. Updated recall tool description. E2E verified: 149 sessions indexed (6→155 SESSION_.md), idempotent re-run, grep finds content. 17 new tests, 1395 total green. (2026-02-21, src/agent/session_indexer.rs)
~~B10: Auto-detect trio models from LM Studio~~ — pick_trio_models() scans available LMS models at startup for "orchestrator"/"router" (router) and "function-calling"/"instruct"/"ministral" (specialist) patterns. Only fills empty config slots — explicit config always wins. Fuzzy main-model exclusion handles org prefixes and unresolved GGUF hints. Wired into REPL startup before auto-activation. 13 tests including e2e flow and real LMS model list. (2026-02-21, commit 3774742)
~~B9: Compaction safety guard + tool guard death spiral~~ — Tool guard replays cached results instead of injecting error messages small models can't parse. Compaction respects summarizer model's actual context window via compaction_model_context_size config + pre-flight truncation (0.7 safety margin). Circuit breaker threshold 3→2. E2E verified against NanBeige on LM Studio. (2026-02-21, commit 0f7f365)
~~B8: Metrics accuracy + tool loop circuit breaker~~ — Fixed local model metrics capture (prompt_tokens, completion_tokens, elapsed_ms). Added circuit breaker for consecutive all-blocked tool call rounds. (2026-02-21, commit 0f80ad9)
~~B7: Provider retry with backon~~ — Replaced 3 hand-rolled retry loops with backon crate. Shared is_retryable_provider_error() predicate. Added retry to streaming path. (2026-02-21, commit 640bdc9)
~~B6: SLM provider observability~~ — 8 silent failure paths now logged. #[instrument] spans on chat()/chat_stream(). Promoted llm_call_failed to warn!. (2026-02-21, commit 0b6bc5f)
~~Fix: Audit log hash chain race condition~~ — record() had a TOCTOU bug: seq allocation + prev_hash read were not serialized under the file lock. Two concurrent executors (tool_runner + inline) both read seq 940 and wrote seq 941 with the same prev_hash, forking the chain at entry 942. Fix: acquire file lock first, re-read authoritative seq + prev_hash from file under lock, then compute hash and write. 12/12 audit tests pass. (2026-02-21, commit 835cf6d, src/agent/audit.rs)
~~B1: 132 compiler warnings~~ → 0 warnings (2026-02-20)
~~B2: 2 test failures~~ → 1429 pass, 0 fail (2026-02-21)
~~Fix: Subprocess stdin steal~~ — .stdin(Stdio::null()) on all 4 spawn sites in shell.rs + worker_tools.rs (2026-02-20)
~~Fix: Esc-mashing freezes REPL~~ — drain_stdin() after cancel (2026-02-20, commit 57ec883)
~~Fix stale comment in ensure_compaction_model~~ (2026-02-17)
~~Raise tool result truncation threshold~~ (2026-02-17)
~~Document multi-session CONTEXT.md race~~ (2026-02-17)
~~Input box disappears during streaming~~ (2026-02-17)
~~Agent interruption too slow~~ (2026-02-17)
~~Subagent improvements (wait, output files, budget, compaction)~~ (2026-02-18)
~~Tool runner infinite loop fix~~ (2026-02-18)
~~Specialist content gate, daily notes reader, auto-LCM for local~~ — admit_with_specialist() in context_gate.rs: specialist LLM generates structural briefing for content gate admission. read_recent_daily_notes() in memory.rs: reads last N daily notes for context. Auto-LCM: cli.rs enables LCM automatically for local-mode sessions. TOOLS.md/IDENTITY.md workspace template auto-creation. (2026-02-21, commit 3580c38)
~~Trio resilience: top_p, health probes, circuit breaker gating~~ — top_p parameter added to LLMProvider trait + all provider implementations (OpenAI-compat, Anthropic). TrioEndpointProbe health probes for router/specialist endpoints (4 new health tests). Circuit breaker gating on router_preflight() and dispatch_specialist() — skips trio roles when endpoint degraded. HealthRegistry injected into TurnContext for runtime health checks. /status improvements. (2026-02-21, commit 1454240)
~~Trio E2E test runner with failure classification~~ — scripts/test_trio_e2e.sh: automated E2E test harness for trio mode. Failure classification (INFRA_DOWN / JIT_LOADING / TIMEOUT / MODEL_QUALITY / BUG) with adaptive retry delays. Preflight canary validates endpoint liveness before test runs. Summary reports with pass/fail counts. Auto-repair protocol for transient failures. (2026-02-21, commit acbc738)
~~QW1: Per-tool guard limits~~ — Replaced blanket max_same_tool_call_per_turn: 1 with per-tool-category limits. Read-only tools (read_file, list_dir, recall, read_skill, web_search, web_fetch) get limit 2 (originally 5, tightened 2026-02-28 to block redundant identical reads — cache replay handles blocked calls). All other tools use default 3 (raised from 1). ToolGuard now has tool_limits: HashMap<String, u32> populated from READ_TOOLS constant. 7 tests (added test_read_tool_different_args_unlimited), 1679+ total green. (2026-02-23, updated 2026-02-28, src/agent/tool_guard.rs, src/config/schema.rs)
~~QW2: Recovery hints on tool errors~~ — Added actionable Hint: suffixes to error messages in filesystem, web, and shell tools. Filesystem: file-not-found → "use list_dir", permission-denied → "check permissions", old_text-not-found → "use read_file for exact text". Web: Brave 401/403 → "check API key", 422 → "subscription may be inactive", 429 → "rate limited, wait", 5xx → "service error". Shell: timeout → "try simpler command", exec error → "verify command in PATH". 9 new tests. (2026-02-23, src/agent/tools/filesystem.rs, src/agent/tools/web.rs, src/agent/tools/shell.rs)
~~Strategic review~~ — Comprehensive review saved as REVIEW-2026-02-23.md. 5 architectural recommendations, 5 strategic recommendations, 5 UX recommendations, 5 quick wins, 5 anti-patterns. Recommended 4-week execution order. (2026-02-23)
~~QW3: Gateway slash commands + trio auto-detection~~ — 10 slash commands (/start, /help, /status, /clear, /agents, /kill, /think, /long, /context, /memory) intercepted before LLM processing in gateway mode via gateway_commands.rs. Anti-coalesce guard prevents commands from being batched with adjacent messages. Telegram bot-mention suffix stripping (/status@my_bot → /status). Trio auto-detection added to cmd_gateway() (was REPL-only). Fixed is_local hardcoded to false in gateway. (2026-02-23, src/agent/gateway_commands.rs, src/agent/agent_loop.rs, src/cli.rs)
~~QW4: Exec tool working directory + tool call dedup~~ — Three-layer fix for exec tool defaulting to workspace instead of process cwd: (1) tool_wiring.rs sets exec_working_dir to process cwd, (2) registry.rs fallback chain: config → process cwd → workspace, (3) agent_loop.rs injects working_dir into exec calls when LLM omits it. Three-layer tool call dedup for local models emitting duplicate calls: (1) batch dedup in step_execute_tools(), (2) initial batch dedup in run_tool_loop(), (3) cross-iteration dedup via seen_calls in analyze_via_scratch_pad(). System prompt improved: "Rules" section with "ALWAYS use tools, NEVER guess" + "File Operations" section with pwd-first guidance. /m command fixed for remote LM Studio (queries API instead of scanning ~/models/). (2026-02-23, src/agent/agent_loop.rs, src/agent/tool_wiring.rs, src/agent/tools/registry.rs, src/agent/context.rs, src/repl/commands.rs)
~~N7: Message coalescing~~ — Already implemented: 400ms coalesce window in agent_loop.rs:run() batches rapid messages from same session (Telegram/WhatsApp). Anti-coalesce guard added for slash commands. (Pre-existing, noted 2026-02-23)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Nanobot Backlog

Ways of Working

Phase 0: Foundation (current)

🔴 Blocking — do first

✅ Codebase Health (Phases 1-4 complete, 2026-03-04)

🟡 Important — do soon

🟢 Nice to have — Phase 0

Phase 1: Million-Token Context (next)

Phase 2: Million-Step Processes (later)

Phase 3: Self-Evolving Agent (future)

Experiment Plan: Local Trio Evaluation

New Default Trio (RTX 3090, 24GB VRAM)

What We Know

Experiments Needed (one assumption at a time)

E1: Role Evaluation Matrix

E2: VRAM Profile Testing

E3: RLM Model Shootout

E4: Integration Test

Experiment Order

Spacebot Ideas (parking lot)

Done ✅

FilesExpand file tree

BACKLOG.md

Latest commit

History

BACKLOG.md

File metadata and controls

Nanobot Backlog

Ways of Working

Phase 0: Foundation (current)

🔴 Blocking — do first

✅ Codebase Health (Phases 1-4 complete, 2026-03-04)

🟡 Important — do soon

🟢 Nice to have — Phase 0

Phase 1: Million-Token Context (next)

Phase 2: Million-Step Processes (later)

Phase 3: Self-Evolving Agent (future)

Experiment Plan: Local Trio Evaluation

New Default Trio (RTX 3090, 24GB VRAM)

What We Know

Experiments Needed (one assumption at a time)

E1: Role Evaluation Matrix

E2: VRAM Profile Testing

E3: RLM Model Shootout

E4: Integration Test

Experiment Order

Spacebot Ideas (parking lot)

Done ✅