
Distributed mesh LLM: ensemble-of-experts inference engine #54

@jeremymanning

Description


This is the largest remaining feature — the distributed mesh LLM described in the whitepaper and issue #27. The mesh LLM is an inter-model Mixture-of-Experts system where GPU donor nodes each run a small language model, a distributed router selects K-of-N experts per token, and the system self-prompts to improve the cluster.

This issue supersedes #27 and provides the detailed implementation breakdown.

Architecture (from whitepaper)

  • Each GPU donor runs a complete small model (LLaMA-3-8B at 4-bit quantization, ~4-6GB VRAM)
  • Distributed router selects K-of-N expert nodes per output token
  • Each expert returns its top-256 (token_id, logit) pairs (~1.5KB; see the sketch below), a 99%+ bandwidth reduction versus dense logits
  • Router aggregates sparse logit distributions to produce next token
  • At K=4, 100ms latency: ~3.2 tokens/second (adequate for autonomous agents, not interactive chat)
  • All nodes standardize on the LLaMA-3 tokenizer (128,256-token vocabulary)
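
A minimal sketch of the payload arithmetic behind the ~1.5KB figure above. The struct and names are illustrative, and the encoding (4-byte token id plus 2-byte half-precision logit) is an assumption chosen to be consistent with the stated numbers:

```rust
/// Illustrative wire entry for one (token_id, logit) pair. A u32 id is
/// assumed because the 128,256-token LLaMA-3 vocabulary overflows u16;
/// the logit is assumed to travel as raw f16 bits.
#[derive(Clone, Copy)]
struct SparseLogit {
    token_id: u32,       // index into the shared vocabulary
    logit_f16_bits: u16, // IEEE-754 half-precision logit
}

const TOP_K_LOGITS: usize = 256;
const VOCAB_SIZE: usize = 128_256;

fn main() {
    // 256 entries x (4-byte id + 2-byte logit) = 1,536 bytes (~1.5KB).
    let sparse_bytes = TOP_K_LOGITS * (4 + 2);
    // A dense f32 distribution would be 128,256 x 4 = 513,024 bytes.
    let dense_bytes = VOCAB_SIZE * 4;
    // 1,536 / 513,024 is roughly 0.3% of the dense size, i.e. the 99%+
    // bandwidth reduction claimed above.
    println!(
        "sparse: {sparse_bytes} B, dense: {dense_bytes} B ({:.2}% of dense)",
        100.0 * sparse_bytes as f64 / dense_bytes as f64
    );
}
```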

Three uses (from #27)

  1. Resource: continually growing/improving language model, free to everyone
  2. Self-improvement: guides development and improvements of the network itself
  3. Security: carries out regular security audits and spot checks

Fractal scaling (from #27)

  • Max intelligence: ALL nodes → one "super" LLM
  • Intermediate: resources allocated by problem complexity
  • Minimum: single modest-hardware node as simple model

Components (from spec Phase 9, T111-T119)

  1. Router (src/agent/mesh_llm/router.rs): K-of-N expert selection per token, LLaMA-3 tokenizer
  2. Expert node (src/agent/mesh_llm/expert.rs): registration, health tracking, capacity reporting
  3. Aggregator (src/agent/mesh_llm/aggregator.rs): sparse logit aggregation, weighted average, sampling (sketched after this list)
  4. Self-prompting loop (src/agent/mesh_llm/self_prompt.rs): autonomous agent generating improvement tasks
  5. Agent subsetting (src/agent/mesh_llm/subset.rs): independent parallel agent subsets for concurrent tasks
  6. Safety system (src/agent/mesh_llm/safety.rs): action tier classification, governance kill switch
  7. gRPC service (proto/mesh_llm.proto): RegisterExpert, GetRouterStatus, SubmitSelfTask, HaltMesh
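
To make component 3 concrete, here is a minimal aggregation sketch (a hypothetical shape for aggregator.rs, not the actual implementation; treating tokens outside an expert's top-256 as contributing nothing is an assumption, and greedy decoding stands in for the sampling step):

```rust
use std::collections::HashMap;

/// Weighted merge of sparse logit distributions from K experts, then a
/// greedy pick. `experts` pairs each expert's router-assigned weight
/// with its top-256 (token_id, logit) list.
fn aggregate(experts: &[(f32, Vec<(u32, f32)>)]) -> Option<u32> {
    let mut combined: HashMap<u32, f32> = HashMap::new();
    for (weight, logits) in experts {
        for &(token_id, logit) in logits {
            *combined.entry(token_id).or_insert(0.0) += weight * logit;
        }
    }
    // Normalizing by the total weight would not change the argmax, so
    // the weighted sums are compared directly.
    combined
        .into_iter()
        .max_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(token_id, _)| token_id)
}
```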

Action tiers (from whitepaper)

Tier          Examples                                   Approval
Read-only     Analyze metrics, generate reports          None
Suggest       Draft config changes, governance motions   Human review
Sandbox-test  A/B experiment on 1% of traffic            Automated validation
Deploy-minor  Update non-critical config                 2-of-3 governance quorum
Deploy-major  Change scheduler algorithm                 Full governance vote + 24h review
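
A hedged sketch of how this table might map onto component 6's classifier; the enum and method are illustrative, with ordering derived so escalation checks can compare tiers:

```rust
/// Action tiers from the table above, ordered least to most privileged.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum ActionTier {
    ReadOnly,
    Suggest,
    SandboxTest,
    DeployMinor,
    DeployMajor,
}

impl ActionTier {
    /// Approval required before an action at this tier may execute.
    fn required_approval(self) -> &'static str {
        match self {
            ActionTier::ReadOnly => "none",
            ActionTier::Suggest => "human review",
            ActionTier::SandboxTest => "automated validation",
            ActionTier::DeployMinor => "2-of-3 governance quorum",
            ActionTier::DeployMajor => "full governance vote + 24h review",
        }
    }
}
```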

Phased rollout

Phase  Nodes        Capability
0-1    0-500        Centralized model; read-only + suggest only
2      ~280-1,000   Distributed ensemble; sandbox-test after 30-day stability
3      ~1,000       3-7 parallel domain streams; deploy-minor
4      ~5,000+      37+ parallel streams; deploy-major
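
One possible encoding of the capability gate implied by this table (illustrative, reusing the ActionTier enum sketched above). Note that per the Notes section, phase transitions are enacted by governance vote, so the current phase is taken as stored state rather than derived from node count:

```rust
/// Highest action tier the mesh may exercise in each rollout phase.
fn max_allowed_tier(phase: u8) -> ActionTier {
    match phase {
        0 | 1 => ActionTier::Suggest, // centralized; read-only + suggest only
        2 => ActionTier::SandboxTest, // after 30-day stability
        3 => ActionTier::DeployMinor,
        _ => ActionTier::DeployMajor, // phase 4 and beyond
    }
}
```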

Requirements

  • Router model with K-of-N expert selection (see the routing sketch after this list)
  • Sparse logit aggregation (top-256 logits per expert)
  • Expert node registration and health monitoring
  • Self-prompting autonomous agent loop (1-24 hour cycle)
  • Action tier classification with safety enforcement
  • Governance kill switch (cannot be overridden by mesh itself)
  • gRPC service for mesh management
  • Support for heterogeneous GPU hardware (different model sizes/fine-tunes, same tokenizer)
  • Graceful degradation below 280 nodes (fall back to centralized model)
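
A sketch combining the K-of-N selection and graceful-degradation requirements above. The 280-node threshold comes from the list; ranking healthy experts by reported spare capacity is a placeholder policy, since the actual router is expected to be a learned model:

```rust
struct Expert {
    id: u64,
    healthy: bool,       // from health monitoring
    spare_capacity: f32, // from the expert's capacity report
}

enum Route {
    Experts(Vec<u64>),   // dispatch this token to these K experts
    CentralizedFallback, // mesh below minimum size; use centralized model
}

fn route_token(pool: &[Expert], k: usize) -> Route {
    let mut healthy: Vec<&Expert> = pool.iter().filter(|e| e.healthy).collect();
    if healthy.len() < 280 {
        // Requirement: graceful degradation below 280 nodes.
        return Route::CentralizedFallback;
    }
    healthy.sort_by(|a, b| b.spare_capacity.total_cmp(&a.spare_capacity));
    Route::Experts(healthy.iter().take(k).map(|e| e.id).collect())
}
```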

Success Criteria

  • Router selects K-of-N experts and dispatches in parallel
  • Sparse logit aggregation produces coherent text
  • Expert registration and health tracking functional
  • Self-prompting loop generates actionable improvement tasks
  • Action tier classification correctly gates operations
  • Governance kill switch immediately halts all inference
  • gRPC service exposes all management operations
  • 3.2+ tokens/second at K=4, 100ms inter-node latency
  • Integration test: multi-node token generation via sparse aggregation

Testing (Principle V)

  • Deploy 4+ GPU nodes with LLaMA-3-8B (4-bit) → verify token generation
  • Measure tokens/second at various K values and latencies
  • Test kill switch → verify immediate halt
  • Test self-prompting loop → verify actionable output
  • Test action tier escalation → verify governance gating
  • Test with heterogeneous models (different sizes, same tokenizer)
  • Test graceful degradation with fewer than 280 nodes
  • Bandwidth measurement: verify <2KB per expert per token
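
The bandwidth item could begin as a pure size check before any on-wire measurement; this assumes the 6-byte entry encoding sketched under Architecture and ignores gRPC framing overhead:

```rust
#[test]
fn per_expert_payload_stays_under_2kb() {
    let entry_bytes = 4 + 2;         // u32 token id + f16 logit bits
    let payload = 256 * entry_bytes; // top-256 pairs = 1,536 bytes
    assert!(payload < 2048, "per-expert per-token payload must stay under 2KB");
}
```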

Notes

This is a major undertaking that should be broken into sub-tasks during planning. The phased rollout means Phase 0-1 (centralized model, read-only) can ship first, with distributed ensemble features enabled at each phase transition via governance vote.

References:

  • Whitepaper: §Mesh LLM: Distributed Self-Improvement
  • Issue #27 (Explore and implement distributed LLM): parallel_mesh_of_diffusers_whitepaper.pdf
  • research/09-mesh-llm.md
  • research/10-prior-art-distributed-inference.md
