Refine #3317: delete shepherd + daemon brains, keep minimal spawn loop for multi-account
TL;DR
#3317 proposed deleting loom-daemon + the shepherd model entirely in favor of /loom:sweep. The framing was approximately correct but over-aggressive on one axis: the process-spawn loop is load-bearing for multi-account load balancing and cannot be deleted, even though the shepherd brain (~16.5k Python LOC) and daemon brain (~4.2k Python LOC) can. This proposal refines #3317 with the right kill list.
Net effect: ~21k LOC deleted, ~150 LOC of minimal spawn-loop added. Same order of magnitude as #3317's "10:1 simplification" claim, with one critical piece preserved.
Why this is the right cut now (not earlier)
A lot of historical investment went into making the Python shepherd robust: checkpoint/resume, judge-retry loops, milestone reporting, stuck detection, heartbeat tracking, force-mode label management, ~10 phase modules totalling 11,500 lines. That investment was correct given the tooling constraints of 2025: Claude Code's native parallel subagent dispatch was either unavailable, unreliable, or untested for orchestration workloads. Loom had to build robustness from scratch in Python because the harness couldn't be trusted to do it.
That constraint has relaxed. Claude Code now reliably dispatches parallel Task subagents at one level of depth (validated empirically up to --builders-per-wave 3 in sweep), handles streaming and tool-use without the pathologies that motivated the shell shepherd, and provides native progress reporting through its own conversation log. Sweep needs far less robustness of its own than the Python shepherd had to invent, because it delegates most of that work to a more mature harness.
In short: the Python shepherd is not overengineered. It was correctly engineered for its era. The era has changed.
The constraint that survives: multi-account
Claude Code's Task tool (subagent dispatch) inherits the parent session's OAuth token. There is no in-session mechanism to rotate accounts across subagents. Multi-account load balancing therefore requires per-task process spawn — a fresh claude invocation through spawn-claude.sh that picks a token from .loom/tokens/.ranking at process start.
A single /loom:sweep instance is hard-pinned to one account for its entire run. For users running multi-account batches (Pro/Max with rotation across 3-10 OAuth tokens), this is the deciding architectural constraint.
The implication: a thin spawn loop must survive. Its only job is to pick tokens and launch claude -p '/loom:sweep <N>' per unit of work.
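The token-picking half of that job can be pictured as a small pure function. This is a minimal sketch, not the logic in spawn-claude.sh: it assumes the ranking file holds one token name per line, best first, and that bad tokens are tracked as a set. The names parse_ranking and pick_token are hypothetical, and the real format of .loom/tokens/.ranking is not shown in this proposal.

```python
from typing import Iterable, List, Optional


def parse_ranking(text: str) -> List[str]:
    """Assumed format: one token name per line, best first; blank lines ignored."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def pick_token(ranking: Iterable[str], bad_tokens: Iterable[str]) -> Optional[str]:
    """Return the highest-ranked token that is not marked bad, or None if none is left."""
    bad = set(bad_tokens)
    for name in ranking:
        if name not in bad:
            return name
    return None
```

Because the choice happens once per process, every spawn gets a fresh decision, which is the property an in-session Task dispatch cannot offer.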
Kill list
| Component | LOC | Disposition | Notes |
|---|---|---|---|
| loom-tools/src/loom_tools/shepherd/ | ~16,500 | Delete | Sweep already implements curator/builder/judge/doctor/merge. Builder phase alone is 6,062 LOC; sweep delegates this to the loom-builder subagent. |
| loom-tools/src/loom_tools/daemon_v2/loop.py, iteration.py, actions/{completions,proposals,retry_blocked,spinning,support_roles}.py | ~3,500 | Delete | Work generation, pool scaling, support-role triggers: sweep partitions work eagerly; periodic role triggers move to GitHub Actions / cron (#3317 Gap A). |
| defaults/.claude/commands/loom/shepherd.md + shepherd-lifecycle.md (+ mirrored roles/) | ~2,140 | Delete | /shepherd is a signal-writer proxy to the daemon; dies with it. |
| agent_spawn.py | ~800 | Keep + simplify | Token selection + process launch. Strip shepherd-specific spawn paths; keep the wrapper-script integration. |
| claude-wrapper.sh, spawn-claude.sh | ~600 | Keep | Token retry on TOKEN_EXPIRED / TOKEN_EXHAUSTED; allowlist + bad-tokens handling; ranking-based selection. |
| loom-tools/src/loom_tools/daemon_v2/actions/shepherds.py (spawn path only) + signals.py + command_poller.py | ~500 | Distill into ~150 LOC spawn loop | New script picks ready issues by label, claims them, spawns claude -p '/loom:sweep <N>' per token slot. Replaces ~700 LOC of pool-management state. |
| .loom/signals/, .loom/progress/ directories | n/a | Optional keep | Useful for status visibility; sweep can write to .loom/progress/ via existing report-milestone.sh. |
| loom-daemon/ (Rust crate) | ~? | TBD per #3317 | Out of scope here; the Rust shell is mostly IPC + terminal management. May survive as the spawn-loop host. |
Net delete: ~22k LOC (the three Delete rows: ~16,500 + ~3,500 + ~2,140). Net add: ~150 LOC spawn loop. Ratio holds.
Features to port to sweep before deprecation
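Four features in the Python shepherd are worth preserving: one must be ported, one is an optional port, one already exists in sweep, and one shifts to the spawn loop. The rest can be lost without harm.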
Port (must)
- --resume from checkpoint after crash: the Python shepherd writes a checkpoint per phase and resumes mid-flight on a new spawn. Sweep currently does not. Port estimate: 100-150 LOC in the sweep skill (write .loom/sweep-checkpoint/<issue>.json at each phase boundary; on entry, check for it and skip completed stages).
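The checkpoint behavior described above could look roughly like the following. This is a sketch under stated assumptions, not the shepherd's or sweep's actual code: the phase names are taken from the kill-list note (curator/builder/judge/doctor/merge), the JSON shape is invented for illustration, and the helper names are hypothetical. Only the .loom/sweep-checkpoint/<issue>.json path comes from the proposal.

```python
import json
from pathlib import Path
from typing import List

# Assumed phase order, borrowed from the stages the proposal says sweep implements.
PHASES = ["curator", "builder", "judge", "doctor", "merge"]


def checkpoint_path(root: Path, issue: int) -> Path:
    return root / ".loom" / "sweep-checkpoint" / f"{issue}.json"


def completed_phases(root: Path, issue: int) -> List[str]:
    """Read the checkpoint; a missing file means nothing has completed yet."""
    path = checkpoint_path(root, issue)
    if not path.exists():
        return []
    return json.loads(path.read_text())["completed"]


def mark_phase_done(root: Path, issue: int, phase: str) -> None:
    """Record a finished phase at a phase boundary."""
    path = checkpoint_path(root, issue)
    path.parent.mkdir(parents=True, exist_ok=True)
    done = completed_phases(root, issue)
    if phase not in done:
        done.append(phase)
    # Write to a temp file and rename, so a crash mid-write cannot leave partial JSON.
    tmp = path.with_suffix(".json.tmp")
    tmp.write_text(json.dumps({"issue": issue, "completed": done}))
    tmp.replace(path)


def phases_to_run(root: Path, issue: int) -> List[str]:
    """On entry, skip every phase already recorded in the checkpoint."""
    done = set(completed_phases(root, issue))
    return [p for p in PHASES if p not in done]
```

A resumed run would call phases_to_run first and start from whatever it returns; a run with no checkpoint gets the full list.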
Port (optional, low cost)
- Milestone reporting to .loom/progress/: sweep can call the existing report-milestone.sh from its phase transitions. ~20 lines of additions.
Already in sweep
- Judge-retry with backoff: sweep's inline Doctor cycle is the architectural equivalent. The Python shepherd's judge_retry milestones are redundant.
Shift to spawn loop
- Mid-flight token rotation on 429: when a sweep instance hits its account's rate limit, the spawn loop sees the process exit non-zero, marks the token bad, and relaunches /loom:sweep <remaining-issues> on a fresh token. This is coarser than the shepherd's per-call retry but operationally equivalent at the batch level, and it falls out of claude-wrapper.sh's existing error classification.
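The batch-level rotation can be sketched as a loop over ranked tokens. This is illustrative only: launch is a stand-in for spawning claude -p '/loom:sweep <remaining-issues>' and returning its exit code, the function name run_with_rotation is hypothetical, and the sketch treats every non-zero exit as a bad token, whereas claude-wrapper.sh's error classification presumably distinguishes rate limits from other failures. Narrowing the issue list to the unfinished ones is also left to the caller.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def run_with_rotation(
    issues: Sequence[int],
    tokens: Sequence[str],
    launch: Callable[[Sequence[int], str], int],
) -> Tuple[Optional[str], List[str]]:
    """Try tokens in ranked order until one sweep run exits 0.

    Returns (token that succeeded or None, tokens marked bad along the way).
    """
    bad: List[str] = []
    for token in tokens:
        if launch(issues, token) == 0:
            return token, bad
        # Non-zero exit: mark this token bad and relaunch on the next one.
        bad.append(token)
    return None, bad
```

The retry granularity here is one whole sweep process, which is the coarseness the bullet above accepts in exchange for deleting the per-call retry machinery.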
Proposed minimal spawn loop (~150 LOC)
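```python
# loom-spawn-loop.py (conceptual sketch; the helpers and constants used here
# are assumed to exist elsewhere and are not defined in this sketch)
from time import sleep


def main():
    while not shutdown_requested():
        ready = gh_issues_with_label("loom:issue", limit=50)
        # Clamp at zero: a negative slice bound would select issues, not none.
        slots_free = max(0, MAX_SHEPHERDS - count_running_processes())
        for issue in ready[:slots_free]:
            if claim_issue(issue):  # skip issues that could not be claimed
                token = select_token_via_ranking()  # existing logic
                spawn_claude_p(
                    f"/loom:sweep {issue}",
                    token=token,
                    detach=True,
                )
        sleep(POLL_INTERVAL_SECONDS)
```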
That's the whole loop. It owns nothing — no state file, no pipeline-state JSON, no completion harvesting, no warning system. Issue label transitions (loom:issue → loom:building → closed) are the only state, and claim_issue is an atomic label swap via gh issue edit. If the loop dies, the running sweep processes survive and finish their issues. If a sweep process crashes mid-flight, the next loop tick re-claims and --resumes.
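As a sketch of the claim step, the label swap can be expressed as a single gh issue edit call using its --remove-label and --add-label flags. The function names and the injectable run parameter are illustrative choices, not part of the proposal. Whether one edit call is enough to stop two loops from claiming the same issue is the proposal's assumption; this sketch adds no locking of its own.

```python
import subprocess
from typing import Callable, List, Sequence


def claim_argv(issue: int) -> List[str]:
    """gh command that swaps loom:issue for loom:building in one edit call."""
    return [
        "gh", "issue", "edit", str(issue),
        "--remove-label", "loom:issue",
        "--add-label", "loom:building",
    ]


def _run(argv: Sequence[str]) -> int:
    """Default runner: execute the command and return its exit code."""
    return subprocess.run(list(argv)).returncode


def claim_issue(issue: int, run: Callable[[Sequence[str]], int] = _run) -> bool:
    """Return True when the label swap command exits 0."""
    return run(claim_argv(issue)) == 0
```

Passing a fake run makes the claim logic checkable without touching GitHub, for example claim_issue(12, run=lambda argv: 0).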
Gaps to address
| Gap | Status | Resolution |
|---|---|---|
| Periodic support roles (champion, auditor, guide on 5-15min intervals) | Open from #3317 | GitHub Actions schedules or a separate /schedule cron; not the spawn loop's job. |
| Cross-session retry history (proposal cooldowns: last_architect_trigger) | Open from #3317 | .loom/sweep-history.json (~50 LOC), as in #3317. |
| Sphere downstream coordination (loom-{daemon,shepherd} subagent files) | Open from #3317 | Migration plan ahead of sphere's loom-install rebase. Same as #3317. |
| Sweep checkpoint/resume | New | Phase 0 port; ~150 LOC into sweep skill. |
| Daemon-state.json consumers (Tauri app dashboards, MCP tools) | New | Inventory uses; either retire or have spawn loop emit a compatible-but-minimal daemon-state.json for display. |
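For the cross-session retry history, a cooldown check over .loom/sweep-history.json might be as small as the following. This is a sketch under assumptions: the file is taken to be a flat JSON object mapping keys such as last_architect_trigger to Unix timestamps, and the function names are hypothetical. The actual shape proposed in #3317 is not reproduced here.

```python
import json
import time
from pathlib import Path
from typing import Optional


def cooldown_elapsed(
    history: dict, key: str, cooldown_seconds: float, now: Optional[float] = None
) -> bool:
    """True if `key` has never fired, or last fired at least `cooldown_seconds` ago."""
    now = time.time() if now is None else now
    last = history.get(key)
    return last is None or (now - last) >= cooldown_seconds


def record_trigger(path: Path, key: str, now: Optional[float] = None) -> dict:
    """Load the history file if it exists, stamp `key`, and write it back."""
    history = json.loads(path.read_text()) if path.exists() else {}
    history[key] = time.time() if now is None else now
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(history))
    return history
```

Keeping the history in one small file outside the spawn loop preserves the loop's "owns nothing" property while still letting proposal triggers respect their cooldowns across sessions.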
Phased plan
| Phase | Description | Status |
|---|---|---|
| 0 | Port --resume checkpoint behavior into sweep skill | Not started |
| 1 | Build the ~150-line spawn loop; add LOOM_USE_SPAWN_LOOP=1 opt-in | Not started |
| 2 | Soft-deprecate /shepherd skill (warns the user, still works); soft-deprecate daemon_v2 brain (works but warns) | Not started |
| 3 | Delete Python shepherd, daemon_v2 brain, and /shepherd skill; spawn loop becomes default | vN.0 |
| 4 | Sphere coordination + downstream migration | Parallel to Phase 2-3 |
Open questions for the architect / champion
- Spawn loop host: does this run as the Rust loom-daemon crate (slimmed dramatically), as a new Python loom-spawn package, or as a shell script? Bias toward shell: the loop is small enough to read in one screen.
- Backwards compatibility: do existing /shepherd <N> invocations get rewritten to /loom:sweep <N> (alias), or hard-deleted at Phase 3?
- Tauri app integration: the Loom desktop app reads daemon-state.json and .loom/progress/. Does it need a stable shape from the spawn loop, or do we let those files retire?
- Should sweep itself gain a --detached or --daemon-mode flag so the spawn loop can launch it cleanly, or is claude -p invocation enough?
Related
--builders-per-wave), feat(sweep): natural-language selector parsing for /loom:sweep #3318 (NL selectors), feat(sweep): --dry-run flag for /loom:sweep #3319 (--dry-run)