architect: refine #3317 — delete shepherd + daemon brains, keep minimal spawn loop for multi-account

# Refine #3317: delete shepherd + daemon brains, keep minimal spawn loop for multi-account

## TL;DR

#3317 proposed deleting `loom-daemon` + the shepherd model entirely in favor of `/loom:sweep`. The framing was approximately correct but over-aggressive on one axis: **the process-spawn loop is load-bearing for multi-account load balancing** and cannot be deleted, even though the shepherd brain (~16.5k Python LOC) and daemon brain (~4.2k Python LOC) can. This proposal refines #3317 with the right kill list.

**Net effect:** ~21k Python LOC deleted (~22k once the `/shepherd` skill files in the kill list are counted), ~150 LOC of minimal spawn loop added. Same order of magnitude as #3317's "10:1 simplification" claim, with one critical piece preserved.

## Why this is the right cut now (not earlier)

A lot of historical investment went into making the Python shepherd robust: checkpoint/resume, judge-retry loops, milestone reporting, stuck detection, heartbeat tracking, force-mode label management, ~10 phase modules totalling 11,500 lines. That investment was *correct given the tooling constraints of 2025*: Claude Code's native parallel subagent dispatch was either unavailable, unreliable, or untested for orchestration workloads. Loom had to build robustness from scratch in Python because the harness couldn't be trusted to do it.

That constraint has relaxed. Claude Code now reliably dispatches parallel Task subagents at one level of depth (validated empirically up to `--builders-per-wave 3` in sweep), handles streaming and tool-use without the pathologies that motivated the shell shepherd, and provides native progress reporting through its own conversation log. The robustness sweep needs is much smaller than the robustness the Python shepherd had to invent, because sweep delegates most of it to a more mature harness.

In short: the Python shepherd is not overengineered. It was correctly engineered for its era. The era has changed.

## The constraint that survives: multi-account

Claude Code's Task tool (subagent dispatch) inherits the parent session's OAuth token. There is no in-session mechanism to rotate accounts across subagents. Multi-account load balancing therefore requires **per-task process spawn** — a fresh `claude` invocation through `spawn-claude.sh` that picks a token from `.loom/tokens/.ranking` at process start.

A single `/loom:sweep` instance is hard-pinned to one account for its entire run. For users running multi-account batches (Pro/Max with rotation across 3-10 OAuth tokens), this is the deciding architectural constraint.

The implication: a thin spawn loop must survive. Its only job is to pick tokens and launch `claude -p '/loom:sweep <N>'` per unit of work.
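
A minimal sketch of that per-spawn selection, under stated assumptions: it treats `.loom/tokens/.ranking` as one token name per line with the most preferred first (the file's real format is not described here), and `pick_token` and `sweep_command` are hypothetical names. The actual selection logic lives in `spawn-claude.sh`; this only illustrates why the choice has to happen at process start.

```python
from typing import Iterable, Optional


def pick_token(ranking_lines: Iterable[str], bad_tokens: set[str]) -> Optional[str]:
    """Return the first ranked token name that is not marked bad.

    Assumes one token name per line, best first (hypothetical format).
    """
    for line in ranking_lines:
        name = line.strip()
        if not name or name.startswith("#"):
            continue  # skip blank lines and comments
        if name not in bad_tokens:
            return name
    return None  # every ranked token is currently unusable


def sweep_command(issue: int) -> list[str]:
    """Build the argv for one sweep process, one issue per process."""
    return ["claude", "-p", f"/loom:sweep {issue}"]
```

Because each process resolves its token once, before `claude` starts, two issues launched back to back can land on different accounts, which a single in-session sweep cannot do.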

## Kill list

| Component | LOC | Disposition | Notes |
|---|---|---|---|
| `loom-tools/src/loom_tools/shepherd/` | ~16,500 | **Delete** | Sweep already implements curator/builder/judge/doctor/merge. Builder phase alone is 6,062 LOC; sweep delegates this to the `loom-builder` subagent. |
| `loom-tools/src/loom_tools/daemon_v2/loop.py`, `iteration.py`, `actions/{completions,proposals,retry_blocked,spinning,support_roles}.py` | ~3,500 | **Delete** | Work generation, pool scaling, support-role triggers — sweep partitions work eagerly; periodic role triggers move to GitHub Actions / cron (#3317 Gap A). |
| `defaults/.claude/commands/loom/shepherd.md` + `shepherd-lifecycle.md` (+ mirrored `roles/`) | ~2,140 | **Delete** | `/shepherd` is a signal-writer proxy to the daemon; dies with it. |
| `agent_spawn.py` | ~800 | **Keep + simplify** | Token selection + process launch. Strip shepherd-specific spawn paths; keep the wrapper-script integration. |
| `claude-wrapper.sh`, `spawn-claude.sh` | ~600 | **Keep** | Token retry on `TOKEN_EXPIRED` / `TOKEN_EXHAUSTED`; allowlist + bad-tokens handling; ranking-based selection. |
| `loom-tools/src/loom_tools/daemon_v2/actions/shepherds.py` (spawn path only) + `signals.py` + `command_poller.py` | ~500 | **Distill into ~150 LOC spawn loop** | New script picks ready issues by label, claims them, spawns `claude -p '/loom:sweep <N>'` per token slot. Replaces ~700 LOC of pool-management state. |
| `.loom/signals/`, `.loom/progress/` directories | n/a | **Optional keep** | Useful for status visibility; sweep can write to `.loom/progress/` via existing `report-milestone.sh`. |
| `loom-daemon/` (Rust crate) | ~? | **TBD per #3317** | Out of scope here; the Rust shell is mostly IPC + terminal management. May survive as the spawn-loop host. |

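**Net delete:** ~22k LOC (16,500 + 3,500 + 2,140 from the three Delete rows; the ~21k in the TL;DR counts only the Python brains). **Net add:** ~150 LOC spawn loop. Ratio holds.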

## Features to port to sweep before deprecation

Four features of the Python shepherd matter: one must be ported, one is an optional port, one already exists in sweep, and one shifts to the spawn loop. The rest can be lost without harm.

### Port (must)
1. **`--resume` from checkpoint after crash** — Python shepherd writes a checkpoint per phase and resumes mid-flight on a new spawn. Sweep currently does not. Port estimate: 100-150 LOC in sweep skill (a `.loom/sweep-checkpoint/<issue>.json` write at each phase boundary; on entry, check for it and skip completed stages).
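
The checkpoint write and the skip-on-entry check could look roughly like the sketch below. It assumes the phase names follow the curator/builder/judge/doctor/merge sequence named in the kill list and that the file holds a simple list of completed phases; the JSON shape and every function name here are illustrative, not an existing sweep interface.

```python
import json
from pathlib import Path

# Assumed phase order, taken from the kill list's description of sweep.
PHASES = ["curator", "builder", "judge", "doctor", "merge"]


def checkpoint_path(root: Path, issue: int) -> Path:
    return root / ".loom" / "sweep-checkpoint" / f"{issue}.json"


def load_completed(root: Path, issue: int) -> list[str]:
    """Return the phases already finished for this issue (empty if no checkpoint)."""
    path = checkpoint_path(root, issue)
    if not path.exists():
        return []
    return json.loads(path.read_text())["completed"]


def mark_completed(root: Path, issue: int, phase: str) -> None:
    """Record a finished phase; call at each phase boundary."""
    completed = load_completed(root, issue)
    if phase not in completed:
        completed.append(phase)
    path = checkpoint_path(root, issue)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a sibling file, then rename, so a crash never leaves a half-written checkpoint.
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps({"issue": issue, "completed": completed}))
    tmp.replace(path)


def remaining_phases(root: Path, issue: int) -> list[str]:
    """On entry (or --resume), return only the phases still to run."""
    done = set(load_completed(root, issue))
    return [p for p in PHASES if p not in done]
```

A resumed sweep would iterate over `remaining_phases(...)` instead of `PHASES`, which is the whole of the skip-completed-stages behavior.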

### Port (optional, low cost)
2. **Milestone reporting to `.loom/progress/`** — sweep can call the existing `report-milestone.sh` from its phase transitions. ~20 lines of additions.

### Already in sweep
3. **Judge-retry with backoff** — sweep's inline Doctor cycle is the architectural equivalent. The Python shepherd's `judge_retry` milestones are redundant.

### Shift to spawn loop
4. **Mid-flight token rotation on 429** — when a sweep instance hits its account's rate limit, the spawn loop sees the process exit non-zero, marks the token bad, and relaunches `/loom:sweep <remaining-issues>` on a fresh token. This is *coarser* than the shepherd's per-call retry but operationally equivalent at the batch level, and it falls out of `claude-wrapper.sh`'s existing error classification.
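
A sketch of that coarse rotation, assuming the loop can observe each sweep process's exit code. For brevity it treats every non-zero exit as a reason to retire the token, whereas the text above relies on `claude-wrapper.sh`'s error classification to tell rate limits from other failures. `run_with_rotation` and the `launch` callback are hypothetical names.

```python
from typing import Callable, Optional, Sequence


def run_with_rotation(
    tokens: Sequence[str],
    launch: Callable[[str], int],
) -> tuple[Optional[str], list[str]]:
    """Launch the same work on successive tokens until one exits 0.

    `launch(token)` starts a sweep process on that token and returns its
    exit code. Returns (token that succeeded or None, tokens marked bad).
    """
    bad: list[str] = []
    for token in tokens:
        exit_code = launch(token)
        if exit_code == 0:
            return token, bad
        # Simplification: any failure retires the token for this batch.
        bad.append(token)
    return None, bad
```

In a real loop, `launch` would relaunch `/loom:sweep` on the remaining issues, so the checkpoint port above is what keeps a relaunch from redoing finished phases.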

## Proposed minimal spawn loop (~150 LOC)

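```python
# loom-spawn-loop.py: conceptual sketch. Helper names stand in for logic
# described in this proposal; they are not defined here.
from time import sleep


def main():
    while not shutdown_requested():
        ready = gh_issues_with_label("loom:issue", limit=50)
        # Clamp at zero: a negative count would slice from the end of `ready`
        # and launch work while the pool is already over capacity.
        slots_free = max(0, MAX_SHEPHERDS - count_running_processes())
        for issue in ready[:slots_free]:
            # The label swap is the claim; skip issues that were already taken.
            if claim_issue(issue):
                token = select_token_via_ranking()  # existing logic
                spawn_claude_p(
                    f"/loom:sweep {issue}",
                    token=token,
                    detach=True,
                )
        sleep(POLL_INTERVAL_SECONDS)
```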

That's the whole loop. It owns nothing: no state file, no pipeline-state JSON, no completion harvesting, no warning system. Issue label transitions (`loom:issue` → `loom:building` → closed) are the only state, and `claim_issue` is an atomic label swap via `gh issue edit`. If the loop dies, the running sweep processes survive and finish their issues. If a sweep process crashes mid-flight, its issue is still labeled `loom:building`, so it must first be returned to `loom:issue` (a release step the sketch does not show); once it is back in the ready set, the next loop tick re-claims it and `--resume`s from the checkpoint.
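
The claim step can be sketched as a single `gh issue edit` call that removes one label and adds the other. The function names are illustrative, and whether GitHub applies the two label changes atomically with respect to a concurrent claimer is an assumption this sketch does not verify.

```python
import subprocess
from typing import Callable


def claim_argv(issue: int) -> list[str]:
    """argv for swapping loom:issue -> loom:building in one gh call."""
    return [
        "gh", "issue", "edit", str(issue),
        "--remove-label", "loom:issue",
        "--add-label", "loom:building",
    ]


def claim_issue(issue: int, run: Callable = subprocess.run) -> bool:
    """Return True if the label swap succeeded (gh exited 0).

    `run` is injectable so the claim can be exercised without calling gh.
    """
    result = run(claim_argv(issue), capture_output=True, text=True)
    return result.returncode == 0


def free_slots(max_slots: int, running: int) -> int:
    """Slots available this tick; never negative, even if the pool is over capacity."""
    return max(0, max_slots - running)
```

Keeping the label as the only record of a claim is what lets the loop restart without reading any state of its own.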

## Gaps to address

| Gap | Status | Resolution |
|---|---|---|
| Periodic support roles (champion, auditor, guide on 5-15min intervals) | Open from #3317 | GitHub Actions schedules or a separate `/schedule` cron; not the spawn loop's job. |
| Cross-session retry history (proposal cooldowns: `last_architect_trigger`) | Open from #3317 | `.loom/sweep-history.json` (~50 LOC), as in #3317. |
| Sphere downstream coordination (`loom-{daemon,shepherd}` subagent files) | Open from #3317 | Migration plan ahead of sphere's loom-install rebase. Same as #3317. |
| Sweep checkpoint/resume | New | Phase 0 port; ~150 LOC into sweep skill. |
| Daemon-state.json consumers (Tauri app dashboards, MCP tools) | New | Inventory uses; either retire or have spawn loop emit a compatible-but-minimal `daemon-state.json` for display. |
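
For the cross-session cooldown row, one possible shape for `.loom/sweep-history.json` is a flat map from trigger name to the epoch-seconds timestamp of its last run. That shape, and the helper names below, are assumptions for illustration; #3317 may specify something different.

```python
import json
from pathlib import Path


def load_history(path: Path) -> dict:
    """Read the history file, treating a missing file as empty history."""
    if not path.exists():
        return {}
    return json.loads(path.read_text())


def cooldown_elapsed(history: dict, key: str, cooldown_seconds: float, now: float) -> bool:
    """True if `key` has never fired or last fired at least `cooldown_seconds` ago."""
    last = history.get(key)
    return last is None or (now - last) >= cooldown_seconds


def record_trigger(path: Path, key: str, now: float) -> None:
    """Persist the time a trigger fired so later sessions can honor the cooldown."""
    history = load_history(path)
    history[key] = now
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(history, indent=2))
```

A caller would check `cooldown_elapsed(load_history(p), "last_architect_trigger", ...)` before firing and call `record_trigger` afterwards.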

## Phased plan

| Phase | Description | Status |
|---|---|---|
| 0 | Port `--resume` checkpoint behavior into sweep skill | Not started |
| 1 | Build the ~150-line spawn loop; add `LOOM_USE_SPAWN_LOOP=1` opt-in | Not started |
| 2 | Soft-deprecate `/shepherd` skill (warns the user, still works); soft-deprecate daemon_v2 brain (works but warns) | Not started |
| 3 | Delete Python shepherd, daemon_v2 brain, and `/shepherd` skill; spawn loop becomes default | vN.0 |
| 4 | Sphere coordination + downstream migration | Parallel to Phase 2-3 |

## Open questions for the architect / champion

1. **Spawn loop host:** does this run as the Rust `loom-daemon` crate (slimmed dramatically), as a new Python `loom-spawn` package, or as a shell script? Bias toward shell — the loop is small enough to read in one screen.
2. **Backwards compatibility:** do existing `/shepherd <N>` invocations get rewritten to `/loom:sweep <N>` (alias), or hard-deleted at Phase 3?
3. **Tauri app integration:** the Loom desktop app reads `daemon-state.json` and `.loom/progress/`. Does it need a stable shape from the spawn loop, or do we let those files retire?
4. **Should sweep itself gain a `--detached` or `--daemon-mode` flag** so the spawn loop can launch it cleanly, or is `claude -p` invocation enough?

## Related

- Original proposal: #3317 (closed 2026-05-28)
- Subagent-dispatch hazard: #3289 (sweep's "one level deep" rule)
- Token rotation infrastructure: #3234, #3236
- Sweep extensions merged: #3316 (`--builders-per-wave`), #3318 (NL selectors), #3319 (`--dry-run`)

