feat(coord): resume protocol + health-check transition triggers (#349) #355

Merged
mercurialsolo merged 1 commit into `main` from `feat/supervisor-pr6-resume-health`
Jun 10, 2026

@mercurialsolo

PR6 of the supervisor RFC (#342). Closes #349.

The supervisor now reacts to session death and the 10 shipped health checks instead of letting them sit as dashboard advisories.

Resume protocol (`src/coord/resume.rs`, RFC §7)

Three contracts:

  1. Tree-state hash — snapshotted at every spawn/assign, compared at resume. Mismatch = external edits between death and resume; the supervisor escalates to NeedsHuman instead of resuming blindly. Git-based when available, mtime-fallback otherwise. Fails open to `TreeStateHash::empty()` so resume keeps working when git or fs reads fail.
  2. Resume the task, not the session — no dependence on Claude Code's session-resume internals. `build_recovery_prompt` composes: original prompt → resume framing → drift warning (if any) → prior verifier failures → autopsy summary.
  3. Bounded retries — attempts cap from PR4 still applies. Resume bumps attempt; over-cap → NeedsHuman with `resume-cap` cause.
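
Contracts 1 and 3 can be read as a single gate in front of every resume. The following is a minimal sketch under stated assumptions, not the code in `src/coord/resume.rs`: the `TreeStateHash` internals, the `decide_resume` function, the `ResumeDecision` enum, and the `tree-drift` cause string are all hypothetical. Only `TreeStateHash::empty()` and the `resume-cap` cause come from the description above. The sketch also assumes that "fails open" means the drift comparison is skipped when either snapshot is empty, and that the cap is checked before drift.

```rust
/// Hypothetical stand-in for the PR's `TreeStateHash`; the real fields are
/// not shown in the PR description.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TreeStateHash(String);

impl TreeStateHash {
    pub fn new(digest: &str) -> Self {
        TreeStateHash(digest.to_string())
    }

    /// Fail-open value for when git or fs reads fail.
    pub fn empty() -> Self {
        TreeStateHash(String::new())
    }

    pub fn is_empty(&self) -> bool {
        self.0.is_empty()
    }
}

/// Hypothetical result type for the resume gate.
#[derive(Debug, PartialEq, Eq)]
pub enum ResumeDecision {
    Resume { next_attempt: u32 },
    NeedsHuman { cause: &'static str },
}

/// Decide whether a dead task may be resumed.
/// Assumption: the attempt is bumped first, and exceeding the cap escalates.
pub fn decide_resume(
    at_spawn: &TreeStateHash,
    at_resume: &TreeStateHash,
    attempt: u32,
    max_attempts: u32,
) -> ResumeDecision {
    let next_attempt = attempt + 1;
    if next_attempt > max_attempts {
        return ResumeDecision::NeedsHuman { cause: "resume-cap" };
    }
    // Assumption: an empty hash on either side means "could not snapshot",
    // so the comparison is skipped and resume proceeds.
    let comparable = !at_spawn.is_empty() && !at_resume.is_empty();
    if comparable && at_spawn != at_resume {
        // "tree-drift" is an illustrative cause string, not one from the PR.
        return ResumeDecision::NeedsHuman { cause: "tree-drift" };
    }
    ResumeDecision::Resume { next_attempt }
}
```

With matching hashes and attempt 1 of 3 the gate returns `Resume { next_attempt: 2 }`. A mismatch, or attempt 3 of 3, returns `NeedsHuman`.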

Health-check transition triggers (RFC §6)

`Policy` gains a `health_actions: HealthActionMap` keyed on the 10 shipped `HealthCheck::name` strings. Each maps to `Resume`, `Escalate`, or `Ignore`. Defaults match the RFC §6 table:

| Health check       | Default action                         |
|--------------------|----------------------------------------|
| Stalled            | Resume                                 |
| Loop detected      | Escalate (NeedsHuman)                  |
| Repetition         | Escalate (NeedsHuman)                  |
| Error acceleration | Resume                                 |
| Cost spike         | Ignore                                 |
| Context saturation | Ignore (handled by existing pipeline)  |
| Cognitive decay    | Ignore                                 |
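
As an illustration of how such a map might be populated, here is a sketch that assumes `HealthActionMap` is a plain `HashMap<String, HealthAction>`. The PR does not show the real definition. `default_health_actions` and `action_for` are hypothetical helpers. Only the seven checks listed in the table are included, and treating an unlisted name as `Ignore` is an assumption of this sketch.

```rust
use std::collections::HashMap;

/// The three actions named in the PR description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HealthAction {
    Resume,
    Escalate,
    Ignore,
}

/// Assumption: the map is keyed on `HealthCheck::name` strings.
pub type HealthActionMap = HashMap<String, HealthAction>;

/// Hypothetical constructor mirroring the rows of the table above.
pub fn default_health_actions() -> HealthActionMap {
    [
        ("Stalled", HealthAction::Resume),
        ("Loop detected", HealthAction::Escalate),
        ("Repetition", HealthAction::Escalate),
        ("Error acceleration", HealthAction::Resume),
        ("Cost spike", HealthAction::Ignore),
        ("Context saturation", HealthAction::Ignore),
        ("Cognitive decay", HealthAction::Ignore),
    ]
    .iter()
    .map(|&(name, action)| (name.to_string(), action))
    .collect()
}

/// Look up the action for a check name.
/// Assumption: names missing from the map are treated as `Ignore`.
pub fn action_for(map: &HealthActionMap, name: &str) -> HealthAction {
    map.get(name).copied().unwrap_or(HealthAction::Ignore)
}
```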

Reconciler additions

`ObservedSession` gains `health_alerts: Vec<String>`. The Running-task pass joins each task to its observable session via `tasks::latest_session_id` and emits:

  • `Action::Resume` when status == Dead (RFC §7's canonical trigger)
  • `Action::Resume` / `Action::EscalateHuman` per `HealthAction` for the first non-Ignore alert

`Unknown` status remains the no-actuation backstop — health alerts on an Unknown session are ignored. First actionable alert wins so we don't emit conflicting Resume + Escalate for the same task on the same tick.
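
The per-task decision described above could look roughly like the sketch below. It is a simplified model, not the reconciler's actual code: `SessionStatus`, its `Alive` variant, the `reconcile_running` function, and the shape of `Action` are assumptions. The cause strings `session_died` and `health:<name>` follow the test names listed under Verification. The sketch also assumes that a Dead status takes precedence over any alerts.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HealthAction {
    Resume,
    Escalate,
    Ignore,
}

/// Hypothetical status enum; `Alive` is an assumed variant name.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SessionStatus {
    Alive,
    Dead,
    Unknown,
}

pub struct ObservedSession {
    pub status: SessionStatus,
    pub health_alerts: Vec<String>,
}

/// Simplified action type carrying only a cause string.
#[derive(Debug, PartialEq, Eq)]
pub enum Action {
    Resume { cause: String },
    EscalateHuman { cause: String },
}

/// Emit at most one action for a Running task on this tick.
pub fn reconcile_running(
    session: &ObservedSession,
    policy: &HashMap<String, HealthAction>,
) -> Option<Action> {
    match session.status {
        // Unknown is the no-actuation backstop: alerts are ignored.
        SessionStatus::Unknown => return None,
        SessionStatus::Dead => {
            return Some(Action::Resume { cause: "session_died".to_string() })
        }
        SessionStatus::Alive => {}
    }
    // First actionable alert wins, so a single tick never yields both
    // Resume and EscalateHuman for the same task.
    session.health_alerts.iter().find_map(|name| {
        match policy.get(name).copied().unwrap_or(HealthAction::Ignore) {
            HealthAction::Resume => Some(Action::Resume {
                cause: format!("health:{}", name),
            }),
            HealthAction::Escalate => Some(Action::EscalateHuman {
                cause: format!("health:{}", name),
            }),
            HealthAction::Ignore => None,
        }
    })
}
```

Using `find_map` over the alerts in order is one straightforward way to get the "first actionable wins" behavior: ignored alerts are skipped, and iteration stops at the first alert that maps to `Resume` or `Escalate`.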

Actuator additions

`Action::Resume`: Only valid from Running. Attempts cap → NeedsHuman (`resume-cap`). Otherwise Running → Resuming → Pending. Re-entering via Pending lets the reconciler pick the same assignment lane the original task did.

`Action::EscalateHuman`: Idempotent against already-terminal tasks. Transitions to NeedsHuman with the caller's cause string.
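
A compact model of these two transitions, written as a sketch under assumptions rather than as the actuator's real code: `TaskState`, `Task`, `apply_resume`, and `apply_escalate` are hypothetical names, and which states count as terminal (here `NeedsHuman` and `Done`) is a guess. The `resume-cap` cause and the Running → Resuming → Pending path are taken from the description.

```rust
/// Hypothetical task states; only the ones needed for this sketch.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum TaskState {
    Pending,
    Running,
    Resuming,
    NeedsHuman { cause: String },
    Done,
}

#[derive(Debug)]
pub struct Task {
    pub state: TaskState,
    pub attempt: u32,
}

/// Assumption: NeedsHuman and Done are the terminal states.
fn is_terminal(state: &TaskState) -> bool {
    matches!(state, TaskState::NeedsHuman { .. } | TaskState::Done)
}

/// Apply `Action::Resume`. Returns the states the task passed through,
/// which is empty when the action is a no-op.
pub fn apply_resume(task: &mut Task, max_attempts: u32) -> Vec<TaskState> {
    // Only valid from Running; anything else is treated as a no-op.
    if task.state != TaskState::Running {
        return Vec::new();
    }
    let next_attempt = task.attempt + 1;
    if next_attempt > max_attempts {
        task.state = TaskState::NeedsHuman { cause: "resume-cap".to_string() };
        return vec![task.state.clone()];
    }
    task.attempt = next_attempt;
    // Running -> Resuming -> Pending; the task re-enters the normal
    // assignment path from Pending.
    task.state = TaskState::Pending;
    vec![TaskState::Resuming, TaskState::Pending]
}

/// Apply `Action::EscalateHuman`. Returns false when the task was already
/// terminal and nothing changed.
pub fn apply_escalate(task: &mut Task, cause: &str) -> bool {
    if is_terminal(&task.state) {
        return false;
    }
    task.state = TaskState::NeedsHuman { cause: cause.to_string() };
    true
}
```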

Verification

```
cargo check / clippy / fmt — both feature sets, all green
cargo test --all-targets → 759 + 770 + 78 + 8 = 1615 pass
```

Six new reconciler tests cover the full path:

  • Dead session → Resume(session_died)
  • Unknown session → no actuation (Stalled alert ignored)
  • Stalled alert → Resume(health:Stalled)
  • Loop detected alert → EscalateHuman(health:Loop detected)
  • Cost spike alert (Ignore) → no actuation
  • Mixed alerts → first actionable wins (Resume from Stalled, not Escalate from later Loop)

Out of scope

  • Wiring the live JSONL/ps Sensors to populate `health_alerts` — type surface is here; runtime adapter lands with PR7.
  • Cost-spike policy plane (Ignore today) — lands when per-task budget enforcement gets its own knobs.
  • Autopsy summary injected at resume time — helper is wired and tested; threading it through the actuator's recovery-context build needs the live spawn path that's also PR7's pickup.

Test plan

  • Submit a task that runs a long sleep; `kill -9` the headless daemon's spawned session → reconciler emits Resume on next tick; task moves Running → Resuming → Pending → Assigned.
  • Inject a "Stalled" health alert via test/JSON poke → same Resume path triggers.
  • Inject "Loop detected" → EscalateHuman path triggers; task lands at NeedsHuman.

🤖 Generated with Claude Code

PR6 of the supervisor RFC (#342). The supervisor now reacts to session
death and the 10 shipped health checks instead of letting them sit as
dashboard advisories.

## Resume protocol (`src/coord/resume.rs`, RFC §7)

Three contracts:

1. **Tree-state hash** — snapshotted at every spawn/assign, compared at
   resume. Mismatch = external edits between death and resume; the
   supervisor escalates to NeedsHuman instead of resuming blindly.
   Git-based when available, mtime-fallback otherwise; both produce
   stable strings any consumer can compare for equality. Fails open to
   `TreeStateHash::empty()` so resume keeps working when git or fs
   reads fail.

2. **Resume the task, not the session** — no dependence on Claude
   Code's session-resume internals. `build_recovery_prompt` composes:
   original prompt → resume framing → drift warning (if any) → prior
   verifier failures (so the next attempt doesn't repeat them) →
   autopsy summary.

3. **Bounded retries** — attempts cap from PR4 still applies. Resume
   bumps attempt; over-cap → NeedsHuman with `resume-cap` cause.

`summarize_autopsy` lifts the relevant fields out of
`brain::autopsy::AutopsyReport` so this module owns the "what does
resume need" view.
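
To make the composition order in contract 2 concrete, here is a sketch of a prompt builder. The real `build_recovery_prompt` signature is not shown in this PR description, so the `RecoveryContext` struct, its field names, and the wording of every section are assumptions. Only the section order (original prompt, resume framing, optional drift warning, prior verifier failures, autopsy summary) comes from the text above.

```rust
/// Hypothetical input bundle; the real function's parameters are not
/// shown in the PR description.
pub struct RecoveryContext {
    pub original_prompt: String,
    pub drift_warning: Option<String>,
    pub verifier_failures: Vec<String>,
    pub autopsy_summary: Option<String>,
}

/// Compose a recovery prompt in the order the PR describes.
pub fn build_recovery_prompt(ctx: &RecoveryContext) -> String {
    let mut sections: Vec<String> = Vec::new();

    // 1. Original prompt, unchanged.
    sections.push(ctx.original_prompt.clone());

    // 2. Resume framing (illustrative wording).
    sections.push(
        "A previous attempt at this task ended early. Continue the task from the current state of the tree."
            .to_string(),
    );

    // 3. Drift warning, only when one exists.
    if let Some(warning) = &ctx.drift_warning {
        sections.push(format!("Warning: {}", warning));
    }

    // 4. Prior verifier failures, so the next attempt does not repeat them.
    if !ctx.verifier_failures.is_empty() {
        let mut block = String::from("Prior verifier failures:");
        for failure in &ctx.verifier_failures {
            block.push_str("\n- ");
            block.push_str(failure);
        }
        sections.push(block);
    }

    // 5. Autopsy summary, when available.
    if let Some(summary) = &ctx.autopsy_summary {
        sections.push(format!("Autopsy summary: {}", summary));
    }

    sections.join("\n\n")
}
```

Optional sections are omitted entirely rather than rendered empty, which keeps the prompt short when there was no drift and no prior verifier failure.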

## Health-check transition triggers (RFC §6)

`Policy` gains a `health_actions: HealthActionMap` keyed on the 10
shipped `HealthCheck::name` strings. Each maps to `Resume`, `Escalate`,
or `Ignore`. Defaults match the RFC §6 table:

| Health check          | Default action                        |
|-----------------------|---------------------------------------|
| Stalled               | Resume                                |
| Loop detected         | Escalate (NeedsHuman)                 |
| Repetition            | Escalate (NeedsHuman)                 |
| Error acceleration    | Resume                                |
| Cost spike            | Ignore                                |
| Context saturation    | Ignore (handled by existing pipeline) |
| Cognitive decay       | Ignore                                |

## Reconciler additions

`ObservedSession` gains `health_alerts: Vec<String>`. The reconciler's
Running-task pass now joins each task to its observable session via
`tasks::latest_session_id` and emits:

- `Action::Resume` when status == Dead (RFC §7's canonical trigger)
- `Action::Resume` / `Action::EscalateHuman` per `HealthAction` for
  the first non-Ignore alert

`Unknown` status remains the no-actuation backstop — health alerts on
an Unknown session are ignored. First actionable alert wins so we
don't emit conflicting Resume + Escalate for the same task on the
same tick.

## Actuator additions

`Action::Resume`:
- Only valid from Running. Stale tick → no-op.
- Attempts cap → NeedsHuman (cause: `resume-cap`).
- Otherwise Running → Resuming → Pending. Re-entering via Pending lets
  the reconciler pick the same assignment lane the original task did
  (mailbox-first for tasks with a role, spawn for roleless).

`Action::EscalateHuman`:
- Idempotent against already-terminal tasks.
- Transitions to NeedsHuman with the caller's cause string.

## Verification

cargo check / clippy / fmt — both feature sets, all green.
770 binary lib tests pass (up from 756 in PR5). Six new reconciler
tests cover the resume / health-trigger paths:

- Dead session → Resume(session_died)
- Unknown session → no actuation (Stalled alert ignored)
- Stalled alert → Resume(health:Stalled)
- Loop detected alert → EscalateHuman(health:Loop detected)
- Cost spike alert (Ignore) → no actuation
- Mixed alerts → first actionable wins (Resume from Stalled, not
  Escalate from later Loop)

## Out of scope

- Wiring the live JSONL/ps Sensors to populate `health_alerts` — the
  type surface is here; the runtime adapter lands with PR7's
  supervisor CLI and exporter.
- Cost-spike policy plane (Ignore today). Lands when per-task budget
  enforcement gets its own knobs.
- Autopsy summary injected at resume time — the helper is wired and
  tested; threading it through the actuator's recovery-context build
  needs the live spawn path that's also PR7's pickup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mercurialsolo mercurialsolo merged commit 3aac80e into main Jun 10, 2026
8 checks passed
Development

Successfully merging this pull request may close these issues.

supervisor M5/PR6: resume protocol + health-check triggers
