feat(coord): resume protocol + health-check transition triggers (#349)#355
Merged
Conversation
PR6 of the supervisor RFC (#342). The supervisor now reacts to session death and the 10 shipped health checks instead of letting them sit as dashboard advisories. ## Resume protocol (`src/coord/resume.rs`, RFC §7) Three contracts: 1. **Tree-state hash** — snapshotted at every spawn/assign, compared at resume. Mismatch = external edits between death and resume; the supervisor escalates to NeedsHuman instead of resuming blindly. Git-based when available, mtime-fallback otherwise; both produce stable strings any consumer can compare for equality. Fails open to `TreeStateHash::empty()` so resume keeps working when git or fs reads fail. 2. **Resume the task, not the session** — no dependence on Claude Code's session-resume internals. `build_recovery_prompt` composes: original prompt → resume framing → drift warning (if any) → prior verifier failures (so the next attempt doesn't repeat them) → autopsy summary. 3. **Bounded retries** — attempts cap from PR4 still applies. Resume bumps attempt; over-cap → NeedsHuman with `resume-cap` cause. `summarize_autopsy` lifts the relevant fields out of `brain::autopsy::AutopsyReport` so this module owns the "what does resume need" view. ## Health-check transition triggers (RFC §6) `Policy` gains a `health_actions: HealthActionMap` keyed on the 10 shipped `HealthCheck::name` strings. Each maps to `Resume`, `Escalate`, or `Ignore`. Defaults match the RFC §6 table: | Health check | Default action | |-----------------------|----------------------| | Stalled | Resume | | Loop detected | Escalate (NeedsHuman)| | Repetition | Escalate (NeedsHuman)| | Error acceleration | Resume | | Cost spike | Ignore | | Context saturation | Ignore (handled by | | | existing pipeline) | | Cognitive decay | Ignore | ## Reconciler additions `ObservedSession` gains `health_alerts: Vec<String>`. The reconciler's Running-task pass now joins each task to its observable session via `tasks::latest_session_id`, emits: - `Action::Resume` when status == Dead (RFC §7's canonical trigger) - `Action::Resume` / `Action::EscalateHuman` per `HealthAction` for the first non-Ignore alert `Unknown` status remains the no-actuation backstop — health alerts on an Unknown session are ignored. First actionable alert wins so we don't emit conflicting Resume + Escalate for the same task on the same tick. ## Actuator additions `Action::Resume`: - Only valid from Running. Stale tick → no-op. - Attempts cap → NeedsHuman (cause: `resume-cap`). - Otherwise Running → Resuming → Pending. Re-entering via Pending lets the reconciler pick the same assignment lane the original task did (mailbox-first for tasks with a role, spawn for roleless). `Action::EscalateHuman`: - Idempotent against already-terminal tasks. - Transitions to NeedsHuman with the caller's cause string. ## Verification cargo check / clippy / fmt — both feature sets, all green. 770 binary lib tests pass (up from 756 in PR5). Six new reconciler tests cover the resume / health-trigger paths: - Dead session → Resume(session_died) - Unknown session → no actuation (Stalled alert ignored) - Stalled alert → Resume(health:Stalled) - Loop detected alert → EscalateHuman(health:Loop detected) - Cost spike alert (Ignore) → no actuation - Mixed alerts → first actionable wins (Resume from Stalled, not Escalate from later Loop) ## Out of scope - Wiring the live JSONL/ps Sensors to populate `health_alerts` — the type surface is here; the runtime adapter lands with PR7's supervisor CLI and exporter. - Cost-spike policy plane (Ignore today). Lands when per-task budget enforcement gets its own knobs. - Autopsy summary injected at resume time — the helper is wired and tested; threading it through the actuator's recovery-context build needs the live spawn path that's also PR7's pickup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Merged
4 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
PR6 of the supervisor RFC (#342). Closes #349.
The supervisor now reacts to session death and the 10 shipped health checks instead of letting them sit as dashboard advisories.
Resume protocol (`src/coord/resume.rs`, RFC §7)
Three contracts:
Health-check transition triggers (RFC §6)
`Policy` gains a `health_actions: HealthActionMap` keyed on the 10 shipped `HealthCheck::name` strings. Each maps to `Resume`, `Escalate`, or `Ignore`. Defaults match the RFC §6 table:
Reconciler additions
`ObservedSession` gains `health_alerts: Vec`. The Running-task pass joins each task to its observable session via `tasks::latest_session_id` and emits:
`Unknown` status remains the no-actuation backstop — health alerts on an Unknown session are ignored. First actionable alert wins so we don't emit conflicting Resume + Escalate for the same task on the same tick.
Actuator additions
`Action::Resume`: Only valid from Running. Attempts cap → NeedsHuman (`resume-cap`). Otherwise Running → Resuming → Pending. Re-entering via Pending lets the reconciler pick the same assignment lane the original task did.
`Action::EscalateHuman`: Idempotent against already-terminal tasks. Transitions to NeedsHuman with the caller's cause string.
Verification
```
cargo check / clippy / fmt — both feature sets, all green
cargo test --all-targets → 759 + 770 + 78 + 8 = 1615 pass
```
Six new reconciler tests cover the full path:
Out of scope
Test plan
🤖 Generated with Claude Code