feat(coord): supervisor CLI + v1 NDJSON events + Prometheus exporter + doctor row (#347) #356
Merged
…+ doctor row (#347)

PR7 of the supervisor RFC (#342). The user-facing surfaces: operators can see what the supervisor is doing; CI gates / dashboards consume a frozen event schema; Grafana scrapes a Prometheus exporter.

## `claudectl supervisor` subcommand (RFC §10)

New top-level subcommand tree (`src/coord/supervisor_cli.rs`):

- `run <tasks.toml> [--dry-run]`: the `--run` alias. Parses RFC §4's `[[task]]` blocks (including nested `[[task.verify]]`) and inserts one row per task. Dry-run prints what would happen.
- `submit --name --cwd --prompt [--role ...]`: one-shot inline form for scripts that don't want a TOML file.
- `status [--state STATE]`: compact task table.
- `logs <task_id>`: task detail + full transition log.
- `cancel <task_id>`: idempotent move to CANCELLED.
- `drain` / `undrain`: sentinel file at `~/.claudectl/coord/drain`. Surfaces in the doctor row; PR8 wires the reconciler to honor it.

Hand-rolled TOML reader (no new dependency) limited to the `[[task]]` + `[[task.verify]]` subset the RFC declares; rejects unknown keys so typos surface fast.

## v1 NDJSON event schema (`src/coord/events.rs`, RFC §10)

Frozen contract. `{v: 1, type, at, ...payload}` envelope; three event families ship in this PR:

- `task.transition`: state-machine move (from / to / cause).
- `task.verification`: verifier verdict (kind / verdict / cost_usd).
- `task.escalated`: NeedsHuman move with reason + addressed_to.

Additive-only forever: new event types are added; existing ones never rename fields. The CI gates and Slack bots that build on this need that guarantee.

## Prometheus exporter (`src/coord/exporter.rs`)

Hand-rolled HTTP listener (no web framework dep) that serves `/metrics` in Prometheus text format. Worker thread per scrape so the headless tick loop never blocks. Exposes:

- `claudectl_tasks_by_state{state}`: gauge per state.
- `claudectl_fleet_cost_usd_total`: counter, attempt + verifier spend.
- `claudectl_retries_total{cause}`: counter, RETRYING/RESUMING transitions.
- `claudectl_verifier_pass_rate{kind}`: gauge [0.0, 1.0] per verifier kind.

Label escaping handles `\`, `"`, `\n` per Prometheus spec. The exporter binds non-blocking with a 100ms poll so shutdown is prompt.

## Doctor row

`supervisor drain`: Pass when no drain marker, Advisory with the unstick hint when set. Fits next to the other coord/* checks.

## Verification

cargo check / clippy / fmt: both feature sets, all green. 780 binary lib tests pass (up from 770 in PR6). New tests:

- TOML parse: full block with verifier list; multi-task file; rejects unknown keys.
- Event schema: round-trip per family; field shape verified by string inspection so contract changes show up as test diffs.
- Exporter: state bucketing, Prometheus format, label escaping, division-by-zero safety.

## Out of scope

- Wiring the exporter into `run_headless()`: pairs with the `--exporter :9464` CLI flag and a graceful-shutdown story; lands as a follow-up commit so the surface here can be reviewed standalone.
- Emitting the v1 events from the reconciler/actuator into a NDJSON output stream: needs the `--watch --json` glue; today the schema is the contract and the actuator already writes the underlying rows.
- New `SupervisorPhase` in the init wizard: the doctor row + the new CLI surface cover the discoverability gap for now; the phase lands with the exporter wire-in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
9b0f78c to d69a143
mercurialsolo added a commit that referenced this pull request Jun 10, 2026
The supervisor RFC (#342) shipped across PRs #350–#356 but the user-facing docs were never updated. A reader of the README right now sees "the supervisor for long-horizon role persistence" listed as "not yet built", which is exactly backwards.

## Changes

- **README.md**: Fix the stale "not yet built" claim (flow guards and the supervisor are both shipped now) and add a new "## Supervisor" section between "Agent Bus" and "Hive Mind". Covers: the design argument (parallel runners commoditized, durable supervision is the unowned layer); the CLI surface with inline + TOML examples; the three-verifier (`run` / `brain` / `agent`) `tasks.toml` shape from the RFC; the three load-bearing properties (cattle vs. pets, crash-safe from coord.db, fail-closed verifiers); the Prometheus exporter metric names.
- **docs/AGENT_BUS.md**: Implementation status table. Phase 6 (content validation) and Phase 8 (flow guards) marked Shipped with the PR references. Phase 10 (Supervisor) now Shipped, with the module list pointing at every coord/* file the work landed in. Phase 9 (managed-artifact lifecycle) corrected from Not Started to Partial since `init --upgrade` covers it.
- **docs/reference.md**: New "Supervisor" command table (run / submit / status / logs / cancel / drain / undrain) and a new "Ingest" row for the hook signal path. Schema-version-gate behavior called out so readers don't hit the refusal in the wild with no context. Also drops the "Requires --features coord" note since coord is now in the default feature set.
- **docs/quickstart.md**: New "Submit a task to the supervisor" optional section, parked next to the existing brain / insights optional sections. Points at the README for the full design.
- **CLAUDE.md**: Architecture section gets a new Supervisor block listing every file in `src/coord/` and `src/ingest.rs`. The schema-version gate (`EXPECTED_COORD_SCHEMA_VERSION = 3`) is documented so a future reader knows where to look when migration drifts.

## What's deliberately out of scope

- `docs/configuration.md` doesn't get a supervisor section yet; `~/.claudectl/coord/policy.toml` isn't being loaded by the reconciler yet (the type lives in `supervisor.rs::Policy` with defaults). When config loading lands, the TOML schema lands with it.
- `docs/AGENT_BUS.md` §13 (Longevity / supervisor design discussion) isn't reworked. That whole section is the RFC narrative; the status table at the top is what readers check for "is this real yet" and that's now accurate.

No code changes. Existing tests still pass; `cargo fmt --check` clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mercurialsolo added a commit that referenced this pull request Jun 10, 2026
…or (#357)
PR7 of the supervisor RFC (#342). Closes #347 (per issue title; numbering swap noted in earlier PRs).
Stacked on #355 — will retarget to main once #355 merges.
The user-facing surfaces: operators can see what the supervisor is doing; CI gates / dashboards consume a frozen event schema; Grafana scrapes a Prometheus exporter.
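## `claudectl supervisor` subcommand (RFC §10)

New top-level subcommand tree:

- `run <tasks.toml> [--dry-run]`: the `--run` alias. Parses RFC §4's `[[task]]` blocks (including nested `[[task.verify]]`) and inserts one row per task. Dry-run prints what would happen.
- `submit --name --cwd --prompt [--role ...]`: one-shot inline form for scripts that don't want a TOML file.
- `status [--state STATE]`: compact task table.
- `logs <task_id>`: task detail + full transition log.
- `cancel <task_id>`: idempotent move to CANCELLED.
- `drain` / `undrain`: sentinel file at `~/.claudectl/coord/drain`. Surfaces in the doctor row; PR8 wires the reconciler to honor it.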
Hand-rolled TOML reader limited to the subset RFC §4 declares (no new dependency); rejects unknown keys.
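
The unknown-key rejection can be pictured with a minimal sketch. This is not the code in `src/coord/supervisor_cli.rs`: the function name `check_keys` and the allowed key sets are assumptions made for illustration (the task keys mirror the `submit` flags; the verify keys are hypothetical), and values are not parsed at all.

```rust
// Allowed keys are assumptions for illustration; the real set is whatever RFC §4 declares.
const TASK_KEYS: &[&str] = &["name", "cwd", "prompt", "role"];
const VERIFY_KEYS: &[&str] = &["kind", "cmd"];

#[derive(Debug, PartialEq)]
enum Section {
    None,
    Task,
    Verify,
}

/// Walks a tasks file line by line and returns the number of `[[task]]`
/// blocks, or an error naming the first offending line.
fn check_keys(src: &str) -> Result<usize, String> {
    let mut section = Section::None;
    let mut tasks = 0;
    for (idx, raw) in src.lines().enumerate() {
        let line = raw.trim();
        let lineno = idx + 1;
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        match line {
            "[[task]]" => {
                section = Section::Task;
                tasks += 1;
                continue;
            }
            "[[task.verify]]" => {
                if section == Section::None {
                    return Err(format!("line {lineno}: [[task.verify]] before any [[task]]"));
                }
                section = Section::Verify;
                continue;
            }
            _ => {}
        }
        // Any other table header is outside the supported subset.
        if line.starts_with('[') {
            return Err(format!("line {lineno}: unsupported table {line}"));
        }
        let Some((key, _value)) = line.split_once('=') else {
            return Err(format!("line {lineno}: expected key = value"));
        };
        let key = key.trim();
        let allowed = match section {
            Section::Task => TASK_KEYS,
            Section::Verify => VERIFY_KEYS,
            Section::None => return Err(format!("line {lineno}: key `{key}` outside any table")),
        };
        if !allowed.contains(&key) {
            return Err(format!("line {lineno}: unknown key `{key}`"));
        }
    }
    Ok(tasks)
}
```

With this shape, a misspelled key such as `nmae` fails with its line number instead of being silently ignored, which is the behavior the description asks for.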
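## v1 NDJSON event schema (`src/coord/events.rs`)

Frozen contract. `{v: 1, type, at, ...payload}` envelope with three event families:

- `task.transition`: state-machine move (from / to / cause).
- `task.verification`: verifier verdict (kind / verdict / cost_usd).
- `task.escalated`: NeedsHuman move with reason + addressed_to.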
Additive-only forever: new event types are added; existing ones never rename fields.
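
As a rough illustration of the envelope, the sketch below renders one `task.transition` event as a single NDJSON line using only the standard library. The struct and function names are hypothetical, the `at` value is assumed to be an RFC 3339 string, and the real payload in `src/coord/events.rs` may carry more fields than from / to / cause.

```rust
/// Minimal JSON string escaping: quotes, backslashes, and control characters.
fn json_str(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

/// Hypothetical in-memory form of a `task.transition` event.
struct Transition<'a> {
    at: &'a str, // assumed RFC 3339 timestamp
    from: &'a str,
    to: &'a str,
    cause: &'a str,
}

/// Renders the event as one line with the `{v, type, at, ...payload}` envelope.
/// Escaping newlines inside strings keeps each event on exactly one line.
fn transition_line(e: &Transition<'_>) -> String {
    format!(
        "{{\"v\":1,\"type\":\"task.transition\",\"at\":{},\"from\":{},\"to\":{},\"cause\":{}}}",
        json_str(e.at),
        json_str(e.from),
        json_str(e.to),
        json_str(e.cause)
    )
}
```

Under the additive-only rule, a consumer written against this line keeps working if a later version appends new keys, because none of the existing keys are renamed or removed.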
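## Prometheus exporter (`src/coord/exporter.rs`)

Hand-rolled HTTP listener (no web framework dep) serving `/metrics` in Prometheus text format. Worker thread per scrape so the headless tick loop never blocks. Exposes:

- `claudectl_tasks_by_state{state}`: gauge per state.
- `claudectl_fleet_cost_usd_total`: counter, attempt + verifier spend.
- `claudectl_retries_total{cause}`: counter, RETRYING/RESUMING transitions.
- `claudectl_verifier_pass_rate{kind}`: gauge [0.0, 1.0] per verifier kind.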
Label escaping handles `\`, `"`, `\n` per Prometheus spec.
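
A minimal sketch of the label escaping and of one gauge rendered in the text exposition format follows. The function names are hypothetical and this is not the exporter's actual code; in particular, reporting `0.0` for a verifier kind with no verdicts yet is an assumption about how the division-by-zero case is handled.

```rust
/// Escapes a label value per the Prometheus text format:
/// backslash, double quote, and line feed are the three escaped characters.
fn escape_label_value(v: &str) -> String {
    let mut out = String::with_capacity(v.len());
    for c in v.chars() {
        match c {
            '\\' => out.push_str("\\\\"),
            '"' => out.push_str("\\\""),
            '\n' => out.push_str("\\n"),
            c => out.push(c),
        }
    }
    out
}

/// Renders `claudectl_tasks_by_state` as a gauge, one sample per state.
fn render_tasks_by_state(counts: &[(&str, u64)]) -> String {
    let mut out = String::from("# TYPE claudectl_tasks_by_state gauge\n");
    for (state, n) in counts {
        out.push_str(&format!(
            "claudectl_tasks_by_state{{state=\"{}\"}} {}\n",
            escape_label_value(state),
            n
        ));
    }
    out
}

/// Pass rate in [0.0, 1.0]; returns 0.0 when there are no verdicts
/// (an assumed convention) so the gauge never becomes NaN.
fn pass_rate(passed: u64, total: u64) -> f64 {
    if total == 0 {
        0.0
    } else {
        passed as f64 / total as f64
    }
}
```

Guarding the zero-total case matters because `0.0 / 0.0` is NaN in floating point, and a NaN sample would break any dashboard threshold built on the gauge.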
## Doctor row
`supervisor drain` — Pass when no drain marker, Advisory with the unstick hint when set.
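
The doctor check reduces to testing whether the sentinel file exists. A minimal sketch, assuming the marker is a file named `drain` inside the coord directory (`~/.claudectl/coord/` in this PR); the enum, the function name, and the hint wording are hypothetical.

```rust
use std::path::Path;

/// Hypothetical result type for a single doctor row.
#[derive(Debug, PartialEq)]
enum DoctorStatus {
    Pass,
    Advisory(String),
}

/// Pass when no drain marker exists under `coord_dir`,
/// Advisory with an unstick hint when it does.
fn supervisor_drain_row(coord_dir: &Path) -> DoctorStatus {
    if coord_dir.join("drain").exists() {
        DoctorStatus::Advisory(
            "drain marker set; run `claudectl supervisor undrain` to resume".to_string(),
        )
    } else {
        DoctorStatus::Pass
    }
}
```

Taking the directory as a parameter rather than resolving the home directory inside the function keeps the check easy to exercise against a temporary directory in tests.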
## Verification
```
cargo check / clippy / fmt — both feature sets, all green
cargo test --all-targets → 769 + 780 + 78 + 8 = 1635 pass
```
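New tests cover:

- TOML parse: full block with verifier list; multi-task file; rejects unknown keys.
- Event schema: round-trip per family; field shape verified by string inspection so contract changes show up as test diffs.
- Exporter: state bucketing, Prometheus format, label escaping, division-by-zero safety.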
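## Out of scope (next commit)

- Wiring the exporter into `run_headless()`: pairs with the `--exporter :9464` CLI flag and a graceful-shutdown story.
- Emitting the v1 events from the reconciler/actuator into a NDJSON output stream: needs the `--watch --json` glue; today the schema is the contract and the actuator already writes the underlying rows.
- New `SupervisorPhase` in the init wizard: the doctor row + the new CLI surface cover the discoverability gap for now; the phase lands with the exporter wire-in.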
These three together are about 100 lines of plumbing on top of what this PR ships; the surfaces here are reviewable standalone.
## Test plan
🤖 Generated with Claude Code