Skip to content

feat(coord): supervisor CLI + v1 NDJSON events + Prometheus exporter + doctor row (#347)#356

Merged
mercurialsolo merged 2 commits into
mainfrom
feat/supervisor-pr7-cli-exporter
Jun 10, 2026
Merged

feat(coord): supervisor CLI + v1 NDJSON events + Prometheus exporter + doctor row (#347)#356
mercurialsolo merged 2 commits into
mainfrom
feat/supervisor-pr7-cli-exporter

Conversation

@mercurialsolo

Copy link
Copy Markdown
Owner

PR7 of the supervisor RFC (#342). Closes #347 (per issue title; numbering swap noted in earlier PRs).

Stacked on #355 — will retarget to main once #355 merges.

The user-facing surfaces: operators can see what the supervisor is doing; CI gates / dashboards consume a frozen event schema; Grafana scrapes a Prometheus exporter.

`claudectl supervisor` subcommand (RFC §10)

New top-level subcommand tree:

Verb Purpose
`run <tasks.toml>` Parse RFC §4 `[[task]]` blocks (incl. nested `[[task.verify]]`) and insert one row per task. `--dry-run` prints what would happen.
`submit` One-shot inline form: `--name --cwd --prompt --role ...`
`status [--state STATE]` Compact task table
`logs <task_id>` Task detail + full transition log
`cancel <task_id>` Idempotent move to CANCELLED
`drain` / `undrain` Sentinel file at `~/.claudectl/coord/drain`; surfaces in doctor row

Hand-rolled TOML reader limited to the subset RFC §4 declares (no new dependency); rejects unknown keys.

v1 NDJSON event schema (`src/coord/events.rs`)

Frozen contract. `{v: 1, type, at, ...payload}` envelope with three event families:

  • `task.transition` — state-machine move
  • `task.verification` — verifier verdict + cost
  • `task.escalated` — NeedsHuman move with reason + addressed_to

Additive-only forever: new event types are added; existing ones never rename fields.

Prometheus exporter (`src/coord/exporter.rs`)

Hand-rolled HTTP listener (no web framework dep) serving `/metrics` in Prometheus text format. Worker thread per scrape so the headless tick loop never blocks. Exposes:

  • `claudectl_tasks_by_state{state}` — gauge
  • `claudectl_fleet_cost_usd_total` — counter (attempt + verifier spend)
  • `claudectl_retries_total{cause}` — counter
  • `claudectl_verifier_pass_rate{kind}` — gauge [0.0, 1.0]

Label escaping handles `\`, `"`, `\n` per Prometheus spec.

Doctor row

`supervisor drain` — Pass when no drain marker, Advisory with the unstick hint when set.

Verification

```
cargo check / clippy / fmt — both feature sets, all green
cargo test --all-targets → 769 + 780 + 78 + 8 = 1635 pass
```

New tests cover:

  • TOML parse: full block with verifier list; multi-task file; rejects unknown keys
  • Event schema: round-trip per family; field shape verified by string inspection (so contract changes show up as test diffs)
  • Exporter: state bucketing, Prometheus format, label escaping, division-by-zero safety

Out of scope (next commit)

  • Wiring the exporter into `run_headless()` with the `--exporter :9464` CLI flag.
  • Emitting v1 events from the reconciler/actuator into the `--watch --json` stream.
  • `SupervisorPhase` in the init wizard.

These three together are about 100 lines of plumbing on top of what this PR ships; the surfaces here are reviewable standalone.

Test plan

  • After merge: `claudectl supervisor submit --name x --cwd $PWD --prompt do_x` → row in `tasks`; `claudectl supervisor status` lists it.
  • `claudectl supervisor logs <task_id>` shows the CREATED → PENDING transition.
  • `claudectl supervisor drain && claudectl doctor` shows the new Advisory row.
  • Write a 2-task `tasks.toml` with a verify block; `claudectl supervisor run tasks.toml --dry-run` lists both; without `--dry-run` actually inserts.

🤖 Generated with Claude Code

…+ doctor row (#347)

PR7 of the supervisor RFC (#342). The user-facing surfaces: operators
can see what the supervisor is doing; CI gates / dashboards consume a
frozen event schema; Grafana scrapes a Prometheus exporter.

## `claudectl supervisor` subcommand (RFC §10)

New top-level subcommand tree (`src/coord/supervisor_cli.rs`):

- `run <tasks.toml> [--dry-run]` — the `--run` alias. Parses RFC §4's
  `[[task]]` blocks (including nested `[[task.verify]]`) and inserts
  one row per task. Dry-run prints what would happen.
- `submit --name --cwd --prompt [--role ...]` — one-shot inline form
  for scripts that don't want a TOML file.
- `status [--state STATE]` — compact task table.
- `logs <task_id>` — task detail + full transition log.
- `cancel <task_id>` — idempotent move to CANCELLED.
- `drain` / `undrain` — sentinel file at
  `~/.claudectl/coord/drain`. Surfaces in the doctor row; PR8 wires the
  reconciler to honor it.

Hand-rolled TOML reader (no new dependency) limited to the
`[[task]]` + `[[task.verify]]` subset the RFC declares; rejects
unknown keys so typos surface fast.

## v1 NDJSON event schema (`src/coord/events.rs`, RFC §10)

Frozen contract. `{v: 1, type, at, ...payload}` envelope; three event
families ship in this PR:

- `task.transition` — state-machine move (from / to / cause).
- `task.verification` — verifier verdict (kind / verdict / cost_usd).
- `task.escalated` — NeedsHuman move with reason + addressed_to.

Additive-only forever: new event types are added; existing ones never
rename fields. The CI gates and Slack bots that build on this need
that guarantee.

## Prometheus exporter (`src/coord/exporter.rs`)

Hand-rolled HTTP listener (no web framework dep) that serves
`/metrics` in Prometheus text format. Worker thread per scrape so the
headless tick loop never blocks. Exposes:

- `claudectl_tasks_by_state{state}` — gauge per state.
- `claudectl_fleet_cost_usd_total` — counter, attempt + verifier
  spend.
- `claudectl_retries_total{cause}` — counter, RETRYING/RESUMING
  transitions.
- `claudectl_verifier_pass_rate{kind}` — gauge [0.0, 1.0] per
  verifier kind.

Label escaping handles `\`, `"`, `\n` per Prometheus spec. The
exporter binds non-blocking with a 100ms poll so shutdown is prompt.

## Doctor row

`supervisor drain` — Pass when no drain marker, Advisory with the
unstick hint when set. Fits next to the other coord/* checks.

## Verification

cargo check / clippy / fmt — both feature sets, all green.
780 binary lib tests pass (up from 770 in PR6). New tests:
- TOML parse: full block with verifier list; multi-task file; rejects
  unknown keys.
- Event schema: round-trip per family; field shape verified by string
  inspection so contract changes show up as test diffs.
- Exporter: state bucketing, Prometheus format, label escaping,
  division-by-zero safety.

## Out of scope

- Wiring the exporter into `run_headless()` — pairs with the
  `--exporter :9464` CLI flag and a graceful-shutdown story; lands as
  a follow-up commit so the surface here can be reviewed standalone.
- Emitting the v1 events from the reconciler/actuator into a NDJSON
  output stream — needs the `--watch --json` glue; today the schema is
  the contract and the actuator already writes the underlying rows.
- New `SupervisorPhase` in the init wizard — the doctor row + the new
  CLI surface cover the discoverability gap for now; the phase lands
  with the exporter wire-in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mercurialsolo mercurialsolo force-pushed the feat/supervisor-pr7-cli-exporter branch from 9b0f78c to d69a143 Compare June 10, 2026 18:10
@mercurialsolo mercurialsolo changed the base branch from feat/supervisor-pr6-resume-health to main June 10, 2026 18:10
@mercurialsolo mercurialsolo merged commit a3fbde5 into main Jun 10, 2026
8 checks passed
mercurialsolo added a commit that referenced this pull request Jun 10, 2026
The supervisor RFC (#342) shipped across PRs #350#356 but the
user-facing docs were never updated. A reader of the README right now
sees "the supervisor for long-horizon role persistence" listed as
"not yet built", which is exactly backwards.

## Changes

- **README.md** — Fix the stale "not yet built" claim (flow guards and
  the supervisor are both shipped now) and add a new "## Supervisor"
  section between "Agent Bus" and "Hive Mind". Covers: the design
  argument (parallel runners commoditized, durable supervision is the
  unowned layer); the CLI surface with inline + TOML examples; the
  three-verifier (`run` / `brain` / `agent`) `tasks.toml` shape from
  the RFC; the three load-bearing properties (cattle vs. pets,
  crash-safe from coord.db, fail-closed verifiers); the Prometheus
  exporter metric names.
- **docs/AGENT_BUS.md** — Implementation status table. Phase 6
  (content validation) and Phase 8 (flow guards) marked Shipped with
  the PR references. Phase 10 (Supervisor) now Shipped, with the
  module list pointing at every coord/* file the work landed in.
  Phase 9 (managed-artifact lifecycle) corrected from Not Started to
  Partial since `init --upgrade` covers it.
- **docs/reference.md** — New "Supervisor" command table (run /
  submit / status / logs / cancel / drain / undrain) and a new
  "Ingest" row for the hook signal path. Schema-version-gate
  behavior called out so readers don't hit the refusal in the wild
  with no context. Also drops the "Requires --features coord" note
  since coord is now in the default feature set.
- **docs/quickstart.md** — New "Submit a task to the supervisor"
  optional section, parked next to the existing brain / insights
  optional sections. Points at the README for the full design.
- **CLAUDE.md** — Architecture section gets a new Supervisor block
  listing every file in `src/coord/` and `src/ingest.rs`. The
  schema-version gate (`EXPECTED_COORD_SCHEMA_VERSION = 3`) is
  documented so a future reader knows where to look when migration
  drifts.

## What's deliberately out of scope

- `docs/configuration.md` doesn't get a supervisor section yet —
  `~/.claudectl/coord/policy.toml` isn't being loaded by the
  reconciler yet (the type lives in `supervisor.rs::Policy` with
  defaults). When config loading lands, the TOML schema lands with it.
- `docs/AGENT_BUS.md` §13 (Longevity / supervisor design discussion)
  isn't reworked. That whole section is the RFC narrative; the
  status table at the top is what readers check for "is this real
  yet" and that's now accurate.

No code changes. Existing tests still pass; `cargo fmt --check` clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mercurialsolo added a commit that referenced this pull request Jun 10, 2026
…or (#357)

The supervisor RFC (#342) shipped across PRs #350#356 but the
user-facing docs were never updated. A reader of the README right now
sees "the supervisor for long-horizon role persistence" listed as
"not yet built", which is exactly backwards.

## Changes

- **README.md** — Fix the stale "not yet built" claim (flow guards and
  the supervisor are both shipped now) and add a new "## Supervisor"
  section between "Agent Bus" and "Hive Mind". Covers: the design
  argument (parallel runners commoditized, durable supervision is the
  unowned layer); the CLI surface with inline + TOML examples; the
  three-verifier (`run` / `brain` / `agent`) `tasks.toml` shape from
  the RFC; the three load-bearing properties (cattle vs. pets,
  crash-safe from coord.db, fail-closed verifiers); the Prometheus
  exporter metric names.
- **docs/AGENT_BUS.md** — Implementation status table. Phase 6
  (content validation) and Phase 8 (flow guards) marked Shipped with
  the PR references. Phase 10 (Supervisor) now Shipped, with the
  module list pointing at every coord/* file the work landed in.
  Phase 9 (managed-artifact lifecycle) corrected from Not Started to
  Partial since `init --upgrade` covers it.
- **docs/reference.md** — New "Supervisor" command table (run /
  submit / status / logs / cancel / drain / undrain) and a new
  "Ingest" row for the hook signal path. Schema-version-gate
  behavior called out so readers don't hit the refusal in the wild
  with no context. Also drops the "Requires --features coord" note
  since coord is now in the default feature set.
- **docs/quickstart.md** — New "Submit a task to the supervisor"
  optional section, parked next to the existing brain / insights
  optional sections. Points at the README for the full design.
- **CLAUDE.md** — Architecture section gets a new Supervisor block
  listing every file in `src/coord/` and `src/ingest.rs`. The
  schema-version gate (`EXPECTED_COORD_SCHEMA_VERSION = 3`) is
  documented so a future reader knows where to look when migration
  drifts.

## What's deliberately out of scope

- `docs/configuration.md` doesn't get a supervisor section yet —
  `~/.claudectl/coord/policy.toml` isn't being loaded by the
  reconciler yet (the type lives in `supervisor.rs::Policy` with
  defaults). When config loading lands, the TOML schema lands with it.
- `docs/AGENT_BUS.md` §13 (Longevity / supervisor design discussion)
  isn't reworked. That whole section is the RFC narrative; the
  status table at the top is what readers check for "is this real
  yet" and that's now accurate.

No code changes. Existing tests still pass; `cargo fmt --check` clean.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

supervisor M6/PR7: contracts, exporter, doctor, init phase, --run alias

1 participant