Skip to content

supervisor M2/PR3: coord task schema + reconciler skeleton #346

@mercurialsolo

Description

@mercurialsolo

Part of #342. Depends on #344.

Scope

Schema (src/coord/store.rs::migrate())

Add four tables. Migration runs free via existing upgrade_db_migrations() step in init --upgrade.

CREATE TABLE tasks (
  id TEXT PRIMARY KEY,
  name TEXT NOT NULL,
  state TEXT NOT NULL,        -- PENDING/READY/ASSIGNED/RUNNING/VERIFYING/DONE/RETRYING/RESUMING/NEEDS_HUMAN/CANCELLED
  role TEXT,                  -- target role (mailbox-first)
  cwd TEXT NOT NULL,
  prompt TEXT NOT NULL,
  model TEXT,
  budget_usd REAL,
  max_retries INTEGER DEFAULT 2,
  timeout_min INTEGER DEFAULT 45,
  depends_on TEXT,            -- JSON array of task ids
  policy TEXT,                -- JSON: per-task overrides (force_manual, etc.)
  created_at INTEGER NOT NULL,
  updated_at INTEGER NOT NULL
);

CREATE TABLE task_attempts (
  id TEXT PRIMARY KEY,
  task_id TEXT NOT NULL REFERENCES tasks(id),
  attempt_num INTEGER NOT NULL,
  session_id TEXT,            -- live session (or null while ASSIGNED)
  bus_message_id TEXT,        -- mailbox message that assigned this attempt
  cwd_hash TEXT,              -- tree-state hash at spawn/assign
  started_at INTEGER NOT NULL,
  ended_at INTEGER,
  cost_usd REAL DEFAULT 0,
  outcome TEXT                -- success/verify_fail/timeout/budget/session_died/cancelled
);

CREATE TABLE task_verifications (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  attempt_id TEXT NOT NULL REFERENCES task_attempts(id),
  kind TEXT NOT NULL,         -- run/brain/agent
  command TEXT NOT NULL,      -- shell cmd, brain prompt, or agent prompt
  verdict TEXT NOT NULL,      -- PASS/FAIL
  output TEXT,                -- truncated
  cost_usd REAL DEFAULT 0,
  ran_at INTEGER NOT NULL
);

CREATE TABLE task_transitions (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  task_id TEXT NOT NULL REFERENCES tasks(id),
  from_state TEXT NOT NULL,
  to_state TEXT NOT NULL,
  cause TEXT NOT NULL,        -- assigned/spawned/claim_timeout/verify_pass/verify_fail/timeout/budget/health_<check>/manual/...
  at INTEGER NOT NULL
);

Reconciler (src/coord/supervisor.rs)

pub struct Supervisor { tick_ms: u64, policy: Policy }
impl Supervisor {
    pub fn tick(&mut self, coord: &mut CoordStore, sensors: &dyn Sensors) -> Vec<Action>;
}
pub enum Action { AssignMailbox(TaskId, Role), Spawn(TaskId, Cwd), Retry(TaskId), MarkDone(TaskId), EscalateHuman(TaskId, String), ... }

Reconciler is pure — reads desired state from coord, observed from sensors, returns actions. A separate Actuator performs them. Makes testing trivial.

Wiring

One call site: run_headless() at src/commands.rs:962, behind #[cfg(feature = "coord")]. No new process model.

Crash-safety test (load-bearing)

This is THE claim of the whole RFC — write the test before adding features on top.

  • Start headless daemon, submit 3 tasks, wait until one is RUNNING.
  • kill -9 the daemon.
  • Restart. Within 3 ticks, observed state matches pre-kill state. Same task counts in each bucket. No duplicate spawns. No lost attempts.

Out of scope (deferred to later PRs)

  • Verifier execution (PR5).
  • Bus-mailbox assignment (PR4).
  • Resume context via autopsy (PR6).
  • TUI surface for tasks.

Metadata

Metadata

Assignees

No one assigned

    Labels

    P1High priorityarchitectureCore architecture decisionsenhancementNew feature or request

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions