diff --git a/docs/self-evolution-upgrade-plan.md b/docs/self-evolution-upgrade-plan.md new file mode 100644 index 00000000..3c0ae1cb --- /dev/null +++ b/docs/self-evolution-upgrade-plan.md @@ -0,0 +1,74 @@ +# Plan: Hermes-style skill self-evolution — full upgrade + +> Status: **in progress** on branch `feat/skill-evolution-upgrade` (worktree-isolated). +> Builds on the existing `evolution/` package (ported from NousResearch/hermes-agent-self-evolution) +> and the `docs/self-learner.md` Stop-hook loop. The loop already exists; this makes it +> real, well-judged, and wired into `core` — default-OFF, propose-only. + +## Why this work + +The Hermes-ported evolver is complete but **dormant**: default-OFF, pilot-wired into +`skill-writer` only, never run (0 `skill_gap` events on the dev machine), and carrying two +documented quality caveats (keyword-overlap metric by default; eval scores a synthetic proxy, +not real Claude Code behaviour). The **automated** path uses the `single-shot` optimizer +(`auto_evolve.py` → `evolve(optimizer="single-shot")`), *not* GEPA — so a one-shot rewrite + +one text-diff judge is all that runs today. + +## Resolved decisions + +- **D1 — Writer-critic loop, default 2 rounds** (`CUE_WRITER_LOOP_ROUNDS`). Retry on `WORSE` + always; on `EQUAL` only when the critic returned actionable fixes and rounds remain. + `propose_improved_body`/`judge_is_better` stay as back-compat wrappers (hooks import by name). +- **D2 — Task-grounded critic (DSPy-free):** critic runs ONE mined task through the candidate + skill via `run_claude_p`, then judges the real transcript with the existing `run_claude_p` + judge prompt. Soft-falls-back to text-diff review. 1 subagent call/round — not the + 20×/iteration cost bomb. +- **D3 — Judge defaults (GEPA/holdout path):** acceptance/holdout metric default `overlap → judge` + (LLMJudge); GEPA *inner* metric stays `overlap` (cost), `judge` opt-in; new `--metric subagent` + is holdout-only, cost-flagged, soft-fallback. `--eval-source` default `synthetic → auto`. +- **D4 — Activation:** wire `profile-self-improve.json` + `auto-evolve.json` + `learnings-surface.json` + into `core`, default-OFF behind the flag files; fix the `CUE_EVOLUTION_DIR` portability gap. + Enabling on a machine is a separate explicit step, propose-only. +- **D5 — propose-only everywhere.** No auto-apply enabled by this work. + +## Stages (each independently verifiable + revertable) + +| Stage | Work | Verify | +|---|---|---| +| 0 | Worktree + baseline | tests green (78p/2s) + `auto_evolve --dry-run` ✅ | +| 1 | Writer-critic loop (`reflective.py`, `evolve_skill.py`, `config.py`) | retry-logic unit test; `evolve --propose-only` → lint-passing proposal logged `optimizer:"writer-loop"` | +| 2a | **Task-grounded critic** (`reflective.py`, `evolve_skill.py`) — critic runs the candidate on a real mined task (`run_claude_p`) and judges the transcript | ✅ done: grounding + mining unit tests; loop feeds the critic a real transcript | +| 2b/2c | GEPA `judge` default + `SubagentJudgeMetric` (holdout) | **deferred** — see below | +| 3 | Activate into `core` (`profiles/core/profile.yaml`, `auto-evolve.sh`) | `cue validate` clean; materialized `settings.json` shows both Stop hooks; flags-OFF = no-op | +| 4 | Review + ship | no CRITICAL/HIGH; full suite green; gated PR | + +## Deferred: Stage 2b/2c (GEPA judge default + subagent holdout metric) + +Cut from this pass on purpose — they only touch the **manual GEPA** path (the +automated Stop-hook loop runs `single-shot`, never GEPA), and they **cannot be +live-verified in this environment** (`dspy` import is broken). Spirit of "default +to the LLM judge on real behaviour" is already delivered for the path that runs +by Stage 2a. Ready-to-execute change-points when `dspy` works: + +- **2b — default the holdout/acceptance metric to `judge`.** Do NOT naively flip + `evolve_skill.py:~205` `metric_mode` default `overlap → judge`: that puts + `LLMJudge` in GEPA's *inner* loop (~`max_metric_calls` calls/run — a cost bomb). + Instead split it: keep GEPA's inner `fitness_metric` on `overlap`, and build a + separate `holdout_metric = make_judge_metric(config, skill_text=...)` + (`CUE_EVOLVE_HOLDOUT_METRIC`, default `judge`, soft-fallback) used only in the + holdout loop at `evolve_skill.py:~410-413`. +- **2c — `SubagentJudgeMetric`** (`fitness.py`, beside `make_judge_metric`): a + metric that runs the candidate through `run_claude_p` on a holdout example and + feeds the transcript to `LLMJudge.score()`. Holdout-ONLY (one subprocess/example + ≈120s); never pass it as the GEPA inner metric. Soft-fallback to overlap when + `claude` is absent. `--eval-source` default `synthetic → auto` (sessiondb then + synthetic) at `evolve_skill.py:~155`. + +## Environment notes + +- `evolution/.venv` is gitignored → absent in a worktree. Run edited source via the main venv: + `PYTHONPATH=/evolution /home/deadpool/Documents/cue/evolution/.venv/bin/python -m ...` + (PYTHONPATH shadows the editable main install — verified). +- `dspy` import is broken in this env (`libstdc++.so.6` missing), so live GEPA/LLMJudge can't run + here. Stage 1 is DSPy-free and fully verifiable; Stage 2 relies on the repo's existing dspy-mock + seam tests. Fixing dspy = a system-lib install (out of scope, network/sudo). diff --git a/evolution/evolution/core/config.py b/evolution/evolution/core/config.py index eb659ec3..6d0fab8a 100644 --- a/evolution/evolution/core/config.py +++ b/evolution/evolution/core/config.py @@ -45,6 +45,12 @@ class CueEvolutionConfig: # Optimization parameters iterations: int = 10 population_size: int = 5 + # Single-shot writer->lint->critic loop: how many writer rounds to spend + # repairing lint failures / acting on critic fixes before giving up. 1 = the + # old one-shot behaviour. Env-overridable so the Stop-hook loop can tune it. + writer_loop_rounds: int = field( + default_factory=lambda: max(1, int(os.getenv("CUE_WRITER_LOOP_ROUNDS", "2"))) + ) # LLM configuration (provider inferred from the string prefix by LiteLLM) optimizer_model: str = _DEFAULT_OPTIMIZER_MODEL diff --git a/evolution/evolution/skills/evolve_skill.py b/evolution/evolution/skills/evolve_skill.py index 6c251445..416e447a 100644 --- a/evolution/evolution/skills/evolve_skill.py +++ b/evolution/evolution/skills/evolve_skill.py @@ -145,6 +145,40 @@ def _finalize(config, skill_id, skill, skill_path, evolved_body, candidate_ok, return 0 +def _representative_task(config, skill_id: str) -> str: + """The most recent real user prompt that triggered a skill_gap for this skill, + mined from ~/.config/cue/analytics.jsonl (DSPy-free, stdlib only). + + This is what grounds the critic in GENUINE Claude Code usage: the writer's + rewrite is judged by how it behaves on the very task that exposed the gap. + Returns "" when there's no usable history (fresh machine, or the gap carried + no first_prompt) — the critic then falls back to text-only review. + """ + path = config.analytics_log + if not path.exists(): + return "" + best_ts, best_prompt = "", "" + try: + with open(path, encoding="utf-8") as f: + for line in f: + if '"skill_gap"' not in line: + continue + try: + ev = json.loads(line) + except json.JSONDecodeError: + continue + if ev.get("event") != "skill_gap" or ev.get("skill") != skill_id: + continue + fp = (ev.get("first_prompt") or "").strip() + ts = ev.get("ts", "") + # ISO-8601 ts strings sort lexicographically by recency. + if fp and ts >= best_ts: + best_ts, best_prompt = ts, fp + except OSError: + return "" + return best_prompt + + def evolve( skill_id: str, iterations: int = 10, @@ -214,8 +248,8 @@ def evolve( console.print("\n[bold green]DRY RUN — cue wiring validated.[/bold green]") console.print(f" Optimizer: {optimizer}") if optimizer == "single-shot": - console.print(f" Would propose an improved body in 1 `claude -p` call " - f"({claude_or_model(config)}), no DSPy/dataset") + console.print(f" Would run a writer→lint→critic loop (≤{config.writer_loop_rounds} " + f"round(s)) of `claude -p` calls ({claude_or_model(config)}), no DSPy/dataset") else: console.print(f" Would build eval dataset (source: {eval_source})") console.print(f" Would run GEPA ({iterations} iters, optimizer={config.optimizer_model})") @@ -226,39 +260,61 @@ def evolve( console.print(f" Backups + log → {config.evolution_log}") return 0 - # ── Single-shot optimizer: one claude -p call, no DSPy, no dataset, no key ── + # ── Single-shot optimizer: a short writer→lint→critic loop of claude -p ── + # calls, no DSPy, no dataset, no key. This is the path the Stop-hook loop + # runs. The writer proposes a body, the cue gate lints it, an INDEPENDENT + # critic (reviewer_model, not the writer) judges it, and the writer retries + # with the lint errors + critic fixes until BETTER or the round budget ends. if optimizer == "single-shot": - from evolution.skills.reflective import propose_improved_body, judge_is_better - console.print(f"\n[bold cyan]Single-shot reflective improve[/bold cyan] " - f"({claude_or_model(config)})...") - evolved_body = propose_improved_body(skill, config) - evolved_full = reassemble_skill(skill["frontmatter"], evolved_body) - console.print("\n[bold]Candidate constraints[/bold]") - candidate_results = validator.validate_all( - evolved_body, evolved_full, baseline_body=skill["body"]) - candidate_ok = _print_constraints(candidate_results) - - # Quality gate: an INDEPENDENT reviewer (config.reviewer_model, not the - # proposer) judges evolved vs baseline, fed the deterministic gate results - # as evidence. Auto-apply only on a BETTER verdict (skip the call in - # propose-only or when nothing changed — then it can't apply anyway). - quality_ok, judge_reason = None, "" - changed = evolved_body.strip() != skill["body"].strip() - if not propose_only and candidate_ok and changed: + from evolution.skills.reflective import writer_critic_loop + console.print(f"\n[bold cyan]Writer→lint→critic loop[/bold cyan] " + f"(≤{config.writer_loop_rounds} round(s), {claude_or_model(config)} writer / " + f"{claude_model_name(config.reviewer_model)} critic)...") + + def _validate(body: str) -> dict: + """Run the full cue constraint gate on a candidate body and package + the result for the loop: pass/fail, the per-constraint evidence the + critic reads, and the failing-constraint messages the writer repairs.""" + full = reassemble_skill(skill["frontmatter"], body) + results = validator.validate_all(body, full, baseline_body=skill["body"]) + ok = all(c.passed for c in results) evidence = "; ".join(f"{c.constraint_name}: {'pass' if c.passed else 'FAIL'}" - for c in candidate_results) - console.print(f"[bold]Independent review[/bold] ({claude_model_name(config.reviewer_model)}, " - f"evolved vs baseline)...") - quality_ok, judge_reason = judge_is_better( - skill, evolved_body, config, evidence=evidence) - console.print(f" {'✓' if quality_ok else '✗'} {judge_reason}") + for c in results) + lint_errors = "; ".join(f"{c.constraint_name}: {c.message}" + for c in results if not c.passed) + return {"ok": ok, "results": results, "evidence": evidence, "lint_errors": lint_errors} + + # Ground the critic in real usage: the most recent task that flagged this + # skill as a gap (mined from analytics.jsonl). "" → text-only review. + task_input = _representative_task(config, skill_id) + if task_input: + console.print(f" [dim]grounding critic on a real mined task " + f"({len(task_input)} chars)[/dim]") + + # The loop is propose-only-agnostic: it always iterates writer→critic to + # produce the best proposal; _finalize (below) is what refuses to APPLY + # when propose_only is set. So propose_only is NOT passed to the loop. + loop = writer_critic_loop( + skill, config, validate_fn=_validate, max_rounds=config.writer_loop_rounds, + task_input=task_input, console=console) + evolved_body = loop["body"] + console.print("\n[bold]Final candidate constraints[/bold]") + candidate_ok = (_print_constraints(loop["results"]) if loop["results"] is not None + else False) + + # The loop always returns an explicit quality_ok bool from the critic + # (or False when nothing changed / no candidate), so _finalize never falls + # through to its no-judge branch. In propose-only nothing is applied + # regardless, but the verdict is still logged for review. + quality_ok = loop["quality_ok"] + judge_reason = loop["judge_reason"] return _finalize( config, skill_id, skill, skill_path, evolved_body, candidate_ok, improvement=None, quality_ok=quality_ok, propose_only=propose_only, - extra_meta={"optimizer": "single-shot", "optimizer_model": config.optimizer_model, - "baseline_size": len(skill["body"]), "evolved_size": len(evolved_body), - "judge": judge_reason}) + extra_meta={"optimizer": "writer-loop", "optimizer_model": config.optimizer_model, + "rounds": loop["rounds"], "baseline_size": len(skill["body"]), + "evolved_size": len(evolved_body), "judge": judge_reason}) # ── Heavy imports happen ONLY for a real GEPA run ──────────────────── try: diff --git a/evolution/evolution/skills/reflective.py b/evolution/evolution/skills/reflective.py index c30cd2a1..74cd675d 100644 --- a/evolution/evolution/skills/reflective.py +++ b/evolution/evolution/skills/reflective.py @@ -1,12 +1,19 @@ """Single-shot reflective skill-body improver — the lightweight optimizer. GEPA's iterative loop makes dozens of LLM calls and is slow on every backend. -This optimizer does the job in ONE `claude -p` call: read the current skill body -(plus any friction signals), propose an improved body, return it. The caller -then runs the same `cue lint-skill` gate + apply/proposal decision as GEPA. +This optimizer does the job in a few `claude -p` calls: read the current skill +body (plus any friction signals), propose an improved body, lint it, and let an +INDEPENDENT critic judge it — retrying the writer with the lint errors and the +critic's fixes as feedback until it passes or the round budget runs out. The +caller then runs the same `cue lint-skill` gate + apply/proposal decision. Needs NO DSPy and NO API key — just `claude -p` (the user's Claude Code auth). -This is the path that actually delivers "skills improve as you use cue" cheaply. +This is the path that actually delivers "skills improve as you use cue", and it +is the path the automated Stop-hook loop runs (single-shot, propose-only). + +The multi-agent loop (writer -> lint -> critic -> retry) lives in +`writer_critic_loop`. `propose_improved_body` / `judge_is_better` remain as +thin one-shot wrappers so existing callers keep working. """ import re @@ -43,19 +50,46 @@ """ -def propose_improved_body(skill: dict, config, signals: str = "", timeout: int = 300) -> str: - """One `claude -p` call → an improved skill body string. +_LINT_RETRY_TMPL = """ +Your PREVIOUS attempt FAILED the cue gate. Fix exactly these and try again, +keeping every critical command/path/flag: +{lint} +""" + +_CRITIC_FIX_TMPL = """ +An independent reviewer judged a prior rewrite "{verdict_reason}". +Make these concrete fixes this time: +{fixes} +""" + - Falls back to the original body if the model returns nothing usable (so a - bad response becomes a no-op "body unchanged", never a broken skill). +def writer_step(skill: dict, config, signals: str = "", lint_feedback: str = "", + timeout: int = 300) -> str: + """One writer `claude -p` call → an improved skill body string. + + `signals` carries observed friction plus any critic fixes from a prior round; + `lint_feedback` carries the `cue lint-skill` errors from a prior round so the + writer can repair them. Falls back to the original body if the model returns + nothing usable (a bad response becomes a no-op "body unchanged", never a + broken skill). """ model = claude_model_name(config.optimizer_model) - signals_block = f"\nObserved friction to address:\n{signals}\n" if signals.strip() else "" - prompt = _PROMPT.format(desc=skill["description"], body=skill["body"], signals_block=signals_block) + blocks = "" + if signals.strip(): + blocks += f"\nObserved friction / fixes to address:\n{signals}\n" + if lint_feedback.strip(): + blocks += _LINT_RETRY_TMPL.format(lint=lint_feedback) + prompt = _PROMPT.format(desc=skill["description"], body=skill["body"], signals_block=blocks) out = run_claude_p(prompt, model=model, timeout=timeout) return _extract_body(out, fallback=skill["body"]) +def propose_improved_body(skill: dict, config, signals: str = "", timeout: int = 300) -> str: + """Back-compat one-shot writer (no lint-retry loop). Thin wrapper over + `writer_step` so existing callers and tests keep working.""" + return writer_step(skill, config, signals=signals, timeout=timeout) + + _JUDGE_PROMPT = """You are a strict, INDEPENDENT reviewer of Claude Code SKILL.md bodies — you did not write the revision and have no stake in it. Decide whether the REVISED body is genuinely BETTER than the ORIGINAL for an agent deciding when and how to use this @@ -74,32 +108,175 @@ def propose_improved_body(skill: dict, config, signals: str = "", timeout: int = {revised} +{demo_block} +Reply on the FIRST line, exactly: VERDICT: BETTER|EQUAL|WORSE — +If the verdict is EQUAL or WORSE, add a SECOND line listing concrete, specific +fixes the writer should make next: +FIXES: """ -Reply on a single line, exactly: VERDICT: BETTER|EQUAL|WORSE — """ +def critic_step(skill: dict, evolved_body: str, config, evidence: str = "", + task_demo: str = "", timeout: int = 180): + """Independent reviewer `claude -p` call. Returns + (is_better: bool, reason: str, suggested_fixes: str). -def judge_is_better(skill: dict, evolved_body: str, config, timeout: int = 180, - evidence: str = ""): - """Independent reviewer `claude -p` call: is the evolved body genuinely - better? Returns (is_better: bool, reason: str). Conservative — anything but - BETTER → False. - - Uses `config.reviewer_model` (a DIFFERENT, stronger model than the proposer's - `optimizer_model`) so the rewrite isn't graded by its own author, and is fed - deterministic `evidence` (lint/size/token-preservation results) to anchor the - verdict in facts rather than vibes. + Conservative — anything but BETTER → is_better False. On a non-BETTER verdict + the critic also returns actionable `suggested_fixes` so the writer can repair + the candidate on the next round. Uses `config.reviewer_model` (a DIFFERENT, + stronger model than the writer's `optimizer_model`) so the rewrite isn't + graded by its own author, anchored on deterministic `evidence` (lint/size/ + token-preservation results). + + When `task_demo` is supplied (a transcript of the REVISED skill running on a + real mined task), the critic judges actual behaviour, not just prose — this + is the "score by running through a real Claude Code subagent on a mined task" + signal, grounded in genuine usage. """ model = claude_model_name(config.reviewer_model) evidence_block = f"\nDeterministic gate results (already checked):\n{evidence}\n" if evidence.strip() else "" + demo_block = (f"\nHow the REVISED skill actually behaved on a real past task that" + f" needed it (judge the BEHAVIOUR, not just the prose):\n\n" + f"{task_demo.strip()[:4000]}\n\n") if task_demo.strip() else "" prompt = _JUDGE_PROMPT.format( desc=skill["description"], original=skill["body"], revised=evolved_body, - evidence_block=evidence_block) + evidence_block=evidence_block, demo_block=demo_block) out = run_claude_p(prompt, model=model, timeout=timeout) - m = re.search(r"VERDICT:\s*(BETTER|EQUAL|WORSE)\s*[—\-:]*\s*(.*)", out, re.IGNORECASE) + m = re.search(r"VERDICT:\s*(BETTER|EQUAL|WORSE)\s*[—\-:]*\s*([^\n]*)", out, re.IGNORECASE) if not m: - return False, f"unparseable judge verdict: {out.strip()[:120]}" + return False, f"unparseable judge verdict: {out.strip()[:120]}", "" verdict = m.group(1).upper() - return verdict == "BETTER", f"{verdict}: {m.group(2).strip()[:160]}" + reason = f"{verdict}: {m.group(2).strip()[:160]}" + fm = re.search(r"FIXES:\s*(.+)", out, re.IGNORECASE | re.DOTALL) + fixes = fm.group(1).strip()[:400] if fm else "" + return verdict == "BETTER", reason, fixes + + +def judge_is_better(skill: dict, evolved_body: str, config, timeout: int = 180, + evidence: str = ""): + """Back-compat one-shot reviewer. Thin wrapper over `critic_step` that drops + the suggested-fixes field. Returns (is_better, reason).""" + is_better, reason, _ = critic_step(skill, evolved_body, config, evidence=evidence, timeout=timeout) + return is_better, reason + + +_SKILL_RUN_PROMPT = """You are using the following Claude Code SKILL to handle a user request. +Follow the skill's instructions exactly as written; do not improvise beyond it. + +--- SKILL (instructions to follow) --- +{body} +--- END SKILL --- + +User request: +{task} + +Respond exactly as you would when actually performing this task using the skill.""" + + +def run_skill_on_task(body: str, task: str, config, timeout: int = 180) -> str: + """Run a candidate skill body AS INSTRUCTIONS against a real mined task via + `claude -p`, returning the transcript the critic then judges. DSPy-free. + + Fails soft to "" on any `claude -p` outage, so grounding is best-effort: a + missing transcript just drops the critic back to text-only review. + """ + if not task.strip(): + return "" + model = claude_model_name(config.optimizer_model) + try: + return run_claude_p(_SKILL_RUN_PROMPT.format(body=body, task=task.strip()), + model=model, timeout=timeout) + except RuntimeError: + return "" + + +def writer_critic_loop(skill: dict, config, validate_fn, max_rounds: int = 2, + signals: str = "", task_input: str = "", console=None, + timeout_write: int = 300, timeout_judge: int = 180) -> dict: + """Multi-agent writer -> lint -> critic loop (all `claude -p`, DSPy-free). + + Each round: + 1. WRITER proposes a body (fed the prior round's lint errors AND the + critic's suggested fixes as feedback). + 2. `validate_fn(body)` runs the cue constraint gate (incl. `cue lint-skill`) + and returns {ok, results, evidence, lint_errors}. + 3. If lint fails and rounds remain → retry the writer with the lint errors. + 4. CRITIC judges evolved vs baseline. BETTER → accept and stop. EQUAL/WORSE + with rounds remaining → retry the writer with the critic's fixes. + + Returns the best body seen (a lint-clean candidate is preferred over a + lint-failing one): + {body, candidate_ok, results, quality_ok, judge_reason, rounds} + + Runs in propose-only too — the loop still iterates to produce a higher-quality + PROPOSAL; the caller's `_finalize` is what refuses to apply. Cost scales with + rounds: up to `max_rounds` writer calls + up to `max_rounds` critic calls. + """ + max_rounds = max(1, int(max_rounds)) + best = {"body": skill["body"], "candidate_ok": False, "results": None, + "quality_ok": False, "judge_reason": "no candidate", "rounds": 0} + lint_feedback = "" + critic_fixes = "" + + for rnd in range(1, max_rounds + 1): + extra = "\n".join(s for s in (signals, critic_fixes) if s.strip()) + # Fail-soft: a `claude -p` outage must degrade to "keep best so far" + # (a proposal), never crash the run — the Stop-hook loop runs unattended. + try: + body = writer_step(skill, config, signals=extra, lint_feedback=lint_feedback, + timeout=timeout_write) + except RuntimeError as exc: + if console: + console.print(f" [yellow]✗ round {rnd}: writer unavailable ({exc})[/yellow]") + break + v = validate_fn(body) + changed = body.strip() != skill["body"].strip() + + # Keep this candidate as best-so-far if it's lint-clean, or if we have + # nothing better yet (first attempt). A later lint-failing round must + # never clobber an earlier lint-clean one. + if v["ok"] or best["results"] is None: + best = {"body": body, "candidate_ok": v["ok"], "results": v["results"], + "quality_ok": False, + # "" gets overwritten by the critic below; a lint-failing + # candidate never reaches the critic, so log WHY, not blank. + "judge_reason": "" if v["ok"] else "lint failed; not judged", + "rounds": rnd} + + if not v["ok"]: + lint_feedback = v["lint_errors"] + if console: + console.print(f" [yellow]✗ round {rnd}: lint failed → retry with feedback[/yellow]") + continue + + if not changed: + best["judge_reason"] = "body unchanged" + break + + # Ground the critic in real behaviour: run the candidate on a mined task + # and let the critic judge the transcript, not just the diff. Best-effort. + task_demo = run_skill_on_task(body, task_input, config, timeout=timeout_judge) if task_input.strip() else "" + try: + is_better, reason, fixes = critic_step( + skill, body, config, evidence=v["evidence"], task_demo=task_demo, + timeout=timeout_judge) + except RuntimeError as exc: + # Critic outage: keep the lint-clean candidate as a proposal, don't apply. + best["quality_ok"], best["judge_reason"] = False, f"critic unavailable: {exc}" + if console: + console.print(f" [yellow]✗ round {rnd} critic unavailable ({exc})[/yellow]") + break + best["quality_ok"], best["judge_reason"] = is_better, reason + if console: + console.print(f" [{'green' if is_better else 'yellow'}]" + f"{'✓' if is_better else '✗'} round {rnd} critic: {reason}[/]") + if is_better: + break + + # Not better: feed the critic's fixes back to the writer for another round. + critic_fixes = _CRITIC_FIX_TMPL.format(verdict_reason=reason, fixes=fixes) if fixes else "" + lint_feedback = "" # lint passed this round; don't re-send stale errors + + return best def _extract_body(text: str, fallback: str) -> str: diff --git a/evolution/tests/test_writer_loop.py b/evolution/tests/test_writer_loop.py new file mode 100644 index 00000000..53915c42 --- /dev/null +++ b/evolution/tests/test_writer_loop.py @@ -0,0 +1,294 @@ +"""Seam tests for the single-shot writer->lint->critic loop (reflective.py). + +DSPy-free: every model call is `run_claude_p`, which we monkeypatch. We route a +faked response by prompt content (writer prompts carry the body block; +critic prompts ask for a VERDICT line). validate_fn is a test double that fails +lint for any body containing the marker "LINTFAIL". +""" + +import json +from types import SimpleNamespace + +import pytest + +from evolution.skills import reflective + + +def _cfg(): + # claude_model_name only needs the model strings; run_claude_p is patched. + return SimpleNamespace(optimizer_model="claude-code/sonnet", + reviewer_model="claude-code/opus") + + +def _skill(body="OLD BODY"): + return {"description": "a test skill", "body": body} + + +class FakeClaude: + """Scripted run_claude_p. `writer` and `critic` are lists of responses popped + per call; records every prompt for assertions.""" + + def __init__(self, writer, critic): + self.writer = list(writer) + self.critic = list(critic) + self.writer_prompts = [] + self.critic_prompts = [] + + def __call__(self, prompt, model="sonnet", timeout=300): + if "" in prompt: # writer + self.writer_prompts.append(prompt) + return self.writer.pop(0) + # critic + self.critic_prompts.append(prompt) + return self.critic.pop(0) + + +def _validate(body): + ok = "LINTFAIL" not in body + return { + "ok": ok, + "results": [SimpleNamespace(passed=ok, constraint_name="lint", + message="ok" if ok else "R001: body too long")], + "evidence": f"lint: {'pass' if ok else 'FAIL'}", + "lint_errors": "" if ok else "lint: R001: body too long", + } + + +def _wrap(body): + return f"{body}" + + +def test_happy_path_one_round_better(monkeypatch): + fake = FakeClaude(writer=[_wrap("NEW GOOD BODY")], + critic=["VERDICT: BETTER — clearer triggers"]) + monkeypatch.setattr(reflective, "run_claude_p", fake) + out = reflective.writer_critic_loop(_skill(), _cfg(), validate_fn=_validate, max_rounds=2) + assert out["body"] == "NEW GOOD BODY" + assert out["candidate_ok"] is True + assert out["quality_ok"] is True + assert out["rounds"] == 1 + assert out["judge_reason"].startswith("BETTER") + assert len(fake.writer_prompts) == 1 and len(fake.critic_prompts) == 1 + + +def test_lint_fail_then_retry_passes(monkeypatch): + fake = FakeClaude(writer=[_wrap("LINTFAIL body"), _wrap("clean body v2")], + critic=["VERDICT: BETTER — good"]) + monkeypatch.setattr(reflective, "run_claude_p", fake) + out = reflective.writer_critic_loop(_skill(), _cfg(), validate_fn=_validate, max_rounds=2) + assert out["body"] == "clean body v2" + assert out["candidate_ok"] is True + assert out["quality_ok"] is True + assert out["rounds"] == 2 + # the 2nd writer call must have been told about the lint failure + assert "R001" in fake.writer_prompts[1] or "FAILED the cue gate" in fake.writer_prompts[1] + + +def test_critic_worse_feeds_fixes_back(monkeypatch): + fake = FakeClaude( + writer=[_wrap("v1 clean"), _wrap("v2 clean")], + critic=["VERDICT: WORSE — dropped a flag\nFIXES: restore the --json flag", + "VERDICT: EQUAL — no real gain\nFIXES: tighten the trigger"]) + monkeypatch.setattr(reflective, "run_claude_p", fake) + out = reflective.writer_critic_loop(_skill(), _cfg(), validate_fn=_validate, max_rounds=2) + assert out["body"] == "v2 clean" # last lint-clean candidate kept + assert out["candidate_ok"] is True + assert out["quality_ok"] is False # never reached BETTER + assert out["rounds"] == 2 + assert out["judge_reason"].startswith("EQUAL") + # round-2 writer prompt carries the critic's fixes from round 1 + assert "restore the --json flag" in fake.writer_prompts[1] + + +def test_lintfail_round_never_clobbers_clean_best(monkeypatch): + # round1 clean but WORSE; round2 lint-fails → best must remain round1's clean body + fake = FakeClaude(writer=[_wrap("clean v1"), _wrap("LINTFAIL v2")], + critic=["VERDICT: WORSE — meh\nFIXES: do better"]) + monkeypatch.setattr(reflective, "run_claude_p", fake) + out = reflective.writer_critic_loop(_skill(), _cfg(), validate_fn=_validate, max_rounds=2) + assert out["body"] == "clean v1" + assert out["candidate_ok"] is True + + +def test_critic_outage_keeps_proposal(monkeypatch): + def boom_or_write(prompt, model="sonnet", timeout=300): + if "" in prompt: + return _wrap("clean candidate") + raise RuntimeError("claude CLI not on PATH") + monkeypatch.setattr(reflective, "run_claude_p", boom_or_write) + out = reflective.writer_critic_loop(_skill(), _cfg(), validate_fn=_validate, max_rounds=2) + assert out["body"] == "clean candidate" + assert out["candidate_ok"] is True + assert out["quality_ok"] is False + assert "critic unavailable" in out["judge_reason"] + + +def test_writer_outage_falls_back_to_original(monkeypatch): + def boom(prompt, model="sonnet", timeout=300): + raise RuntimeError("claude -p timed out") + monkeypatch.setattr(reflective, "run_claude_p", boom) + out = reflective.writer_critic_loop(_skill("ORIGINAL"), _cfg(), validate_fn=_validate, max_rounds=2) + assert out["body"] == "ORIGINAL" + assert out["quality_ok"] is False + + +def test_critic_step_returns_fixes(monkeypatch): + monkeypatch.setattr(reflective, "run_claude_p", + lambda *a, **k: "VERDICT: WORSE — dropped detail\nFIXES: restore the path; keep the flag") + is_better, reason, fixes = reflective.critic_step(_skill(), "new body", _cfg()) + assert is_better is False + assert reason.startswith("WORSE") + assert "restore the path" in fixes + + +class GroundedFake: + """3-way router: skill-run (instructions block), writer (), critic.""" + + def __init__(self, writer, skillrun, critic): + self.writer, self.skillrun, self.critic = list(writer), list(skillrun), list(critic) + self.prompts = {"writer": [], "skillrun": [], "critic": []} + + def __call__(self, prompt, model="sonnet", timeout=300): + if "--- SKILL (instructions to follow) ---" in prompt: + self.prompts["skillrun"].append(prompt) + return self.skillrun.pop(0) + if "" in prompt: + self.prompts["writer"].append(prompt) + return self.writer.pop(0) + self.prompts["critic"].append(prompt) + return self.critic.pop(0) + + +def test_run_skill_on_task(monkeypatch): + monkeypatch.setattr(reflective, "run_claude_p", lambda prompt, **k: f"ran:{'task X' in prompt}") + assert reflective.run_skill_on_task("BODY", "do task X", _cfg()) == "ran:True" + assert reflective.run_skill_on_task("BODY", " ", _cfg()) == "" # empty task → no call + + +def test_run_skill_on_task_outage(monkeypatch): + def boom(*a, **k): + raise RuntimeError("claude CLI not on PATH") + monkeypatch.setattr(reflective, "run_claude_p", boom) + assert reflective.run_skill_on_task("BODY", "task", _cfg()) == "" + + +def test_critic_step_includes_task_demo(monkeypatch): + seen = {} + + def cap(prompt, **k): + seen["p"] = prompt + return "VERDICT: BETTER — behaved well on the task" + monkeypatch.setattr(reflective, "run_claude_p", cap) + is_better, _, _ = reflective.critic_step(_skill(), "newbody", _cfg(), task_demo="THE TRANSCRIPT") + assert is_better is True + assert "THE TRANSCRIPT" in seen["p"] and "" in seen["p"] + + +def test_loop_grounds_critic_with_real_transcript(monkeypatch): + fake = GroundedFake(writer=[_wrap("clean body")], + skillrun=["TRANSCRIPT: skill ran and produced X"], + critic=["VERDICT: BETTER — good behaviour"]) + monkeypatch.setattr(reflective, "run_claude_p", fake) + out = reflective.writer_critic_loop(_skill(), _cfg(), validate_fn=_validate, + max_rounds=1, task_input="real user task") + assert out["quality_ok"] is True + assert len(fake.prompts["skillrun"]) == 1 + assert "real user task" in fake.prompts["skillrun"][0] # candidate ran on the mined task + assert "TRANSCRIPT: skill ran" in fake.prompts["critic"][0] # critic saw the transcript + + +def test_representative_task_mining(tmp_path): + from evolution.skills import evolve_skill + f = tmp_path / "analytics.jsonl" + f.write_text( + '{"event":"skill_gap","skill":"meta/foo","ts":"2026-06-01T00:00:00Z","first_prompt":"old prompt"}\n' + '{"event":"skill_gap","skill":"meta/foo","ts":"2026-06-10T00:00:00Z","first_prompt":"newer prompt"}\n' + '{"event":"skill_gap","skill":"meta/bar","ts":"2026-06-11T00:00:00Z","first_prompt":"other skill"}\n' + '{"event":"skill_hit","skill":"meta/foo","ts":"2026-06-12T00:00:00Z"}\n', + encoding="utf-8") + cfg = SimpleNamespace(analytics_log=f) + assert evolve_skill._representative_task(cfg, "meta/foo") == "newer prompt" # most recent gap + assert evolve_skill._representative_task(cfg, "meta/none") == "" # no gaps + cfg2 = SimpleNamespace(analytics_log=tmp_path / "missing.jsonl") + assert evolve_skill._representative_task(cfg2, "meta/foo") == "" # no file + + +def test_critic_equal_then_better(monkeypatch): + # round1 EQUAL (with fixes) → retry → round2 BETTER. The loop's core claim. + fake = FakeClaude(writer=[_wrap("v1 clean"), _wrap("v2 better")], + critic=["VERDICT: EQUAL — flat\nFIXES: sharpen the trigger line", + "VERDICT: BETTER — sharper now"]) + monkeypatch.setattr(reflective, "run_claude_p", fake) + out = reflective.writer_critic_loop(_skill(), _cfg(), validate_fn=_validate, max_rounds=2) + assert out["quality_ok"] is True + assert out["rounds"] == 2 + assert out["body"] == "v2 better" + assert "sharpen the trigger line" in fake.writer_prompts[1] # EQUAL fixes fed forward + + +def test_all_rounds_lint_fail(monkeypatch): + fake = FakeClaude(writer=[_wrap("LINTFAIL a"), _wrap("LINTFAIL b")], critic=[]) + monkeypatch.setattr(reflective, "run_claude_p", fake) + out = reflective.writer_critic_loop(_skill(), _cfg(), validate_fn=_validate, max_rounds=2) + assert out["candidate_ok"] is False + assert out["quality_ok"] is False + assert out["judge_reason"] != "" # clear log sentinel, not blank + assert len(fake.critic_prompts) == 0 # critic never runs on a lint-failing candidate + + +def test_max_rounds_one_worse_respects_budget(monkeypatch): + fake = FakeClaude(writer=[_wrap("only clean body")], critic=["VERDICT: WORSE — nope"]) + monkeypatch.setattr(reflective, "run_claude_p", fake) + out = reflective.writer_critic_loop(_skill(), _cfg(), validate_fn=_validate, max_rounds=1) + assert out["rounds"] == 1 + assert out["quality_ok"] is False + assert out["body"] == "only clean body" + assert len(fake.writer_prompts) == 1 # budget enforced — no 2nd writer call + + +def test_evolve_single_shot_integration(tmp_path, monkeypatch): + """Exercise the REAL evolve() single-shot call site with claude -p and the cue + gate mocked. The unit tests call writer_critic_loop directly and so cannot + catch a signature mismatch at the call site — this test can (and does).""" + monkeypatch.setenv("XDG_CONFIG_HOME", str(tmp_path / "cfg")) # isolate log + analytics + from evolution.skills import evolve_skill + + skill = {"name": "demo", "description": "a demo skill", "body": "OLD BODY", + "frontmatter": "---\nname: demo\ndescription: a demo skill\n---", + "raw": "---\nname: demo\ndescription: a demo skill\n---\nOLD BODY"} + monkeypatch.setattr(evolve_skill, "find_skill", lambda sid, root: tmp_path / "SKILL.md") + monkeypatch.setattr(evolve_skill, "load_skill", lambda p: skill) + + class FakeValidator: + def __init__(self, config): + pass + + def validate_all(self, body, raw, baseline_body=None): + return [SimpleNamespace(passed=True, constraint_name="lint", message="ok")] + monkeypatch.setattr(evolve_skill, "ConstraintValidator", FakeValidator) + + def fake_claude(prompt, model="sonnet", timeout=300): + return ("NEW IMPROVED BODY" if "" in prompt + else "VERDICT: BETTER — clearer") + monkeypatch.setattr(reflective, "run_claude_p", fake_claude) + + rc = evolve_skill.evolve(skill_id="meta/demo", optimizer="single-shot", + propose_only=True, cue_repo=str(tmp_path)) + assert rc == 0 # would be a TypeError crash if the call site drifted from the loop signature + props = list((tmp_path / "evolution" / "proposals").rglob("evolved_SKILL.md")) + assert props, "expected a proposal file in propose-only mode" + assert "NEW IMPROVED BODY" in props[0].read_text() + log = tmp_path / "cfg" / "cue" / "evolution-log.jsonl" + assert log.exists() + entry = json.loads(log.read_text().strip().splitlines()[-1]) + assert entry["optimizer"] == "writer-loop" and entry["applied"] is False + + +def test_back_compat_wrappers(monkeypatch): + # judge_is_better → 2-tuple; propose_improved_body → extracted body + monkeypatch.setattr(reflective, "run_claude_p", + lambda prompt, **k: ("VERDICT: BETTER — ok" if "" not in prompt + else _wrap("WRAPPED BODY"))) + is_better, reason = reflective.judge_is_better(_skill(), "x", _cfg()) + assert is_better is True and reason == "BETTER: ok" + assert reflective.propose_improved_body(_skill(), _cfg()) == "WRAPPED BODY" diff --git a/profiles/core/profile.yaml b/profiles/core/profile.yaml index f54d360f..a1677bac 100644 --- a/profiles/core/profile.yaml +++ b/profiles/core/profile.yaml @@ -302,3 +302,16 @@ hooks: # Edit-scope guard, opt-in via /freeze . No-op until # ~/.config/cue/freeze-dir exists. - freeze-edit-scope.json + # Stop hook: cue self-learner CAPTURE. Records where the active profile's skills + # fell short (friction signals + an optional once-per-session live critic) as + # skill_gap events in ~/.config/cue/analytics.jsonl. Promoted from the + # skill-writer pilot to core so every profile learns. Never blocks Stop, fail- + # open. OFF until BOTH `touch ~/.config/cue/.auto-improve-enabled` and telemetry + # consent exist. See docs/self-learner.md. + - profile-self-improve.json + # Stop hook: cue self-learner ACT. Picks the most-flagged skill from the + # captured skill_gap signals and evolves it (single-shot writer→lint→critic + # loop, propose-only). 24h cooldown, canary auto-revert, never blocks Stop. + # OFF until ALSO `touch ~/.config/cue/.auto-evolve-enabled`. See + # docs/self-evolution-upgrade-plan.md. + - auto-evolve.json diff --git a/resources/hooks/auto-evolve.sh b/resources/hooks/auto-evolve.sh index 19542c92..1efca612 100755 --- a/resources/hooks/auto-evolve.sh +++ b/resources/hooks/auto-evolve.sh @@ -26,7 +26,18 @@ CFG="${XDG_CONFIG_HOME:-$HOME/.config}/cue" [ -f "$CFG/.auto-improve-enabled" ] || exit 0 [ -f "$CFG/.auto-evolve-enabled" ] || exit 0 -EVO_DIR="${CUE_EVOLUTION_DIR:-$HOME/Documents/cue/evolution}" +# Resolve the evolution/ package dir robustly so a core-wide promotion works +# regardless of where cue is checked out: +# 1. CUE_EVOLUTION_DIR wins (explicit override). +# 2. else derive it from THIS script's real path — the materialized hook is a +# symlink back into the repo at resources/hooks/, so the repo root is two +# dirs up and the package is /evolution. python3 (already required +# below for portable mtime) does the realpath so this works on macOS too. +# 3. else the known default checkout. +_self="$(python3 -c 'import os,sys;print(os.path.realpath(sys.argv[1]))' "${BASH_SOURCE[0]}" 2>/dev/null || echo "${BASH_SOURCE[0]}")" +_repo="$(cd "$(dirname "$_self")/../.." 2>/dev/null && pwd || true)" +EVO_DIR="${CUE_EVOLUTION_DIR:-${_repo:+$_repo/evolution}}" +[ -n "$EVO_DIR" ] && [ -x "$EVO_DIR/bin/auto-evolve" ] || EVO_DIR="$HOME/Documents/cue/evolution" wrapper="$EVO_DIR/bin/auto-evolve" [ -x "$wrapper" ] || exit 0