opencue · NagyVikt · Jun 13, 2026
diff --git a/docs/self-evolution-upgrade-plan.md b/docs/self-evolution-upgrade-plan.md
@@ -0,0 +1,74 @@
+# Plan: Hermes-style skill self-evolution — full upgrade
+
+> Status: **in progress** on branch `feat/skill-evolution-upgrade` (worktree-isolated).
+> Builds on the existing `evolution/` package (ported from NousResearch/hermes-agent-self-evolution)
+> and the `docs/self-learner.md` Stop-hook loop. The loop already exists; this makes it
+> real, well-judged, and wired into `core` — default-OFF, propose-only.
+
+## Why this work
+
+The Hermes-ported evolver is complete but **dormant**: default-OFF, pilot-wired into
+`skill-writer` only, never run (0 `skill_gap` events on the dev machine), and carrying two
+documented quality caveats (keyword-overlap metric by default; eval scores a synthetic proxy,
+not real Claude Code behaviour). The **automated** path uses the `single-shot` optimizer
+(`auto_evolve.py` → `evolve(optimizer="single-shot")`), *not* GEPA — so a one-shot rewrite +
+one text-diff judge is all that runs today.
+
+## Resolved decisions
+
+- **D1 — Writer-critic loop, default 2 rounds** (`CUE_WRITER_LOOP_ROUNDS`). Retry on `WORSE`
+  always; on `EQUAL` only when the critic returned actionable fixes and rounds remain.
+  `propose_improved_body`/`judge_is_better` stay as back-compat wrappers (hooks import by name).
+- **D2 — Task-grounded critic (DSPy-free):** critic runs ONE mined task through the candidate
+  skill via `run_claude_p`, then judges the real transcript with the existing `run_claude_p`
+  judge prompt. Soft-falls-back to text-diff review. 1 subagent call/round — not the
+  20×/iteration cost bomb.
+- **D3 — Judge defaults (GEPA/holdout path):** acceptance/holdout metric default `overlap → judge`
+  (LLMJudge); GEPA *inner* metric stays `overlap` (cost), `judge` opt-in; new `--metric subagent`
+  is holdout-only, cost-flagged, soft-fallback. `--eval-source` default `synthetic → auto`.
+- **D4 — Activation:** wire `profile-self-improve.json` + `auto-evolve.json` + `learnings-surface.json`
+  into `core`, default-OFF behind the flag files; fix the `CUE_EVOLUTION_DIR` portability gap.
+  Enabling on a machine is a separate explicit step, propose-only.
+- **D5 — propose-only everywhere.** No auto-apply enabled by this work.
+
+## Stages (each independently verifiable + revertable)
+
+| Stage | Work | Verify |
+|---|---|---|
+| 0 | Worktree + baseline | tests green (78p/2s) + `auto_evolve --dry-run` ✅ |
+| 1 | Writer-critic loop (`reflective.py`, `evolve_skill.py`, `config.py`) | retry-logic unit test; `evolve <skill> --propose-only` → lint-passing proposal logged `optimizer:"writer-loop"` |
+| 2a | **Task-grounded critic** (`reflective.py`, `evolve_skill.py`) — critic runs the candidate on a real mined task (`run_claude_p`) and judges the transcript | ✅ done: grounding + mining unit tests; loop feeds the critic a real transcript |
+| 2b/2c | GEPA `judge` default + `SubagentJudgeMetric` (holdout) | **deferred** — see below |
+| 3 | Activate into `core` (`profiles/core/profile.yaml`, `auto-evolve.sh`) | `cue validate` clean; materialized `settings.json` shows both Stop hooks; flags-OFF = no-op |
+| 4 | Review + ship | no CRITICAL/HIGH; full suite green; gated PR |
+
+## Deferred: Stage 2b/2c (GEPA judge default + subagent holdout metric)
+
+Cut from this pass on purpose: they only touch the **manual GEPA** path (the
+automated Stop-hook loop runs `single-shot`, never GEPA), and they **cannot be
+live-verified in this environment** (`dspy` import is broken). Stage 2a already
+delivers the spirit of "default to the LLM judge on real behaviour" for the one
+path that actually runs. Ready-to-execute change-points for when `dspy` works:
+
+- **2b — default the holdout/acceptance metric to `judge`.** Do NOT naively flip
+  `evolve_skill.py:~205` `metric_mode` default `overlap → judge`: that puts
+  `LLMJudge` in GEPA's *inner* loop (~`max_metric_calls` calls/run — a cost bomb).
+  Instead split it: keep GEPA's inner `fitness_metric` on `overlap`, and build a
+  separate `holdout_metric = make_judge_metric(config, skill_text=...)`
+  (`CUE_EVOLVE_HOLDOUT_METRIC`, default `judge`, soft-fallback) used only in the
+  holdout loop at `evolve_skill.py:~410-413`.
+- **2c — `SubagentJudgeMetric`** (`fitness.py`, beside `make_judge_metric`): a
+  metric that runs the candidate through `run_claude_p` on a holdout example and
+  feeds the transcript to `LLMJudge.score()`. Holdout-ONLY (one subprocess/example
+  ≈120s); never pass it as the GEPA inner metric. Soft-fallback to overlap when
+  `claude` is absent. `--eval-source` default `synthetic → auto` (sessiondb then
+  synthetic) at `evolve_skill.py:~155`.
+
+## Environment notes
+
+- `evolution/.venv` is gitignored → absent in a worktree. Run edited source via the main venv:
+  `PYTHONPATH=<worktree>/evolution /home/deadpool/Documents/cue/evolution/.venv/bin/python -m ...`
+  (PYTHONPATH shadows the editable main install — verified).
+- `dspy` import is broken in this env (`libstdc++.so.6` missing), so live GEPA/LLMJudge can't run
+  here. Stage 1 is DSPy-free and fully verifiable; Stage 2 relies on the repo's existing dspy-mock
+  seam tests. Fixing dspy = a system-lib install (out of scope, network/sudo).
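
Reviewer note on D1: the retry rule in the plan ("retry on `WORSE` always; on `EQUAL` only when the critic returned actionable fixes and rounds remain") can be read as a small pure function, which is also what a retry-logic unit test for Stage 1 would exercise. The sketch below is an illustration under assumptions, not the code in `reflective.py`: `CriticVerdict` and `should_retry` are hypothetical names, and it assumes the round budget caps `WORSE` retries as well.

```python
from dataclasses import dataclass, field


@dataclass
class CriticVerdict:
    """Hypothetical shape of one critic result (not the real type in reflective.py)."""
    verdict: str  # "BETTER", "EQUAL", or "WORSE"
    fixes: list[str] = field(default_factory=list)  # actionable fixes, if any


def should_retry(verdict: CriticVerdict, round_no: int, max_rounds: int) -> bool:
    """Return True when the writer should get another round under D1.

    round_no is 1-based. Assumption: the round budget bounds every retry,
    including retries after a WORSE verdict.
    """
    if round_no >= max_rounds:
        return False  # budget exhausted, keep whatever we have
    if verdict.verdict == "WORSE":
        return True  # always worth another attempt while rounds remain
    if verdict.verdict == "EQUAL":
        return bool(verdict.fixes)  # retry only if the critic gave something to act on
    return False  # BETTER (or anything unexpected): stop
```

With the default of 2 rounds, a `WORSE` verdict in round 1 triggers exactly one retry, and an `EQUAL` verdict without fixes stops the loop immediately.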
diff --git a/evolution/evolution/core/config.py b/evolution/evolution/core/config.py
@@ -45,6 +45,12 @@ class CueEvolutionConfig:
     # Optimization parameters
     iterations: int = 10
     population_size: int = 5
+    # Single-shot writer->lint->critic loop: how many writer rounds to spend
+    # repairing lint failures / acting on critic fixes before giving up. 1 = the
+    # old one-shot behaviour. Env-overridable so the Stop-hook loop can tune it.
+    writer_loop_rounds: int = field(
+        default_factory=lambda: max(1, int(os.getenv("CUE_WRITER_LOOP_ROUNDS", "2")))
+    )
 
     # LLM configuration (provider inferred from the string prefix by LiteLLM)
     optimizer_model: str = _DEFAULT_OPTIMIZER_MODEL
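
Reviewer note on `writer_loop_rounds`: as written, the `default_factory` calls `int()` directly on the environment value, so a non-integer `CUE_WRITER_LOOP_ROUNDS` (for example `two`) raises `ValueError` while the config is being built. Since this runs from a Stop hook, a tolerant parse may be preferable. A minimal sketch, assuming a fallback to the default is the desired behaviour; `writer_loop_rounds_from_env` is a hypothetical helper, not part of this PR:

```python
import os


def writer_loop_rounds_from_env(default: int = 2) -> int:
    """Parse CUE_WRITER_LOOP_ROUNDS, falling back to `default` on bad input.

    Hypothetical helper: unset, empty, or non-integer values yield the
    default, and the result is clamped to at least 1 (the old one-shot
    behaviour), matching the max(1, ...) in the original field.
    """
    raw = os.getenv("CUE_WRITER_LOOP_ROUNDS", "").strip()
    try:
        value = int(raw) if raw else default
    except ValueError:
        value = default  # e.g. "two" or "2.5": ignore rather than crash the hook
    return max(1, value)
```

It could be wired in as `field(default_factory=writer_loop_rounds_from_env)`; whether a silent fallback or a logged warning is the better choice is left to the author.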

diff --git a/evolution/evolution/skills/evolve_skill.py b/evolution/evolution/skills/evolve_skill.py
@@ -145,6 +145,40 @@ def _finalize(config, skill_id, skill, skill_path, evolved_body, candidate_ok,
     return 0
 
 
+def _representative_task(config, skill_id: str) -> str:
+    """The most recent real user prompt that triggered a skill_gap for this skill,
+    mined from ~/.config/cue/analytics.jsonl (DSPy-free, stdlib only).
+
+    This is what grounds the critic in GENUINE Claude Code usage: the writer's
+    rewrite is judged by how it behaves on the very task that exposed the gap.
+    Returns "" when there's no usable history (fresh machine, or the gap carried
+    no first_prompt) — the critic then falls back to text-only review.
+    """
+    path = config.analytics_log
+    if not path.exists():
+        return ""
+    best_ts, best_prompt = "", ""
+    try:
+        with open(path, encoding="utf-8") as f:
+            for line in f:
+                if '"skill_gap"' not in line:
+                    continue
+                try:
+                    ev = json.loads(line)
+                except json.JSONDecodeError:
+                    continue
+                if ev.get("event") != "skill_gap" or ev.get("skill") != skill_id:
+                    continue
+                fp = (ev.get("first_prompt") or "").strip()
+                ts = str(ev.get("ts") or "")  # a null or numeric ts must not break the comparison
+                # ISO-8601 ts strings (same format and offset) sort lexicographically by recency.
+                if fp and ts >= best_ts:
+                    best_ts, best_prompt = ts, fp
+    except OSError:
+        return ""
+    return best_prompt
+
+
 def evolve(
     skill_id: str,
     iterations: int = 10,
@@ -214,8 +248,8 @@ def evolve(
         console.print("\n[bold green]DRY RUN — cue wiring validated.[/bold green]")
         console.print(f"  Optimizer: {optimizer}")
         if optimizer == "single-shot":
-            console.print(f"  Would propose an improved body in 1 `claude -p` call "
-                          f"({claude_or_model(config)}), no DSPy/dataset")
+            console.print(f"  Would run a writer→lint→critic loop (≤{config.writer_loop_rounds} "
+                          f"round(s)) of `claude -p` calls ({claude_or_model(config)}), no DSPy/dataset")
         else:
             console.print(f"  Would build eval dataset (source: {eval_source})")
             console.print(f"  Would run GEPA ({iterations} iters, optimizer={config.optimizer_model})")
@@ -226,39 +260,61 @@ def evolve(
         console.print(f"  Backups + log → {config.evolution_log}")
         return 0
 
-    # ── Single-shot optimizer: one claude -p call, no DSPy, no dataset, no key ──
+    # ── Single-shot optimizer: a short writer→lint→critic loop of claude -p ──
+    #    calls, no DSPy, no dataset, no key. This is the path the Stop-hook loop
+    #    runs. The writer proposes a body, the cue gate lints it, an INDEPENDENT
+    #    critic (reviewer_model, not the writer) judges it, and the writer retries
+    #    with the lint errors + critic fixes until BETTER or the round budget ends.
     if optimizer == "single-shot":
-        from evolution.skills.reflective import propose_improved_body, judge_is_better
-        console.print(f"\n[bold cyan]Single-shot reflective improve[/bold cyan] "
-                      f"({claude_or_model(config)})...")
-        evolved_body = propose_improved_body(skill, config)
-        evolved_full = reassemble_skill(skill["frontmatter"], evolved_body)
-        console.print("\n[bold]Candidate constraints[/bold]")
-        candidate_results = validator.validate_all(
-            evolved_body, evolved_full, baseline_body=skill["body"])
-        candidate_ok = _print_constraints(candidate_results)
-
-        # Quality gate: an INDEPENDENT reviewer (config.reviewer_model, not the
-        # proposer) judges evolved vs baseline, fed the deterministic gate results
-        # as evidence. Auto-apply only on a BETTER verdict (skip the call in
-        # propose-only or when nothing changed — then it can't apply anyway).
-        quality_ok, judge_reason = None, ""
-        changed = evolved_body.strip() != skill["body"].strip()
-        if not propose_only and candidate_ok and changed:
+        from evolution.skills.reflective import writer_critic_loop
+        console.print(f"\n[bold cyan]Writer→lint→critic loop[/bold cyan] "
+                      f"(≤{config.writer_loop_rounds} round(s), {claude_or_model(config)} writer / "
+                      f"{claude_model_name(config.reviewer_model)} critic)...")
+
+        def _validate(body: str) -> dict:
+            """Run the full cue constraint gate on a candidate body and package
+            the result for the loop: pass/fail, the per-constraint evidence the
+            critic reads, and the failing-constraint messages the writer repairs."""
+            full = reassemble_skill(skill["frontmatter"], body)
+            results = validator.validate_all(body, full, baseline_body=skill["body"])
+            ok = all(c.passed for c in results)
             evidence = "; ".join(f"{c.constraint_name}: {'pass' if c.passed else 'FAIL'}"
-                                 for c in candidate_results)
-            console.print(f"[bold]Independent review[/bold] ({claude_model_name(config.reviewer_model)}, "
-                          f"evolved vs baseline)...")
-            quality_ok, judge_reason = judge_is_better(
-                skill, evolved_body, config, evidence=evidence)
-            console.print(f"  {'✓' if quality_ok else '✗'} {judge_reason}")
+                                 for c in results)
+            lint_errors = "; ".join(f"{c.constraint_name}: {c.message}"
+                                    for c in results if not c.passed)
+            return {"ok": ok, "results": results, "evidence": evidence, "lint_errors": lint_errors}
+
+        # Ground the critic in real usage: the most recent task that flagged this
+        # skill as a gap (mined from analytics.jsonl). "" → text-only review.
+        task_input = _representative_task(config, skill_id)
+        if task_input:
+            console.print(f"  [dim]grounding critic on a real mined task "
+                          f"({len(task_input)} chars)[/dim]")
+
+        # The loop is propose-only-agnostic: it always iterates writer→critic to
+        # produce the best proposal; _finalize (below) is what refuses to APPLY
+        # when propose_only is set. So propose_only is NOT passed to the loop.
+        loop = writer_critic_loop(
+            skill, config, validate_fn=_validate, max_rounds=config.writer_loop_rounds,
+            task_input=task_input, console=console)
+        evolved_body = loop["body"]
+        console.print("\n[bold]Final candidate constraints[/bold]")
+        candidate_ok = (_print_constraints(loop["results"]) if loop["results"] is not None
+                        else False)
+
+        # The loop always returns an explicit quality_ok bool from the critic
+        # (or False when nothing changed / no candidate), so _finalize never falls
+        # through to its no-judge branch. In propose-only nothing is applied
+        # regardless, but the verdict is still logged for review.
+        quality_ok = loop["quality_ok"]
+        judge_reason = loop["judge_reason"]
 
         return _finalize(
             config, skill_id, skill, skill_path, evolved_body, candidate_ok, improvement=None,
             quality_ok=quality_ok, propose_only=propose_only,
-            extra_meta={"optimizer": "single-shot", "optimizer_model": config.optimizer_model,
-                        "baseline_size": len(skill["body"]), "evolved_size": len(evolved_body),
-                        "judge": judge_reason})
+            extra_meta={"optimizer": "writer-loop", "optimizer_model": config.optimizer_model,
+                        "rounds": loop["rounds"], "baseline_size": len(skill["body"]),
+                        "evolved_size": len(evolved_body), "judge": judge_reason})
 
     # ── Heavy imports happen ONLY for a real GEPA run ────────────────────
     try: