Merged
74 changes: 74 additions & 0 deletions docs/self-evolution-upgrade-plan.md
@@ -0,0 +1,74 @@
# Plan: Hermes-style skill self-evolution — full upgrade

> Status: **in progress** on branch `feat/skill-evolution-upgrade` (worktree-isolated).
> Builds on the existing `evolution/` package (ported from NousResearch/hermes-agent-self-evolution)
> and the `docs/self-learner.md` Stop-hook loop. The loop already exists; this makes it
> real, well-judged, and wired into `core` — default-OFF, propose-only.

## Why this work

The Hermes-ported evolver is complete but **dormant**: default-OFF, pilot-wired into
`skill-writer` only, never run (0 `skill_gap` events on the dev machine), and carrying two
documented quality caveats (keyword-overlap metric by default; eval scores a synthetic proxy,
not real Claude Code behaviour). The **automated** path uses the `single-shot` optimizer
(`auto_evolve.py` → `evolve(optimizer="single-shot")`), *not* GEPA — so a one-shot rewrite +
one text-diff judge is all that runs today.

## Resolved decisions

- **D1 — Writer-critic loop, default 2 rounds** (`CUE_WRITER_LOOP_ROUNDS`). Retry on `WORSE`
always; on `EQUAL` only when the critic returned actionable fixes and rounds remain.
`propose_improved_body`/`judge_is_better` stay as back-compat wrappers (hooks import by name).
- **D2 — Task-grounded critic (DSPy-free):** critic runs ONE mined task through the candidate
skill via `run_claude_p`, then judges the real transcript with the existing `run_claude_p`
judge prompt. Falls back softly to text-diff review (for example, when no task can be mined).
Costs one subagent call per round, not the 20×/iteration cost bomb.
- **D3 — Judge defaults (GEPA/holdout path):** acceptance/holdout metric default `overlap → judge`
(LLMJudge); GEPA *inner* metric stays `overlap` (cost), `judge` opt-in; new `--metric subagent`
is holdout-only, cost-flagged, soft-fallback. `--eval-source` default `synthetic → auto`.
- **D4 — Activation:** wire `profile-self-improve.json` + `auto-evolve.json` + `learnings-surface.json`
into `core`, default-OFF behind the flag files; fix the `CUE_EVOLUTION_DIR` portability gap.
Enabling on a machine is a separate explicit step, propose-only.
- **D5 — propose-only everywhere.** No auto-apply enabled by this work.
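The D1 retry rule can be sketched as a small pure function. This is a minimal illustration, not the code in `reflective.py`: the name `should_retry` and its parameters are hypothetical, and it assumes that `BETTER` (the verdict named in the code comments) ends the loop and that a `WORSE` retry also needs a round left in the budget.

```python
def should_retry(verdict: str, has_actionable_fixes: bool,
                 round_no: int, max_rounds: int = 2) -> bool:
    """Decide whether the writer gets another round (hypothetical helper).

    round_no is 1-based and refers to the round that just finished.
    """
    if round_no >= max_rounds:
        return False                 # round budget is spent
    if verdict == "WORSE":
        return True                  # always retry a regression
    if verdict == "EQUAL":
        return has_actionable_fixes  # retry only when the critic gave fixes
    return False                     # BETTER (or anything else): stop
```

With the default budget of 2 rounds, a `WORSE` verdict after round 1 triggers one more attempt, while any verdict after round 2 stops the loop.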

## Stages (each independently verifiable + revertable)

| Stage | Work | Verify |
|---|---|---|
| 0 | Worktree + baseline | tests green (78p/2s) + `auto_evolve --dry-run` ✅ |
| 1 | Writer-critic loop (`reflective.py`, `evolve_skill.py`, `config.py`) | retry-logic unit test; `evolve <skill> --propose-only` → lint-passing proposal logged `optimizer:"writer-loop"` |
| 2a | **Task-grounded critic** (`reflective.py`, `evolve_skill.py`) — critic runs the candidate on a real mined task (`run_claude_p`) and judges the transcript | ✅ done: grounding + mining unit tests; loop feeds the critic a real transcript |
| 2b/2c | GEPA `judge` default + `SubagentJudgeMetric` (holdout) | **deferred** — see below |
| 3 | Activate into `core` (`profiles/core/profile.yaml`, `auto-evolve.sh`) | `cue validate` clean; materialized `settings.json` shows both Stop hooks; flags-OFF = no-op |
| 4 | Review + ship | no CRITICAL/HIGH; full suite green; gated PR |

## Deferred: Stage 2b/2c (GEPA judge default + subagent holdout metric)

Cut from this pass on purpose: they only touch the **manual GEPA** path (the
automated Stop-hook loop runs `single-shot`, never GEPA), and they **cannot be
live-verified in this environment** (`dspy` import is broken). Stage 2a already
delivers the spirit of "default to the LLM judge on real behaviour" for the path
that actually runs. Ready-to-execute change-points for when `dspy` works:

- **2b — default the holdout/acceptance metric to `judge`.** Do NOT naively flip
`evolve_skill.py:~205` `metric_mode` default `overlap → judge`: that puts
`LLMJudge` in GEPA's *inner* loop (~`max_metric_calls` calls/run — a cost bomb).
Instead split it: keep GEPA's inner `fitness_metric` on `overlap`, and build a
separate `holdout_metric = make_judge_metric(config, skill_text=...)`
(`CUE_EVOLVE_HOLDOUT_METRIC`, default `judge`, soft-fallback) used only in the
holdout loop at `evolve_skill.py:~410-413`.
- **2c — `SubagentJudgeMetric`** (`fitness.py`, beside `make_judge_metric`): a
metric that runs the candidate through `run_claude_p` on a holdout example and
feeds the transcript to `LLMJudge.score()`. Holdout-ONLY (one subprocess/example
≈120s); never pass it as the GEPA inner metric. Soft-fallback to overlap when
`claude` is absent. `--eval-source` default `synthetic → auto` (sessiondb then
synthetic) at `evolve_skill.py:~155`.
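The inner/holdout split described in 2b could look roughly like the sketch below. It is an assumption-laden illustration, not the repo's code: `overlap_score`, `pick_holdout_metric`, and `judge_factory` are hypothetical names, the overlap metric is a simplified stand-in, and only the `CUE_EVOLVE_HOLDOUT_METRIC` variable and its `judge` default come from the plan.

```python
import os
from typing import Callable, Optional

Metric = Callable[[str, str], float]  # (expected, actual) -> score in [0, 1]


def overlap_score(expected: str, actual: str) -> float:
    """Stand-in keyword-overlap metric: fraction of expected words present."""
    want = set(expected.lower().split())
    if not want:
        return 0.0
    got = set(actual.lower().split())
    return len(want & got) / len(want)


def pick_holdout_metric(judge_factory: Optional[Callable[[], Metric]] = None) -> Metric:
    """Choose the holdout metric only; the inner metric stays on overlap."""
    mode = os.getenv("CUE_EVOLVE_HOLDOUT_METRIC", "judge")
    if mode == "judge" and judge_factory is not None:
        try:
            return judge_factory()
        except Exception:
            # Soft fallback: a judge that cannot be built must not abort the run.
            pass
    return overlap_score


inner_metric = overlap_score            # cheap, called many times per run
holdout_metric = pick_holdout_metric()  # no judge wired here, so overlap
```

Keeping the two metrics as separate values is the point of the design: the expensive judge is only ever reachable from the holdout loop, so it cannot leak into the optimizer's inner loop by a default flip.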

## Environment notes

- `evolution/.venv` is gitignored → absent in a worktree. Run edited source via the main venv:
`PYTHONPATH=<worktree>/evolution /home/deadpool/Documents/cue/evolution/.venv/bin/python -m ...`
(PYTHONPATH shadows the editable main install — verified).
- `dspy` import is broken in this env (`libstdc++.so.6` missing), so live GEPA/LLMJudge can't run
here. Stage 1 is DSPy-free and fully verifiable; Stage 2 relies on the repo's existing dspy-mock
seam tests. Fixing dspy = a system-lib install (out of scope, network/sudo).
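A quick way to check whether an environment can run the GEPA path is to probe the import before starting. The helper below is a minimal sketch; `can_import` is a hypothetical name and is not part of the `evolution/` package.

```python
import importlib


def can_import(module_name: str) -> bool:
    """Return True if the module imports cleanly in this interpreter."""
    try:
        importlib.import_module(module_name)
    except (ImportError, OSError):
        # ImportError covers a missing package or a native extension that
        # fails to load (such as a missing shared library); OSError is caught
        # as a precaution for lower-level loader errors.
        return False
    return True


if __name__ == "__main__":
    print("dspy importable:", can_import("dspy"))
```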
6 changes: 6 additions & 0 deletions evolution/evolution/core/config.py
@@ -45,6 +45,12 @@ class CueEvolutionConfig:
     # Optimization parameters
     iterations: int = 10
     population_size: int = 5
+    # Single-shot writer->lint->critic loop: how many writer rounds to spend
+    # repairing lint failures / acting on critic fixes before giving up. 1 = the
+    # old one-shot behaviour. Env-overridable so the Stop-hook loop can tune it.
+    writer_loop_rounds: int = field(
+        default_factory=lambda: max(1, int(os.getenv("CUE_WRITER_LOOP_ROUNDS", "2")))
+    )

     # LLM configuration (provider inferred from the string prefix by LiteLLM)
     optimizer_model: str = _DEFAULT_OPTIMIZER_MODEL
114 changes: 85 additions & 29 deletions evolution/evolution/skills/evolve_skill.py
@@ -145,6 +145,40 @@ def _finalize(config, skill_id, skill, skill_path, evolved_body, candidate_ok,
     return 0


+def _representative_task(config, skill_id: str) -> str:
+    """The most recent real user prompt that triggered a skill_gap for this skill,
+    mined from ~/.config/cue/analytics.jsonl (DSPy-free, stdlib only).
+
+    This is what grounds the critic in GENUINE Claude Code usage: the writer's
+    rewrite is judged by how it behaves on the very task that exposed the gap.
+    Returns "" when there's no usable history (fresh machine, or the gap carried
+    no first_prompt) — the critic then falls back to text-only review.
+    """
+    path = config.analytics_log
+    if not path.exists():
+        return ""
+    best_ts, best_prompt = "", ""
+    try:
+        with open(path, encoding="utf-8") as f:
+            for line in f:
+                if '"skill_gap"' not in line:
+                    continue
+                try:
+                    ev = json.loads(line)
+                except json.JSONDecodeError:
+                    continue
+                if ev.get("event") != "skill_gap" or ev.get("skill") != skill_id:
+                    continue
+                fp = (ev.get("first_prompt") or "").strip()
+                ts = ev.get("ts", "")
+                # ISO-8601 ts strings sort lexicographically by recency.
+                if fp and ts >= best_ts:
+                    best_ts, best_prompt = ts, fp
+    except OSError:
+        return ""
+    return best_prompt
+
+
 def evolve(
     skill_id: str,
     iterations: int = 10,
@@ -214,8 +248,8 @@ def evolve(
         console.print("\n[bold green]DRY RUN — cue wiring validated.[/bold green]")
         console.print(f" Optimizer: {optimizer}")
         if optimizer == "single-shot":
-            console.print(f" Would propose an improved body in 1 `claude -p` call "
-                          f"({claude_or_model(config)}), no DSPy/dataset")
+            console.print(f" Would run a writer→lint→critic loop (≤{config.writer_loop_rounds} "
+                          f"round(s)) of `claude -p` calls ({claude_or_model(config)}), no DSPy/dataset")
         else:
             console.print(f" Would build eval dataset (source: {eval_source})")
             console.print(f" Would run GEPA ({iterations} iters, optimizer={config.optimizer_model})")
@@ -226,39 +260,61 @@ def evolve(
         console.print(f" Backups + log → {config.evolution_log}")
         return 0

-    # ── Single-shot optimizer: one claude -p call, no DSPy, no dataset, no key ──
+    # ── Single-shot optimizer: a short writer→lint→critic loop of claude -p ──
+    # calls, no DSPy, no dataset, no key. This is the path the Stop-hook loop
+    # runs. The writer proposes a body, the cue gate lints it, an INDEPENDENT
+    # critic (reviewer_model, not the writer) judges it, and the writer retries
+    # with the lint errors + critic fixes until BETTER or the round budget ends.
     if optimizer == "single-shot":
-        from evolution.skills.reflective import propose_improved_body, judge_is_better
-        console.print(f"\n[bold cyan]Single-shot reflective improve[/bold cyan] "
-                      f"({claude_or_model(config)})...")
-        evolved_body = propose_improved_body(skill, config)
-        evolved_full = reassemble_skill(skill["frontmatter"], evolved_body)
-        console.print("\n[bold]Candidate constraints[/bold]")
-        candidate_results = validator.validate_all(
-            evolved_body, evolved_full, baseline_body=skill["body"])
-        candidate_ok = _print_constraints(candidate_results)
-
-        # Quality gate: an INDEPENDENT reviewer (config.reviewer_model, not the
-        # proposer) judges evolved vs baseline, fed the deterministic gate results
-        # as evidence. Auto-apply only on a BETTER verdict (skip the call in
-        # propose-only or when nothing changed — then it can't apply anyway).
-        quality_ok, judge_reason = None, ""
-        changed = evolved_body.strip() != skill["body"].strip()
-        if not propose_only and candidate_ok and changed:
+        from evolution.skills.reflective import writer_critic_loop
+        console.print(f"\n[bold cyan]Writer→lint→critic loop[/bold cyan] "
+                      f"(≤{config.writer_loop_rounds} round(s), {claude_or_model(config)} writer / "
+                      f"{claude_model_name(config.reviewer_model)} critic)...")
+
+        def _validate(body: str) -> dict:
+            """Run the full cue constraint gate on a candidate body and package
+            the result for the loop: pass/fail, the per-constraint evidence the
+            critic reads, and the failing-constraint messages the writer repairs."""
+            full = reassemble_skill(skill["frontmatter"], body)
+            results = validator.validate_all(body, full, baseline_body=skill["body"])
+            ok = all(c.passed for c in results)
             evidence = "; ".join(f"{c.constraint_name}: {'pass' if c.passed else 'FAIL'}"
-                                 for c in candidate_results)
-            console.print(f"[bold]Independent review[/bold] ({claude_model_name(config.reviewer_model)}, "
-                          f"evolved vs baseline)...")
-            quality_ok, judge_reason = judge_is_better(
-                skill, evolved_body, config, evidence=evidence)
-            console.print(f" {'✓' if quality_ok else '✗'} {judge_reason}")
+                                 for c in results)
+            lint_errors = "; ".join(f"{c.constraint_name}: {c.message}"
+                                    for c in results if not c.passed)
+            return {"ok": ok, "results": results, "evidence": evidence, "lint_errors": lint_errors}
+
+        # Ground the critic in real usage: the most recent task that flagged this
+        # skill as a gap (mined from analytics.jsonl). "" → text-only review.
+        task_input = _representative_task(config, skill_id)
+        if task_input:
+            console.print(f" [dim]grounding critic on a real mined task "
+                          f"({len(task_input)} chars)[/dim]")
+
+        # The loop is propose-only-agnostic: it always iterates writer→critic to
+        # produce the best proposal; _finalize (below) is what refuses to APPLY
+        # when propose_only is set. So propose_only is NOT passed to the loop.
+        loop = writer_critic_loop(
+            skill, config, validate_fn=_validate, max_rounds=config.writer_loop_rounds,
+            task_input=task_input, console=console)
+        evolved_body = loop["body"]
+        console.print("\n[bold]Final candidate constraints[/bold]")
+        candidate_ok = (_print_constraints(loop["results"]) if loop["results"] is not None
+                        else False)
+
+        # The loop always returns an explicit quality_ok bool from the critic
+        # (or False when nothing changed / no candidate), so _finalize never falls
+        # through to its no-judge branch. In propose-only nothing is applied
+        # regardless, but the verdict is still logged for review.
+        quality_ok = loop["quality_ok"]
+        judge_reason = loop["judge_reason"]

         return _finalize(
             config, skill_id, skill, skill_path, evolved_body, candidate_ok, improvement=None,
             quality_ok=quality_ok, propose_only=propose_only,
-            extra_meta={"optimizer": "single-shot", "optimizer_model": config.optimizer_model,
-                        "baseline_size": len(skill["body"]), "evolved_size": len(evolved_body),
-                        "judge": judge_reason})
+            extra_meta={"optimizer": "writer-loop", "optimizer_model": config.optimizer_model,
+                        "rounds": loop["rounds"], "baseline_size": len(skill["body"]),
+                        "evolved_size": len(evolved_body), "judge": judge_reason})

     # ── Heavy imports happen ONLY for a real GEPA run ────────────────────
     try: