feat: real subagent-driven skill self-evolution (Hermes-style), wired into core#61
Merged
Conversation
added 5 commits
June 13, 2026 15:01
The automated Stop-hook evolver runs the single-shot optimizer, which did one claude -p rewrite plus one text-diff judge with no iteration. Replace it with a bounded writer→lint→critic loop (reflective.writer_critic_loop): the writer retries with the cue lint errors and the critic's concrete fixes as feedback until the candidate passes cue lint-skill and an independent critic (a different, stronger model) judges it BETTER, or the round budget runs out. - reflective.py: writer_step (lint-retry), critic_step (returns suggested_fixes), writer_critic_loop; propose_improved_body/judge_is_better kept as thin wrappers so existing callers and the hooks keep working. - evolve_skill.py: single-shot branch drives the loop; logs optimizer:"writer-loop". - config.py: writer_loop_rounds knob (CUE_WRITER_LOOP_ROUNDS, default 2). - fail-soft: a claude -p outage degrades to a proposal, never crashes the run. - tests: 8 DSPy-free seam tests covering retry, fix-feedback, and outage paths.
The critic judged only the text diff. Now, when history exists, the loop runs the candidate skill body as instructions on the most recent real user prompt that flagged this skill as a gap (mined from analytics.jsonl, DSPy-free) and feeds that transcript to the critic — so the rewrite is judged on real behaviour, not prose. This is the "score by running through a real Claude Code subagent on a mined task" signal, delivered for the path the Stop-hook loop actually runs (single-shot). - reflective.py: run_skill_on_task (DSPy-free, fail-soft), critic_step gains a task_demo arg + DEMO block, writer_critic_loop gains task_input. - evolve_skill.py: _representative_task mines the triggering prompt; single-shot branch grounds the critic with it. - tests: +5 (grounding, mining recency, outage fall-back). GEPA judge-default + SubagentJudgeMetric (2b/2c) deferred: manual-only path, not live-verifiable here (dspy broken). Change-points documented in the plan.
Wire the two self-learner Stop hooks into the core profile so every profile can learn, not just the skill-writer pilot: - profile-self-improve.json (CAPTURE: skill_gap signals + optional live critic) - auto-evolve.json (ACT: evolve the most-flagged skill, propose-only) Both are Stop-only and exit 0 instantly when the flag files are absent, so the promotion adds ZERO runtime cost until a user opts in with .auto-improve-enabled (+ telemetry consent) and .auto-evolve-enabled. Deliberately NOT promoting learnings-surface.json — it's a SessionStart context injection with per-session token cost (separate decision; conflicts with recent token-reduction work). Also fix the auto-evolve.sh portability gap: it hard-coded ~/Documents/cue/evolution as the package dir, so any other checkout silently no-opped. Now it self-locates from the materialized hook symlink (python3 realpath → repo root → /evolution), with CUE_EVOLUTION_DIR override and the old default as last resort. Verified: cue validate core → schema-valid, hooks: 20 resolved; flags-OFF = silent exit 0; symlink-resolved EVO_DIR finds bin/auto-evolve.
…op tests Independent review caught a CRITICAL: evolve()'s single-shot branch passed propose_only= to writer_critic_loop, which has no such parameter — a TypeError on every real (non-dry-run) single-shot evolve. The dry-run returns before that call site and the unit tests called the loop directly, so both missed it. - evolve_skill.py: remove propose_only= from the writer_critic_loop call (the loop is propose-only-agnostic; _finalize enforces propose-only). - reflective.py: log "lint failed; not judged" instead of an empty judge_reason when a candidate never passes lint — clearer evolution-log entry. - tests: add an evolve() single-shot INTEGRATION test (mocks claude -p + the cue gate) exercising the real call site to catch signature drift; plus EQUAL→BETTER retry, all-rounds-lint-fail, and max_rounds=1 budget coverage. Full suite: 95 passed / 2 skipped.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What & why
cue already ports NousResearch/hermes-agent-self-evolution into
evolution/, but the loop was dormant: default-OFF, pilot-wired intoskill-writeronly, never run, and the automated path did a singleclaude -prewrite + one text-diff judge with no iteration. This makes it real: a true writer→lint→critic loop, grounded in real usage, wired intocore(still default-OFF).What shipped (3 of the planned levers)
1. Writer→lint→critic loop (
reflective.py,evolve_skill.py,config.py)The single-shot optimizer — the path the Stop-hook loop actually runs — now iterates: the writer proposes a body, the cue gate lints it, an independent critic (a different/stronger model) judges it, and the writer retries with the lint errors and the critic's concrete fixes until BETTER or the round budget runs out (
CUE_WRITER_LOOP_ROUNDS, default 2).propose_improved_body/judge_is_betterkept as back-compat wrappers. Fail-soft: aclaude -poutage degrades to a proposal, never crashes the unattended run.2a. Task-grounded critic (
reflective.py,evolve_skill.py)When history exists, the loop runs the candidate skill as instructions on the most recent real user prompt that flagged this skill as a gap (mined from
analytics.jsonl, DSPy-free) and feeds that transcript to the critic — so the rewrite is judged on real behaviour, not just prose. Best-effort: no history → text-only review.3. Promote to
core, default-OFF (profiles/core/profile.yaml,auto-evolve.sh)Both self-learner Stop hooks (
profile-self-improve.jsoncapture +auto-evolve.jsonact) move from the skill-writer pilot intocore. They exit 0 instantly when the flag files are absent, so this adds zero runtime cost until opt-in. Also fixes theauto-evolve.shportability gap (it hard-coded~/Documents/cue/evolution; now self-locates from the materialized hook symlink,CUE_EVOLUTION_DIRoverride preserved).Deferred (documented in
docs/self-evolution-upgrade-plan.md)2b/2c — GEPA
judgedefault +SubagentJudgeMetric. Cut on purpose: they touch only the manual GEPA path (the automated loop runssingle-shot, never GEPA), and they can't be live-verified in the dev env (dspyimport is broken —libstdc++.so.6). A naive default-flip would also putLLMJudgein GEPA's inner loop (~1200 calls/run). Exact change-points are recorded for a follow-up oncedspyworks. Stage 2a already delivers "judge real behaviour on mined tasks" for the path that runs.How to enable (opt-in, propose-only)
Test plan
evolution/tests: 95 passed / 2 skipped (skips = dspy-only seams; dspy broken in dev env). 17 new writer-loop/grounding tests incl. anevolve()single-shot integration test that exercises the real call site.src/lib/runtime-materializer.test.ts: 40 passed / 0 fail (hook materialization unaffected).cue validate core: schema-valid, 20 hooks resolved; flags-OFF = silent exit 0.propose_onlykwarg TypeError at the real call site), HIGH/LOW test gaps closed.validate/e2edo slow submodule/registry resolution).🤖 Generated with Claude Code