feat: real subagent-driven skill self-evolution (Hermes-style), wired into core #61

Merged
NagyVikt merged 5 commits into main from feat/skill-evolution-upgrade on Jun 13, 2026

What & why

cue already ports NousResearch/hermes-agent-self-evolution into evolution/, but the loop was dormant: it was default-OFF, pilot-wired into skill-writer only, and had never been run, and the automated path did a single claude -p rewrite plus one text-diff judge with no iteration. This PR makes it real: a true writer→lint→critic loop, grounded in real usage and wired into core (still default-OFF).

What shipped (3 of the planned levers)

1. Writer→lint→critic loop (reflective.py, evolve_skill.py, config.py)
The single-shot optimizer — the path the Stop-hook loop actually runs — now iterates: the writer proposes a body, the cue gate lints it, an independent critic (a different/stronger model) judges it, and the writer retries with the lint errors and the critic's concrete fixes until BETTER or the round budget runs out (CUE_WRITER_LOOP_ROUNDS, default 2). propose_improved_body/judge_is_better kept as back-compat wrappers. Fail-soft: a claude -p outage degrades to a proposal, never crashes the unattended run.
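The control flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in reflective.py: the function name, the LoopResult type, and the three injected callables (write, lint, critique) are hypothetical stand-ins for the claude -p writer, the cue lint gate, and the critic model.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class LoopResult:
    body: Optional[str]   # last candidate produced, if any
    accepted: bool        # True only if lint passed and the critic said BETTER
    rounds_used: int
    log: List[str] = field(default_factory=list)


def writer_critic_loop_sketch(
    current_body: str,
    write: Callable[[str, List[str]], str],               # (body, feedback) -> candidate
    lint: Callable[[str], List[str]],                     # candidate -> lint errors
    critique: Callable[[str, str], Tuple[str, List[str]]],  # (old, new) -> (verdict, fixes)
    max_rounds: int = 2,                                  # mirrors CUE_WRITER_LOOP_ROUNDS
) -> LoopResult:
    feedback: List[str] = []
    candidate: Optional[str] = None
    log: List[str] = []
    for round_no in range(1, max_rounds + 1):
        try:
            candidate = write(current_body, feedback)
        except Exception as exc:
            # Fail-soft: a writer outage ends the loop without raising.
            log.append(f"round {round_no}: writer unavailable ({exc})")
            return LoopResult(candidate, False, round_no, log)
        errors = lint(candidate)
        if errors:
            # A candidate that fails lint is never shown to the critic.
            log.append(f"round {round_no}: lint failed; not judged")
            feedback = list(errors)
            continue
        try:
            verdict, fixes = critique(current_body, candidate)
        except Exception as exc:
            log.append(f"round {round_no}: critic unavailable ({exc})")
            return LoopResult(candidate, False, round_no, log)
        log.append(f"round {round_no}: critic said {verdict}")
        if verdict == "BETTER":
            return LoopResult(candidate, True, round_no, log)
        # Otherwise retry with the critic's concrete fixes as feedback.
        feedback = list(fixes)
    return LoopResult(candidate, False, max_rounds, log)
```

Passing the three steps in as callables is only a convenience for the sketch; it lets the retry and outage paths be exercised without a model call.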

2a. Task-grounded critic (reflective.py, evolve_skill.py)
When history exists, the loop runs the candidate skill as instructions on the most recent real user prompt that flagged this skill as a gap (mined from analytics.jsonl, DSPy-free) and feeds that transcript to the critic — so the rewrite is judged on real behaviour, not just prose. Best-effort: no history → text-only review.
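As an illustration of the mining step, the sketch below picks the most recent prompt that flagged a given skill from a JSON Lines file. The record schema is an assumption made for this example (fields event, skill, prompt, ts); the real layout of analytics.jsonl and the logic in _representative_task may differ.

```python
import json
from pathlib import Path
from typing import Optional


def most_recent_gap_prompt(log_path: Path, skill: str) -> Optional[str]:
    """Return the newest prompt that flagged `skill`, or None if there is no history."""
    try:
        lines = log_path.read_text(encoding="utf-8").splitlines()
    except OSError:
        return None  # best-effort: no log means text-only review
    best_ts = None
    best_prompt = None
    for line in lines:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip corrupt lines rather than fail the run
        if not isinstance(rec, dict):
            continue
        # Assumed schema: {"event": "skill_gap", "skill": ..., "prompt": ..., "ts": ...}
        if rec.get("event") != "skill_gap" or rec.get("skill") != skill:
            continue
        prompt, ts = rec.get("prompt"), rec.get("ts")
        if not isinstance(prompt, str) or not prompt.strip():
            continue
        if not isinstance(ts, (int, float)):
            continue
        if best_ts is None or ts >= best_ts:
            best_ts, best_prompt = ts, prompt
    return best_prompt
```

A None result corresponds to the "no history" case, where the critic falls back to a text-only review.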

3. Promote to core, default-OFF (profiles/core/profile.yaml, auto-evolve.sh)
Both self-learner Stop hooks (profile-self-improve.json capture + auto-evolve.json act) move from the skill-writer pilot into core. They exit 0 instantly when the flag files are absent, so this adds zero runtime cost until opt-in. Also fixes the auto-evolve.sh portability gap (it hard-coded ~/Documents/cue/evolution; now self-locates from the materialized hook symlink, CUE_EVOLUTION_DIR override preserved).
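The self-location order can be sketched as below. The actual fix lives in auto-evolve.sh; this Python version only illustrates the lookup order (override, then symlink resolution, then the old default). The function name is hypothetical, and walking up the parents until an evolution/ directory is found is an assumption about how the repo root is located.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_evolution_dir(hook_path: str, env: Optional[Mapping[str, str]] = None) -> Path:
    env = os.environ if env is None else env
    # 1. An explicit override always wins.
    override = env.get("CUE_EVOLUTION_DIR")
    if override:
        return Path(override)
    # 2. Resolve the materialized hook symlink back to the checkout it points into,
    #    then walk upward until a sibling "evolution" directory is found (assumed layout).
    real = Path(os.path.realpath(hook_path))
    for parent in real.parents:
        candidate = parent / "evolution"
        if candidate.is_dir():
            return candidate
    # 3. Last resort: the previously hard-coded default.
    return Path.home() / "Documents" / "cue" / "evolution"
```

Resolving through the symlink is what lets a checkout in any location find its own evolution package.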

Deferred (documented in docs/self-evolution-upgrade-plan.md)

2b/2c — GEPA judge default + SubagentJudgeMetric. Cut on purpose: they touch only the manual GEPA path (the automated loop runs single-shot, never GEPA), and they can't be live-verified in the dev env (dspy import is broken — libstdc++.so.6). A naive default-flip would also put LLMJudge in GEPA's inner loop (~1200 calls/run). Exact change-points are recorded for a follow-up once dspy works. Stage 2a already delivers "judge real behaviour on mined tasks" for the path that runs.

How to enable (opt-in, propose-only)

cue telemetry enable                              # consent (already set on dev machine)
touch ~/.config/cue/.auto-improve-enabled         # capture skill_gap signals
touch ~/.config/cue/.auto-evolve-enabled          # allow evolution (propose-only)
# auto-apply stays OFF unless CUE_AUTO_EVOLVE_APPLY=1; 24h cooldown; canary auto-revert
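One way to read the gating above is as a small decision function. The sketch below is illustrative only: the function name and return values are hypothetical, telemetry consent and canary auto-revert are left out, and it assumes the 24h cooldown gates the whole evolution run, which the description does not spell out.

```python
import time
from pathlib import Path
from typing import Mapping, Optional

COOLDOWN_SECONDS = 24 * 60 * 60  # the 24h cooldown mentioned above


def evolve_mode(
    config_dir: Path,
    env: Mapping[str, str],
    last_run_ts: Optional[float],
    now: Optional[float] = None,
) -> str:
    """Return 'off', 'cooldown', 'propose', or 'apply'."""
    now = time.time() if now is None else now
    # No flag file: the hook should exit 0 immediately and do nothing.
    if not (config_dir / ".auto-evolve-enabled").exists():
        return "off"
    # Assumption: the cooldown blocks any new run within 24h of the last one.
    if last_run_ts is not None and now - last_run_ts < COOLDOWN_SECONDS:
        return "cooldown"
    # Propose-only unless auto-apply is explicitly switched on.
    return "apply" if env.get("CUE_AUTO_EVOLVE_APPLY") == "1" else "propose"
```

The cheap file-existence check comes first so that the flags-OFF path does no other work.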

Test plan

  • evolution/tests: 95 passed / 2 skipped (skips = dspy-only seams; dspy broken in dev env). 17 new writer-loop/grounding tests incl. an evolve() single-shot integration test that exercises the real call site.
  • src/lib/runtime-materializer.test.ts: 40 passed / 0 fail (hook materialization unaffected).
  • cue validate core: schema-valid, 20 hooks resolved; flags-OFF = silent exit 0.
  • Independent adversarial review (4 dimensions → verified): 1 CRITICAL found + fixed (propose_only kwarg TypeError at the real call site), HIGH/LOW test gaps closed.
  • CI is the gate for the full TS suite (won't complete locally — validate/e2e do slow submodule/registry resolution).
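The propose_only TypeError noted in the test plan is a call-site bug that direct unit tests of the loop cannot see. The toy functions below (all hypothetical, not the real evolve() or writer_critic_loop) reproduce the shape of the problem: the dry-run path returns before the bad call, so only a test that goes through the real call site catches it.

```python
def loop(body: str, max_rounds: int = 2) -> str:
    """Stand-in for the writer/critic loop; note there is no propose_only parameter."""
    return body + " (revised)"


def evolve_buggy(body: str, dry_run: bool = False) -> str:
    if dry_run:
        return body  # returns before the call site, so a dry-run never hits the bug
    return loop(body, propose_only=True)  # TypeError: unexpected keyword argument


def evolve_fixed(body: str, dry_run: bool = False) -> str:
    if dry_run:
        return body
    return loop(body)  # the loop stays propose-only-agnostic
```

Calling evolve_buggy("x", dry_run=True) succeeds while evolve_buggy("x") raises TypeError, which is why an integration test that drives the non-dry-run branch is the useful guard against signature drift.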

🤖 Generated with Claude Code

NagyVikt added 5 commits June 13, 2026 15:01
The automated Stop-hook evolver runs the single-shot optimizer, which did one
claude -p rewrite plus one text-diff judge with no iteration. Replace it with a
bounded writer→lint→critic loop (reflective.writer_critic_loop): the writer
retries with the cue lint errors and the critic's concrete fixes as feedback
until the candidate passes cue lint-skill and an independent critic (a different,
stronger model) judges it BETTER, or the round budget runs out.

- reflective.py: writer_step (lint-retry), critic_step (returns suggested_fixes),
  writer_critic_loop; propose_improved_body/judge_is_better kept as thin wrappers
  so existing callers and the hooks keep working.
- evolve_skill.py: single-shot branch drives the loop; logs optimizer:"writer-loop".
- config.py: writer_loop_rounds knob (CUE_WRITER_LOOP_ROUNDS, default 2).
- fail-soft: a claude -p outage degrades to a proposal, never crashes the run.
- tests: 8 DSPy-free seam tests covering retry, fix-feedback, and outage paths.

The critic judged only the text diff. Now, when history exists, the loop runs the
candidate skill body as instructions on the most recent real user prompt that
flagged this skill as a gap (mined from analytics.jsonl, DSPy-free) and feeds that
transcript to the critic — so the rewrite is judged on real behaviour, not prose.
This is the "score by running through a real Claude Code subagent on a mined task"
signal, delivered for the path the Stop-hook loop actually runs (single-shot).

- reflective.py: run_skill_on_task (DSPy-free, fail-soft), critic_step gains a
  task_demo arg + DEMO block, writer_critic_loop gains task_input.
- evolve_skill.py: _representative_task mines the triggering prompt; single-shot
  branch grounds the critic with it.
- tests: +5 (grounding, mining recency, outage fall-back).

GEPA judge-default + SubagentJudgeMetric (2b/2c) deferred: manual-only path,
not live-verifiable here (dspy broken). Change-points documented in the plan.

Wire the two self-learner Stop hooks into the core profile so every profile can
learn, not just the skill-writer pilot:
- profile-self-improve.json (CAPTURE: skill_gap signals + optional live critic)
- auto-evolve.json (ACT: evolve the most-flagged skill, propose-only)

Both are Stop-only and exit 0 instantly when the flag files are absent, so the
promotion adds ZERO runtime cost until a user opts in with
.auto-improve-enabled (+ telemetry consent) and .auto-evolve-enabled.

Deliberately NOT promoting learnings-surface.json — it's a SessionStart context
injection with per-session token cost (separate decision; conflicts with recent
token-reduction work).

Also fix the auto-evolve.sh portability gap: it hard-coded
~/Documents/cue/evolution as the package dir, so any other checkout silently
no-opped. Now it self-locates from the materialized hook symlink (python3
realpath → repo root → /evolution), with CUE_EVOLUTION_DIR override and the old
default as last resort.

Verified: cue validate core → schema-valid, hooks: 20 resolved; flags-OFF =
silent exit 0; symlink-resolved EVO_DIR finds bin/auto-evolve.

…op tests

Independent review caught a CRITICAL: evolve()'s single-shot branch passed
propose_only= to writer_critic_loop, which has no such parameter — a TypeError on
every real (non-dry-run) single-shot evolve. The dry-run returns before that call
site and the unit tests called the loop directly, so both missed it.

- evolve_skill.py: remove propose_only= from the writer_critic_loop call (the loop
  is propose-only-agnostic; _finalize enforces propose-only).
- reflective.py: log "lint failed; not judged" instead of an empty judge_reason
  when a candidate never passes lint — clearer evolution-log entry.
- tests: add an evolve() single-shot INTEGRATION test (mocks claude -p + the cue
  gate) exercising the real call site to catch signature drift; plus EQUAL→BETTER
  retry, all-rounds-lint-fail, and max_rounds=1 budget coverage.

Full suite: 95 passed / 2 skipped.
NagyVikt merged commit 3b209da into main on Jun 13, 2026
3 of 5 checks passed