feat: real subagent-driven skill self-evolution (Hermes-style), wired into core by NagyVikt · Pull Request #61 · opencue/cuecards

NagyVikt · 2026-06-13T13:27:33Z

What & why

cue already ports NousResearch/hermes-agent-self-evolution into evolution/, but the loop was dormant: default-OFF, pilot-wired into skill-writer only, never run, and the automated path did a single claude -p rewrite + one text-diff judge with no iteration. This makes it real: a true writer→lint→critic loop, grounded in real usage, wired into core (still default-OFF).

What shipped (3 of the planned levers)

1. Writer→lint→critic loop (reflective.py, evolve_skill.py, config.py)
The single-shot optimizer — the path the Stop-hook loop actually runs — now iterates: the writer proposes a body, the cue gate lints it, an independent critic (a different/stronger model) judges it, and the writer retries with the lint errors and the critic's concrete fixes until BETTER or the round budget runs out (CUE_WRITER_LOOP_ROUNDS, default 2). propose_improved_body/judge_is_better kept as back-compat wrappers. Fail-soft: a claude -p outage degrades to a proposal, never crashes the unattended run.
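The control flow described above can be sketched as follows. This is a minimal illustration, not the code in reflective.py: the `write`, `lint`, and `critique` callables, the `Verdict` and `LoopResult` types, and the function name `writer_critic_loop_sketch` are all hypothetical stand-ins, and the real loop's signature may differ.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Verdict:
    label: str  # assumed to be "BETTER", "EQUAL", or "WORSE"
    suggested_fixes: List[str] = field(default_factory=list)


@dataclass
class LoopResult:
    body: Optional[str]  # last candidate that passed lint, if any
    accepted: bool       # True only when the critic said BETTER
    rounds_used: int
    reason: str


def writer_critic_loop_sketch(
    current_body: str,
    write: Callable[[str, List[str]], str],   # (body, feedback) -> candidate
    lint: Callable[[str], List[str]],         # candidate -> lint errors
    critique: Callable[[str, str], Verdict],  # (old, new) -> verdict
    max_rounds: int = 2,
) -> LoopResult:
    """Hypothetical bounded writer -> lint -> critic loop."""
    feedback: List[str] = []
    last_linted: Optional[str] = None
    reason = "lint failed; not judged"
    for round_no in range(1, max_rounds + 1):
        candidate = write(current_body, feedback)
        errors = lint(candidate)
        if errors:
            # Lint errors become the writer's feedback; the critic is skipped.
            feedback = errors
            continue
        last_linted = candidate
        verdict = critique(current_body, candidate)
        if verdict.label == "BETTER":
            return LoopResult(candidate, True, round_no, "critic: BETTER")
        # Otherwise retry with the critic's concrete fixes as feedback.
        reason = f"critic: {verdict.label}"
        feedback = verdict.suggested_fixes
    return LoopResult(last_linted, False, max_rounds, reason)
```

Passing the three steps in as callables keeps the sketch testable without any model call; in this sketch a run that never passes lint ends with `body=None` and the reason "lint failed; not judged".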

2a. Task-grounded critic (reflective.py, evolve_skill.py)
When history exists, the loop runs the candidate skill as instructions on the most recent real user prompt that flagged this skill as a gap (mined from analytics.jsonl, DSPy-free) and feeds that transcript to the critic — so the rewrite is judged on real behaviour, not just prose. Best-effort: no history → text-only review.
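Mining "the most recent real user prompt that flagged this skill" could look roughly like the sketch below. The record schema is an assumption: the field names `event`, `skill`, `prompt`, and `ts` (an ISO-8601 string) are hypothetical and may not match what analytics.jsonl actually contains, and `most_recent_gap_prompt` is not the project's `_representative_task`.

```python
import json
from typing import Iterable, Optional


def most_recent_gap_prompt(lines: Iterable[str], skill: str) -> Optional[str]:
    """Return the newest prompt that flagged `skill` as a gap, or None.

    Assumes one JSON object per line with hypothetical keys
    "event", "skill", "prompt", and "ts" (ISO-8601 string).
    """
    best_ts: Optional[str] = None
    best_prompt: Optional[str] = None
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            rec = json.loads(raw)
        except json.JSONDecodeError:
            continue  # tolerate partially written or corrupt lines
        if not isinstance(rec, dict):
            continue
        if rec.get("event") != "skill_gap" or rec.get("skill") != skill:
            continue
        prompt, ts = rec.get("prompt"), rec.get("ts")
        if not isinstance(prompt, str) or not isinstance(ts, str):
            continue
        # Same-format ISO-8601 strings sort chronologically as plain text.
        if best_ts is None or ts >= best_ts:
            best_ts, best_prompt = ts, prompt
    return best_prompt
```

Returning `None` when nothing matches maps onto the best-effort behaviour: with no history, the caller falls back to a text-only review.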

3. Promote to core, default-OFF (profiles/core/profile.yaml, auto-evolve.sh)
Both self-learner Stop hooks (profile-self-improve.json capture + auto-evolve.json act) move from the skill-writer pilot into core. They exit 0 instantly when the flag files are absent, so this adds zero runtime cost until opt-in. Also fixes the auto-evolve.sh portability gap (it hard-coded ~/Documents/cue/evolution; now self-locates from the materialized hook symlink, CUE_EVOLUTION_DIR override preserved).
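The "exit 0 instantly when the flag files are absent" gate amounts to a single file-existence check before any work. The hooks themselves are shell scripts; the sketch below restates the same idea in Python, with `run_gated_hook` and its parameters being hypothetical names rather than anything in the repository.

```python
from pathlib import Path
from typing import Callable


def run_gated_hook(config_dir: Path, flag_name: str, work: Callable[[], int]) -> int:
    """Return 0 immediately unless the opt-in flag file exists."""
    if not (config_dir / flag_name).is_file():
        return 0  # flag absent: silent no-op, no work attempted
    return work()  # flag present: run the hook body, return its exit code
```

For example, `run_gated_hook(Path.home() / ".config" / "cue", ".auto-evolve-enabled", work)` never calls `work` until the flag file has been created with `touch`.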

Deferred (documented in `docs/self-evolution-upgrade-plan.md`)

2b/2c — GEPA judge default + SubagentJudgeMetric. Cut on purpose: they touch only the manual GEPA path (the automated loop runs single-shot, never GEPA), and they can't be live-verified in the dev env (dspy import is broken — libstdc++.so.6). A naive default-flip would also put LLMJudge in GEPA's inner loop (~1200 calls/run). Exact change-points are recorded for a follow-up once dspy works. Stage 2a already delivers "judge real behaviour on mined tasks" for the path that runs.

How to enable (opt-in, propose-only)

cue telemetry enable                              # consent (already set on dev machine)
touch ~/.config/cue/.auto-improve-enabled         # capture skill_gap signals
touch ~/.config/cue/.auto-evolve-enabled          # allow evolution (propose-only)
# auto-apply stays OFF unless CUE_AUTO_EVOLVE_APPLY=1; 24h cooldown; canary auto-revert

Test plan

evolution/tests: 95 passed / 2 skipped (skips = dspy-only seams; dspy broken in dev env). 17 new writer-loop/grounding tests incl. an evolve() single-shot integration test that exercises the real call site.
src/lib/runtime-materializer.test.ts: 40 passed / 0 fail (hook materialization unaffected).
cue validate core: schema-valid, 20 hooks resolved; flags-OFF = silent exit 0.
Independent adversarial review (4 dimensions → verified): 1 CRITICAL found + fixed (propose_only kwarg TypeError at the real call site), HIGH/LOW test gaps closed.
CI is the gate for the full TS suite: it won't complete locally because validate/e2e do slow submodule/registry resolution.

🤖 Generated with Claude Code

The automated Stop-hook evolver runs the single-shot optimizer, which did one claude -p rewrite plus one text-diff judge with no iteration. Replace it with a bounded writer→lint→critic loop (reflective.writer_critic_loop): the writer retries with the cue lint errors and the critic's concrete fixes as feedback until the candidate passes cue lint-skill and an independent critic (a different, stronger model) judges it BETTER, or the round budget runs out.

- reflective.py: writer_step (lint-retry), critic_step (returns suggested_fixes), writer_critic_loop; propose_improved_body/judge_is_better kept as thin wrappers so existing callers and the hooks keep working.
- evolve_skill.py: single-shot branch drives the loop; logs optimizer:"writer-loop".
- config.py: writer_loop_rounds knob (CUE_WRITER_LOOP_ROUNDS, default 2).
- fail-soft: a claude -p outage degrades to a proposal, never crashes the run.
- tests: 8 DSPy-free seam tests covering retry, fix-feedback, and outage paths.
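One way to get the fail-soft property is to wrap the model call so that every failure mode collapses to `None` instead of an exception. The sketch below is an assumption about the shape of such a wrapper, not the project's code: `run_model_fail_soft`, its parameters, and the timeout value are hypothetical, and the command is passed in so the example does not depend on any particular CLI being installed.

```python
import subprocess
from typing import List, Optional


def run_model_fail_soft(
    cmd: List[str], prompt: str, timeout_s: float = 120.0
) -> Optional[str]:
    """Run `cmd + [prompt]`; return stripped stdout, or None on any failure."""
    try:
        proc = subprocess.run(
            cmd + [prompt],
            capture_output=True,
            text=True,
            timeout=timeout_s,
            check=False,  # inspect returncode ourselves instead of raising
        )
    except (OSError, subprocess.TimeoutExpired):
        # Binary missing, not executable, or hung: treat as an outage.
        return None
    output = proc.stdout.strip()
    if proc.returncode != 0 or not output:
        return None
    return output
```

A caller would then treat `None` as "no verdict this round" and keep the run alive; with the CLI named in this PR, `cmd` would presumably be `["claude", "-p"]`.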

The critic judged only the text diff. Now, when history exists, the loop runs the candidate skill body as instructions on the most recent real user prompt that flagged this skill as a gap (mined from analytics.jsonl, DSPy-free) and feeds that transcript to the critic, so the rewrite is judged on real behaviour, not prose. This is the "score by running through a real Claude Code subagent on a mined task" signal, delivered for the path the Stop-hook loop actually runs (single-shot).

- reflective.py: run_skill_on_task (DSPy-free, fail-soft), critic_step gains a task_demo arg + DEMO block, writer_critic_loop gains task_input.
- evolve_skill.py: _representative_task mines the triggering prompt; single-shot branch grounds the critic with it.
- tests: +5 (grounding, mining recency, outage fall-back).

GEPA judge-default + SubagentJudgeMetric (2b/2c) deferred: manual-only path, not live-verifiable here (dspy broken). Change-points documented in the plan.
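The optional DEMO block can be pictured as a prompt section that is appended only when a transcript is available. The sketch below is illustrative: the section markers, the instruction wording, and the function name `build_critic_prompt` are assumptions, and the actual prompt used by critic_step is not shown in this PR.

```python
from typing import Optional


def build_critic_prompt(
    old_body: str, new_body: str, task_demo: Optional[str] = None
) -> str:
    """Assemble a critic prompt; include a DEMO section only when grounded."""
    parts = [
        "Compare the two skill bodies and answer BETTER, EQUAL, or WORSE.",
        "List concrete fixes if the candidate is not BETTER.",
        "=== CURRENT ===",
        old_body.strip(),
        "=== CANDIDATE ===",
        new_body.strip(),
    ]
    if task_demo and task_demo.strip():
        # Transcript of the candidate skill run on a mined real prompt.
        parts += ["=== DEMO ===", task_demo.strip()]
    return "\n".join(parts)
```

With `task_demo=None` (no history, or the demo run failed) the prompt degrades to the text-only comparison.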

Wire the two self-learner Stop hooks into the core profile so every profile can learn, not just the skill-writer pilot:

- profile-self-improve.json (CAPTURE: skill_gap signals + optional live critic)
- auto-evolve.json (ACT: evolve the most-flagged skill, propose-only)

Both are Stop-only and exit 0 instantly when the flag files are absent, so the promotion adds ZERO runtime cost until a user opts in with .auto-improve-enabled (+ telemetry consent) and .auto-evolve-enabled.

Deliberately NOT promoting learnings-surface.json: it's a SessionStart context injection with per-session token cost (separate decision; conflicts with recent token-reduction work).

Also fix the auto-evolve.sh portability gap: it hard-coded ~/Documents/cue/evolution as the package dir, so any other checkout silently no-opped. Now it self-locates from the materialized hook symlink (python3 realpath → repo root → /evolution), with CUE_EVOLUTION_DIR override and the old default as last resort.

Verified: cue validate core → schema-valid, hooks: 20 resolved; flags-OFF = silent exit 0; symlink-resolved EVO_DIR finds bin/auto-evolve.
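The resolution order described above (override, then symlink-resolved location, then the old default) might be sketched like this. It is an approximation under stated assumptions: the real logic lives in auto-evolve.sh, walking up the parent directories to find the repo root is a guess at how "realpath → repo root" works, and `resolve_evolution_dir` is a hypothetical name.

```python
import os
from pathlib import Path
from typing import Mapping


def resolve_evolution_dir(
    hook_path: Path, env: Mapping[str, str], legacy_default: Path
) -> Path:
    """Pick the evolution package dir: override > self-locate > old default."""
    override = env.get("CUE_EVOLUTION_DIR")
    if override:
        return Path(override)
    # Follow the materialized symlink back to the file inside the checkout.
    real = Path(os.path.realpath(hook_path))
    for parent in real.parents:
        candidate = parent / "evolution"
        # Assumed marker: the PR notes the resolved dir contains bin/auto-evolve.
        if (candidate / "bin" / "auto-evolve").is_file():
            return candidate
    return legacy_default
```

Checking for a known file inside the candidate directory, rather than trusting a fixed number of `..` hops, keeps the lookup working if the hook file moves within the repository.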

…op tests

Independent review caught a CRITICAL: evolve()'s single-shot branch passed propose_only= to writer_critic_loop, which has no such parameter, causing a TypeError on every real (non-dry-run) single-shot evolve. The dry-run returns before that call site and the unit tests called the loop directly, so both missed it.

- evolve_skill.py: remove propose_only= from the writer_critic_loop call (the loop is propose-only-agnostic; _finalize enforces propose-only).
- reflective.py: log "lint failed; not judged" instead of an empty judge_reason when a candidate never passes lint, for a clearer evolution-log entry.
- tests: add an evolve() single-shot INTEGRATION test (mocks claude -p + the cue gate) exercising the real call site to catch signature drift; plus EQUAL→BETTER retry, all-rounds-lint-fail, and max_rounds=1 budget coverage.

Full suite: 95 passed / 2 skipped.
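The failure mode above is easy to reproduce in isolation: Python rejects an unknown keyword argument only when the call actually executes, so a call site that tests never reach can drift from the callee's signature unnoticed. The sketch below uses a stand-in function (`writer_critic_loop_stub`, with a made-up parameter list) and a hypothetical helper `call_site_matches` built on `inspect.signature`; it complements an integration test that runs the real call site and is not a substitute for one.

```python
import inspect
from typing import Any, Callable


def writer_critic_loop_stub(body: str, max_rounds: int = 2, task_input: str = "") -> str:
    """Stand-in with an assumed parameter list; note there is no propose_only."""
    return body


def call_site_matches(func: Callable[..., Any], *args: Any, **kwargs: Any) -> bool:
    """True if func(*args, **kwargs) would bind, without calling func."""
    try:
        inspect.signature(func).bind(*args, **kwargs)
    except TypeError:
        return False
    return True


# The drifted call only fails when it is executed:
try:
    writer_critic_loop_stub("body", propose_only=True)  # type: ignore[call-arg]
    drift_detected = False
except TypeError:
    drift_detected = True
```

`Signature.bind` raises the same `TypeError` the real call would, which makes it a cheap guard for argument names; only a test that drives `evolve()` itself proves the production call site is exercised.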

NagyVikt added 5 commits June 13, 2026 15:01

docs(evolution): add Hermes-style self-evolution upgrade plan

c38205c

NagyVikt merged commit 3b209da into main Jun 13, 2026
3 of 5 checks passed

NagyVikt merged 5 commits into main from feat/skill-evolution-upgrade
