Supercharge your coding agents with EvoSkill, an agent-agnostic toolkit for automatically creating and improving AI skills, compatible with Claude Code, OpenCode, OpenHands, Goose, and more.
EvoSkill uses GEPA/DSPy-style self-improvement algorithms that identify agent failure patterns, propose skill or prompt improvements, evaluate the changes, and keep the best-performing variants, similar to Karpathy's autoresearch.
EvoSkill installs into any supported coding agent in seconds and supercharges it with automatically created skills. Depending on the agent, you can use any model provider (OpenRouter, Anthropic, OpenAI, Fireworks, and more) and any model (Claude, GLM, Minimax, Kimi, GPT, Gemini, Qwen, and others).
| Agent | Support | Notes |
|---|---|---|
| Claude Code | ✅ | |
| OpenCode | ✅ | |
| OpenHands | 🛠️ | |
| Goose | 🛠️ | |
| Codex CLI | 🛠️ | |
| Capability | Status | Explanation |
|---|---|---|
| Evolution with a benchmark | ✅ | Skills can be effectively improved against your own or academic benchmarks. |
| Cross-agent transferability | ✅ | Skills are packaged as reusable folders with instructions, metadata, and helper scripts, compatible with many coding agents. |
| Cross-model transferability | ✅ | Skills evolved with a fixed LLM can transfer their performance gains to other LLMs, as demonstrated in the EvoSkill paper. |
| Cross-task transferability | ✅ | Generated skills can be generic enough to transfer across tasks, for instance a SealQA skill improving BrowseComp performance (as shown in EvoSkill). |
| Evolution without a benchmark | 🛠️ | An open research direction where benchmarks are generated on the fly (e.g., Hermes-Agent self-evolution). |
| Continuous evolution | 🛠️ | Integrating the ability to improve skills from regular usage. |
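Skills themselves are packaged as plain folders. A typical layout, following the Claude Code skill convention, might look like this (folder and file names here are illustrative, not generated output):

```text
.claude/skills/financial-report-qa/
├── SKILL.md                 # instructions + YAML frontmatter (name, description)
└── scripts/
    └── extract_figures.py   # optional helper script the agent can run
```

Because a skill is just files on disk, the same folder can be dropped into any harness that reads skill directories.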
- Installation
- Quickstart
- CLI Reference
- Configuration Reference
- How It Works
- Git Branches
- When the Loop Gets Stuck
- Python API
- Citation
- License
Requirements:
- Python 3.12+
- uv (recommended) or pip

```bash
# Using uv (recommended)
uv sync

# Or using pip
pip install -e .
```

API key:

```bash
export ANTHROPIC_API_KEY=your-key-here
```

Run `evoskill init` inside any git repository:
```text
$ evoskill init
EvoSkill — Project Setup

Which harness? › claude
Evolution mode? › skill_only — agent learns new skills (recommended)
Dataset path? › ./data/questions.csv
Question column name? › question
Ground truth column name? › answer
Category column name? (leave blank if none) ›
```

This creates `.evoskill/config.toml` and `.evoskill/task.md`.
Edit `.evoskill/task.md` to describe what the agent should do:

```markdown
# Task

Answer questions about quarterly financial reports.
Return only the numeric answer with units.

## Examples

- "What was revenue in Q3?" → "$4.2B"

---

# Constraints

- Always include units in the answer
- Do not explain your reasoning, just return the answer
```

```bash
evoskill run
```

EvoSkill will run the evolutionary loop and print a live progress table:
```text
Iter   Accuracy   Δ        Skills   Frontier   Status
1      42.0%      —        0        [1]        baseline
2      51.3%      +9.3%    1        [1, 2]     ★ new best
3      49.7%      -1.6%    1        [1, 2]     discarded
...
```

```bash
evoskill eval     # score the best program on the validation set
evoskill skills   # list all discovered skills
evoskill diff     # see what changed vs baseline
evoskill logs     # view past run history
```

After the loop finishes, the best program lives on a git branch:
```bash
git branch | grep program/          # list all program branches
git checkout program/iter-skill-3   # switch to the best one
```

From there you can inspect what the loop discovered:

```bash
cat .claude/program.yaml   # system prompt, tools, score
ls .claude/skills/         # all learned skills
```

Copy `.claude/program.yaml` and `.claude/skills/` into your deployment to use the evolved agent configuration.
| Command | Description |
|---|---|
| `evoskill init` | Initialize a new project (creates `.evoskill/`) |
| `evoskill run` | Run the self-improvement loop |
| `evoskill eval` | Evaluate the best program on the validation set |
| `evoskill skills` | List all skills discovered so far |
| `evoskill diff` | Diff baseline vs best, or between two iterations |
| `evoskill logs` | Show recent run history |
| `evoskill reset` | Delete all program branches and start fresh |
```bash
evoskill run [--continue] [--verbose] [--quiet]
```

| Flag | Description |
|---|---|
| `--continue` | Resume from the existing frontier instead of starting fresh. Preserves all `program/*` branches, `frontier/*` tags, feedback history, and the sampling checkpoint so the loop picks up exactly where it left off. |
| `--verbose` | Show per-sample pass/fail results |
| `--quiet` | Show only the progress table, suppressing proposer output |
```bash
evoskill diff        # baseline → current best
evoskill diff 3 7    # iteration 3 vs iteration 7
```

The diff is scoped to the `.claude/` directory — it shows changes to skills and the system prompt, not your source code.
```bash
evoskill logs             # last 5 runs (default)
evoskill logs --last 10   # last 10 runs
```

```bash
evoskill reset   # prompts for confirmation
```

`evoskill reset` deletes all `program/*` branches, `frontier/*` tags, the loop checkpoint, and feedback history. Your source code, `config.toml`, `task.md`, and any skills in `.claude/skills/` are left untouched.
`evoskill init` creates `.evoskill/config.toml`. All fields are optional — defaults are shown below.
```toml
[harness]
name = "claude"        # "claude" or "opencode"
model = "sonnet"       # model alias or full model ID (e.g. "claude-sonnet-4-6")
data_dirs = []         # extra directories the agent can read

[evolution]
mode = "skill_only"    # "skill_only" or "prompt_only"
iterations = 20
frontier_size = 3
concurrency = 4
no_improvement_limit = 5

[dataset]
path = "data/questions.csv"    # relative to .evoskill/, or absolute
question_column = "question"
ground_truth_column = "ground_truth"
category_column = ""           # optional, for stratified sampling
train_ratio = 0.18
val_ratio = 0.12

[scorer]
type = "multi_tolerance"       # see scorer types below
```

| Type | Description |
|---|---|
| `multi_tolerance` | Flexible string matching: exact, numeric tolerance, list overlap (default) |
| `exact` | Case-insensitive exact string match |
| `llm` | LLM-as-judge grading with a custom rubric |
| `script` | Shell script scorer — receives `{predicted}` and `{expected}` as variables |
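To make the default concrete, here is a rough sketch of what multi-tolerance matching can look like. This is an illustrative reimplementation, not EvoSkill's actual scorer; the 5% tolerance, the number parsing, and the comma delimiter for lists are all assumptions:

```python
def multi_tolerance_score(predicted: str, expected: str, rel_tol: float = 0.05) -> float:
    """Try exact match, then numeric match within a relative tolerance,
    then overlap between comma-separated lists. Returns a score in [0, 1]."""
    p, e = predicted.strip().lower(), expected.strip().lower()
    if p == e:                                   # case-insensitive exact match
        return 1.0

    def to_num(s: str):
        # Tolerant numeric parse: strip currency signs, thousands separators, percent
        try:
            return float(s.replace("$", "").replace(",", "").rstrip("%"))
        except ValueError:
            return None

    pn, en = to_num(p), to_num(e)
    if pn is not None and en is not None:        # numeric tolerance
        if en == 0:
            return 1.0 if pn == 0 else 0.0
        return 1.0 if abs(pn - en) / abs(en) <= rel_tol else 0.0

    p_items = {x.strip() for x in p.split(",")}  # list overlap: fraction of
    e_items = {x.strip() for x in e.split(",")}  # expected items recovered
    return len(p_items & e_items) / len(e_items) if e_items else 0.0
```

Partial credit from the list-overlap branch is what distinguishes this scorer from plain `exact` matching.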
LLM scorer options:

```toml
[scorer]
type = "llm"
rubric = "Award 1.0 if the answer is numerically correct within 5%, 0.0 otherwise."
model = "claude-sonnet-4-6"   # defaults to claude-sonnet-4-6
provider = "anthropic"        # "anthropic", "openai", or "google"
```

Script scorer options:
```toml
[scorer]
type = "script"
command = "python score.py --predicted {predicted} --expected {expected}"
```

The self-improvement loop follows five stages:
1. **Base Agent** — Attempts benchmark questions using the current best program (system prompt + skills).
2. **Proposer** — Analyzes failure cases and proposes targeted skill or prompt changes to address them.
3. **Generator** — Creates the proposed changes: writes new skill files or rewrites the system prompt.
4. **Evaluator** — Scores the new program variant on a held-out validation set to measure improvement.
5. **Frontier** — Tracks the top-N performing programs as git branches; the best survive to the next iteration.
This cycle repeats for a configurable number of iterations, automatically converging on stronger agent configurations.
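The five stages above amount to a frontier-based evolutionary search. The sketch below shows just the control flow, with toy stand-ins for the LLM-driven stages (the function names and the toy task are hypothetical; in EvoSkill the proposer and generator are agent calls, not a `mutate` lambda):

```python
import random

def evolve(base, mutate, evaluate, iterations=20, frontier_size=3, seed=0):
    """Frontier loop sketch: score the base program, then repeatedly mutate
    the current best, score the variant, and keep only the top-N programs."""
    rng = random.Random(seed)
    frontier = [(evaluate(base), base)]                    # (score, program)
    for _ in range(iterations):
        _, parent = max(frontier, key=lambda sp: sp[0])    # best survivor
        child = mutate(parent, rng)                        # proposer + generator
        frontier.append((evaluate(child), child))          # evaluator
        frontier = sorted(frontier, key=lambda sp: sp[0],  # prune to top-N
                          reverse=True)[:frontier_size]
    return max(frontier, key=lambda sp: sp[0])

# Toy stand-in task: a "program" is a number; the score rewards closeness to 10.
best_score, best = evolve(
    base=0.0,
    mutate=lambda p, rng: p + rng.uniform(-1, 2),
    evaluate=lambda p: -abs(p - 10),
    iterations=50,
)
```

Because losing variants are discarded rather than merged, the best score is monotonically non-decreasing across iterations, which matches the "discarded" rows in the progress table above.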
EvoSkill uses your repo's git history to version every program it creates. During a run it automatically creates and switches between branches — you don't need to do anything. After a run your branch layout will look like:
```text
main                    ← your code, untouched
program/base            ← initial baseline agent
program/iter-skill-1    ← after iteration 1
program/iter-skill-2    ← after iteration 2
...
```

Frontier members are marked with `frontier/*` tags. EvoSkill only ever writes to branches prefixed `program/`, so there is no risk of it touching your working branch.
If accuracy stops improving, try the following:
- Check the feedback log — `.claude/feedback_history.md` records what the proposer tried each iteration and why it succeeded or failed.
- Resume instead of restarting — `evoskill run --continue` picks up from the last frontier rather than discarding progress.
- Reset and start fresh — `evoskill reset` clears all branches and lets you start over with a revised `task.md`.
For programmatic usage, EvoSkill exposes a high-level Python API.
```python
from src.api import EvoSkill

evo = EvoSkill(
    task="sealqa",
    model="sonnet",
    mode="skill_only",
    max_iterations=20,
    frontier_size=3,
    concurrency=4,
    train_ratio=0.18,
    val_ratio=0.12,
    continue_mode=False,
)
result = await evo.run()

# Synchronous usage
result = EvoSkill(task="base").run_sync()
```

```python
from src.api import EvalRunner

summary = await EvalRunner(
    task="sealqa",
    model="sonnet",
    max_concurrent=8,
).run()
```

If you use EvoSkill in your research, please cite the original paper:
```bibtex
@misc{alzubi2026evoskillautomatedskilldiscovery,
  title={EvoSkill: Automated Skill Discovery for Multi-Agent Systems},
  author={Salaheddin Alzubi and Noah Provenzano and Jaydon Bingham and Weiyuan Chen and Tu Vu},
  year={2026},
  eprint={2603.02766},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2603.02766},
}
```

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.


