skill-eval

Do agent skills actually help? We tested 123 to find out.

A self-evolving evaluation engine that blind-tests ClawHub agent skills against a no-skill baseline. Every skill gets a standardized score card, and the methodology improves with each batch.

Key Findings

123 skills evaluated using Claude Opus 4
76% are genuinely useful (score 7+, "Recommended")
92% show positive quality delta over baseline
Only 4% score below 3 ("Not Recommended"), where the baseline is comparable or better
The #1 improvement lever is removing content, not adding it

Live Leaderboard

jensen-srp.github.io/skill-eval

Browse all 123 skill cards, filter by verdict, sort by score.

Methodology

Blind A/B testing with two-layer assertions:

Layer 1 (Deterministic): File checks, keyword presence, format compliance
Layer 2 (Rubric-based): LLM-as-judge quality scoring with specific rubrics
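
As a rough illustration of what a Layer 1 check might look like, here is a minimal Python sketch. It is not the engine's actual code: the names `CheckResult` and `run_layer1_checks`, and the idea of passing keywords and a heading pattern as arguments, are assumptions made for this example. It covers keyword presence, a simple format check, and an optional file-existence check.

```python
import re
from dataclasses import dataclass
from pathlib import Path


@dataclass
class CheckResult:
    name: str
    passed: bool


def run_layer1_checks(output, required_keywords, heading_pattern, expected_files=()):
    """Run deterministic checks on one agent output (hypothetical helper)."""
    results = []
    lowered = output.lower()
    # Keyword presence: case-insensitive substring match.
    for keyword in required_keywords:
        results.append(CheckResult(f"keyword:{keyword}", keyword.lower() in lowered))
    # Format compliance: at least one line must match the heading pattern.
    has_heading = re.search(heading_pattern, output, re.MULTILINE) is not None
    results.append(CheckResult("format:heading", has_heading))
    # File checks: every expected output file must exist on disk.
    for path in expected_files:
        results.append(CheckResult(f"file:{path}", Path(path).exists()))
    return results


if __name__ == "__main__":
    sample = "# Summary\nThe function returns a list."
    for result in run_layer1_checks(sample, ["returns", "complexity"], r"^# \w+"):
        print(result.name, "PASS" if result.passed else "FAIL")
```

Because these checks need no model call, they give the same result on every run, which is what separates them from the rubric-based Layer 2.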

Each skill is run on the same prompts with and without the skill loaded, so the only variable is whether the skill is present. Scores combine quality (0-5), value-add delta (0-3), and efficiency (0-2); these maxima add up to the 10-point scale used in the Scoring table.
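
The README does not spell out how the three components are combined. The sketch below assumes a plain sum, which is consistent with the component ranges (5 + 3 + 2 = 10) but is not confirmed by the source. The function name `total_score` is hypothetical.

```python
def total_score(quality, value_add, efficiency):
    """Combine the three components into a 0-10 score.

    Assumption: the components are simply added. The real engine may
    weight or normalize them differently.
    """
    limits = {"quality": (quality, 5), "value_add": (value_add, 3), "efficiency": (efficiency, 2)}
    for name, (value, maximum) in limits.items():
        if not 0 <= value <= maximum:
            raise ValueError(f"{name} must be between 0 and {maximum}, got {value}")
    return round(quality + value_add + efficiency, 1)


if __name__ == "__main__":
    print(total_score(4.5, 3, 2))
```

Validating each range up front keeps an out-of-range judge score from silently inflating the total.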

Read the full methodology

Scoring

Score	Verdict	Meaning
7-10	Recommended	Clear value over baseline
5-6.9	Conditional	Some value with trade-offs
3-4.9	Marginal	Overhead without proportional improvement
0-2.9	Not Recommended	Baseline is comparable or better
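
The table above maps directly onto a small lookup. This is a sketch rather than the engine's code: the function name `verdict` is hypothetical, and the cut-offs at 7, 5, and 3 are an assumption about how scores between the listed bands (for example 6.95) would be handled.

```python
def verdict(score):
    """Map a 0-10 score to its verdict label, using the bands in the table."""
    if not 0 <= score <= 10:
        raise ValueError(f"score must be between 0 and 10, got {score}")
    if score >= 7:
        return "Recommended"
    if score >= 5:
        return "Conditional"
    if score >= 3:
        return "Marginal"
    return "Not Recommended"


if __name__ == "__main__":
    for value in (9.5, 6.9, 3.0, 2.9):
        print(value, verdict(value))
```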

Top 10

Rank	Skill	Score	Delta
1	Explain Code	10.0	+50%
2	BookNotes	9.6	+88%
3	Self-Improving with Reflection	9.5	+100%
4	EasyDoc Parse	9.5	+100%
5	BusAPI Agent Marketplace	9.5	+100%
6	Binance Web3 API	9.5	+100%
7	Article Writer (Chinese Viral)	9.5	+100%
8	KU Portal	9.5	+100%
9	Scan to Skill	9.5	+100%
10	GitVerse	9.5	+100%

Self-Evolving Engine

The engine improves its own methodology after each evaluation batch:

v0.1.0 -- Basic blind A/B, 7 phases
v0.2.0 -- Dependency handling, category-specific assertions
v0.3.0 -- Anti-pattern detection, skill improvement pipeline
v0.4.0 -- Multi-model support, self-evolving improvement engine, 12 phases

Structure

skill-cards/       -- 123 evaluation reports (one per skill)
blog/              -- Methodology blog post
index.html         -- Interactive leaderboard

For Skill Authors

Your skill has been evaluated. Find your report in skill-cards/<your-skill>.md.

Tips for higher scores:

Be opinionated -- define specific formats, banned words, required sections
Don't teach -- instruct. Use MUST/ALWAYS/NEVER, not tutorials
Keep it under 100 lines
Handle errors gracefully
Don't reference tools that don't exist
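
Two of these tips are easy to check mechanically before submitting a skill. The following is a minimal sketch of such a pre-check, not part of skill-eval itself: `lint_skill` is a hypothetical helper, and it only covers the line-count and MUST/ALWAYS/NEVER tips.

```python
import re

# Directive words the tips recommend using instead of tutorial-style prose.
DIRECTIVES = re.compile(r"\b(MUST|ALWAYS|NEVER)\b")


def lint_skill(text, max_lines=100):
    """Return a list of warnings for a skill file's text (hypothetical helper)."""
    warnings = []
    line_count = len(text.splitlines())
    # "Keep it under 100 lines": exactly max_lines or more triggers a warning.
    if line_count >= max_lines:
        warnings.append(f"skill has {line_count} lines; keep it under {max_lines}")
    # "Don't teach -- instruct": expect at least one directive word.
    if not DIRECTIVES.search(text):
        warnings.append("no MUST/ALWAYS/NEVER directives found")
    return warnings


if __name__ == "__main__":
    print(lint_skill("You MUST output JSON.\nNEVER apologize."))
```

A clean result here does not predict a high score; it only catches the two problems that can be detected without running the skill.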

Built with skill-eval v0.4.0 | ClawHub
