skill-eval: data-model-designer 5.5 -> 7.0/10 (eval report + improved SKILL.md)#174

Closed
jensen-srp wants to merge 4 commits into openclaw:main from jensen-srp:skill-eval/data-model-designer

@jensen-srp

skill-eval Report: data-model-designer

Original Score: 5.5/10 (Conditional) -> Improved: 7.0/10 (Recommended)

Blind A/B evaluation of data-model-designer (by @datadrivenconstruction), run with skill-eval, an automated evaluation engine that has tested 123 ClawHub skills.

Finding

The current SKILL.md is an 11 KB Python class definition (ConstructionDataModel). When the skill is loaded, the model tries to use these classes, which costs 83% more time and 113% more tokens than the baseline without producing better schemas.

Improvement

Rewrote SKILL.md as a behavioral contract:

  • Removed all Python code (the model already knows how to code)
  • Added mandatory output sections: Requirements Summary, ER Diagram, SQL DDL, Validation Checklist, Sample Queries
  • Added SQL style rules and explicit prohibitions
  • Result: overhead dropped and the score rose from 5.5 to 7.0
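
As an illustration of how the mandatory output sections could be verified deterministically, here is a minimal sketch. The section names come from the list above; the function name `missing_sections` and the heading-matching rule (optional `#` prefix, optional numbering, case-insensitive) are assumptions for this example, not skill-eval's actual implementation.

```python
import re

# Section names are taken from the PR description above.
REQUIRED_SECTIONS = [
    "Requirements Summary",
    "ER Diagram",
    "SQL DDL",
    "Validation Checklist",
    "Sample Queries",
]


def missing_sections(output: str) -> list[str]:
    """Return the required sections that never appear as a heading-like line.

    Hypothetical helper: a line counts as a heading if it consists only of
    the section name, optionally preceded by '#' marks or a number and
    optionally followed by a colon.
    """
    missing = []
    for name in REQUIRED_SECTIONS:
        pattern = rf"^\s*(?:#+\s*)?(?:\d+[.)]\s*)?{re.escape(name)}\s*:?\s*$"
        if not re.search(pattern, output, flags=re.IGNORECASE | re.MULTILINE):
            missing.append(name)
    return missing


if __name__ == "__main__":
    sample = "## Requirements Summary\nTrack projects.\n\n## SQL DDL\nCREATE TABLE project (id INT);\n"
    print(missing_sections(sample))
```

A check like this only confirms that a section is present; whether its content is any good would still need a rubric-style review.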

Files added

File                 Description
EVAL-REPORT.md       Original blind A/B evaluation
EVAL-IMPROVED.md     Before/after comparison card
SKILL-improved.md    Proposed improved SKILL.md
IMPROVEMENT-LOG.md   Detailed changelog with rationale

Methodology

  • Blind A/B: the same prompts are tested with and without the skill loaded
  • Two-layer assertions: deterministic checks + rubric-based quality scoring
  • Model: Claude Opus 4
  • Full methodology
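
The comparison described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `RunResult` type, the equal weighting of the two assertion layers, and every number in the usage example are hypothetical and are not taken from skill-eval. The PR does not say how the two layers are combined into one score.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """One evaluated run (hypothetical structure, not skill-eval's own)."""
    seconds: float
    tokens: int
    checks_passed: int   # layer 1: deterministic checks that passed
    checks_total: int    # layer 1: deterministic checks that were run
    rubric_score: float  # layer 2: rubric-based quality score on a 0-10 scale


def relative_overhead(with_skill: float, baseline: float) -> float:
    """Percentage increase of the with-skill run over the baseline run."""
    return 100.0 * (with_skill - baseline) / baseline


def combined_score(run: RunResult, rubric_weight: float = 0.5) -> float:
    """Blend both layers into a 0-10 score (the weighting is an assumption)."""
    deterministic = 10.0 * run.checks_passed / run.checks_total
    return (1.0 - rubric_weight) * deterministic + rubric_weight * run.rubric_score


if __name__ == "__main__":
    # Made-up measurements for the same prompt, without and with the skill.
    baseline = RunResult(seconds=100.0, tokens=1000, checks_passed=3,
                         checks_total=4, rubric_score=8.0)
    with_skill = RunResult(seconds=183.0, tokens=2130, checks_passed=3,
                           checks_total=4, rubric_score=8.0)
    print(relative_overhead(with_skill.seconds, baseline.seconds))
    print(relative_overhead(with_skill.tokens, baseline.tokens))
    print(combined_score(with_skill) - combined_score(baseline))
```

In this made-up case the skill adds time and tokens while the score difference is zero, which is the pattern the Finding section describes for the original SKILL.md.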

Leaderboard

All 123 evaluated skills: jensen-srp.github.io/skill-eval


Filed by skill-eval v0.4.0

@openclaw-barnacle

Thanks for the pull request! This repository is read-only and is automatically synced from https://clawhub.ai, so we can’t accept changes here. Please make updates on the website instead.
