Evaluation System

Shipwright skills produce structured artifacts. This directory provides rubrics for evaluating whether those artifacts are actually good: not just well-formatted, but decision-ready.

Why evaluate

A PRD can follow the Working Backwards template perfectly and still be useless if it's full of vague language, missing evidence, and placeholder metrics. Structure is necessary but not sufficient. Rubrics close the gap between "correctly formatted" and "actually good."

How to use the rubrics

During a session

After a skill or workflow produces an artifact, ask the agent to score it:

Score this PRD against the evaluation rubric in evals/prd.md. Be honest.

The agent will rate each dimension, identify the weakest areas, and suggest specific improvements. This works best as a second pass: generate the artifact first, then evaluate and revise.

For team calibration

Use the rubrics to align your team on what "good" looks like. The scored examples in each rubric show the concrete difference between a 6/10 and a 9/10 artifact. Review these in a team setting to build shared quality standards.

For continuous improvement

Track scores across sessions. If your PRDs consistently score low on "evidence grounding," that's a signal to invest more in discovery before writing requirements, not a prompt engineering problem.
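
As a minimal sketch of what tracking could look like, assume each session's scores are recorded as a mapping from dimension name to a 1-10 score. The function name `weak_dimensions`, the data shape, the sample values, and the reuse of 7 as the threshold are illustrative assumptions, not part of Shipwright.

```python
from statistics import mean


def weak_dimensions(sessions, threshold=7.0):
    """Return dimensions whose average score across sessions is below threshold.

    `sessions` is a list of dicts mapping dimension name -> score (1-10).
    This data shape is an assumption made for illustration.
    """
    by_dimension = {}
    for scores in sessions:
        for dimension, score in scores.items():
            by_dimension.setdefault(dimension, []).append(score)

    return {
        dimension: round(mean(values), 2)
        for dimension, values in by_dimension.items()
        if mean(values) < threshold
    }


# Hypothetical scores from three sessions, made up for this example.
sessions = [
    {"Clarity": 8, "Evidence Grounding": 5},
    {"Clarity": 7, "Evidence Grounding": 6},
    {"Clarity": 9, "Evidence Grounding": 4},
]

print(weak_dimensions(sessions))
```

A dimension that shows up here session after session points at an upstream gap, such as thin discovery work, rather than at the wording of the prompt.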

Enforcement first, scoring second

Use pass/fail gates before any scoring rubric:

1. Run pass/fail gates to determine if the artifact is acceptable or must be rewritten
2. If it passes, run a scoring rubric to improve quality from good to excellent

This prevents evals from becoming optional documentation.

The rubrics

| Rubric | Evaluates | Artifact-specific dimensions |
| --- | --- | --- |
| Pass/Fail gates | Binary readiness check | Required structure, decision frame, evidence integrity, ownership |
| Universal rubric | Any Shipwright artifact | Clarity, Completeness, Actionability, Correctness |
| PRD | PRDs from `/write-prd` | + Scope Discipline, Evidence Grounding |
| Strategy | Strategy docs from `/strategy` | + Decision Courage, Falsifiability |
| Design Review | Design reviews from `design-review` skill | + Perspective Coverage, Tension Surfacing |
| Adversarial Review | Challenge Reports from `adversarial-review` or `red-team` | + Finding Specificity, Severity Calibration, Evidence Grounding, Resolution Actionability |

Scoring

All dimensions use a 1-10 scale. The universal rubric defines anchor points at 3 (weak), 6 (adequate), and 9 (strong) for each dimension. Artifact-specific rubrics add their own anchors.

A practical quality bar:

- Fail any pass/fail gate: Rewrite required, do not ship
- Pass all gates + 7+ on all rubric dimensions: Ship-ready
- Pass all gates + any rubric dimension below 7: Iterate before broad sharing
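
The quality bar above can be read as a small decision function. The sketch below assumes gate results arrive as a mapping of gate name to boolean and rubric scores as a mapping of dimension to a 1-10 score. The function name `quality_verdict` and both data shapes are hypothetical, chosen only to illustrate the gates-first ordering.

```python
def quality_verdict(gates_passed, scores, bar=7):
    """Apply pass/fail gates first, then the scoring threshold.

    gates_passed: dict of gate name -> bool (assumed shape)
    scores: dict of rubric dimension -> score on the 1-10 scale (assumed shape)
    """
    # Any failed gate short-circuits scoring entirely.
    if not all(gates_passed.values()):
        return "Rewrite required, do not ship"
    # All gates passed: every rubric dimension must reach the bar.
    if all(score >= bar for score in scores.values()):
        return "Ship-ready"
    return "Iterate before broad sharing"


# Gate names follow the pass/fail row of the rubric table; values are illustrative.
gates = {
    "Required structure": True,
    "Decision frame": True,
    "Evidence integrity": True,
    "Ownership": True,
}
scores = {"Clarity": 8, "Completeness": 7, "Actionability": 6, "Correctness": 9}

print(quality_verdict(gates, scores))
```

Checking gates before scores keeps a high rubric score from masking a failed gate.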

Adding rubrics

To add a rubric for a new artifact type:

1. Create `evals/<artifact>.md`
2. Include the 4 universal dimensions (copy from `rubric.md`)
3. Add 2-3 artifact-specific dimensions with anchor descriptions
4. Include a scored example showing the difference between 6/10 and 9/10
5. Update `manifest.json` to register the new eval
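
Before registering a new rubric, a quick check can confirm the four universal dimensions made it into the draft. This is a minimal sketch that only looks for the dimension names in the text. The helper `missing_universal_dimensions` is hypothetical and does not reflect any Shipwright tooling or the structure of `manifest.json`.

```python
# The four universal dimensions named in the rubric table.
UNIVERSAL_DIMENSIONS = ("Clarity", "Completeness", "Actionability", "Correctness")


def missing_universal_dimensions(rubric_text):
    """Return the universal dimensions not mentioned anywhere in the rubric text.

    This is a case-insensitive substring check, so it can confirm a name
    appears but not that the dimension is properly defined with anchors.
    """
    lowered = rubric_text.lower()
    return [name for name in UNIVERSAL_DIMENSIONS if name.lower() not in lowered]


# Hypothetical draft rubric text, made up for this example.
draft = """
Clarity: ...
Completeness: ...
Correctness: ...
"""

print(missing_universal_dimensions(draft))
```

In practice the text would be read from the new file under `evals/`, and an empty result means all four names are present.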
