Shipwright skills produce structured artifacts. This directory provides rubrics for evaluating whether those artifacts are actually good: not just well-formatted, but decision-ready.
A PRD can follow the Working Backwards template perfectly and still be useless if it's full of vague language, missing evidence, and placeholder metrics. Structure is necessary but not sufficient. Rubrics close the gap between "correctly formatted" and "actually good."
After a skill or workflow produces an artifact, ask the agent to score it:
Score this PRD against the evaluation rubric in evals/prd.md. Be honest.
The agent will rate each dimension, identify the weakest areas, and suggest specific improvements. This works best as a second pass: generate the artifact first, then evaluate and revise.
Use the rubrics to align your team on what "good" looks like. The scored examples in each rubric show the concrete difference between a 6/10 and a 9/10 artifact. Review these in a team setting to build shared quality standards.
Track scores across sessions. If your PRDs consistently score low on "evidence grounding," that's a signal to invest more in discovery before writing requirements, not a prompt engineering problem.
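One way to track scores across sessions is to average each dimension over the scored artifacts and list the dimensions that stay low. The following is a minimal sketch, not part of Shipwright: the function name `weakest_dimensions`, the input shape (one dict of dimension scores per artifact), and the default threshold of 7 are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def weakest_dimensions(sessions, threshold=7.0):
    """Return {dimension: average score} for dimensions averaging below threshold.

    sessions: list of {dimension: score} dicts, one per scored artifact
              (hypothetical shape; adapt to however you record scores).
    threshold: assumed cutoff for "consistently low".
    """
    by_dimension = defaultdict(list)
    for scores in sessions:
        for dimension, score in scores.items():
            by_dimension[dimension].append(score)

    averages = {dim: mean(vals) for dim, vals in by_dimension.items()}

    # Keep only the low-scoring dimensions, weakest first.
    low = [(dim, avg) for dim, avg in averages.items() if avg < threshold]
    return dict(sorted(low, key=lambda item: item[1]))
```

A dimension that keeps appearing in the result across many artifacts points to an upstream gap, such as thin discovery work, rather than to a one-off weak draft.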
Run the pass/fail gates before any scoring rubric:
- Run pass/fail gates to determine if the artifact is acceptable or must be rewritten
- If it passes, run a scoring rubric to improve quality from good to excellent
This prevents evals from becoming optional documentation.
| Rubric | Evaluates | Artifact-specific dimensions |
|---|---|---|
| Pass/Fail gates | Binary readiness check | Required structure, decision frame, evidence integrity, ownership |
| Universal rubric | Any Shipwright artifact | Clarity, Completeness, Actionability, Correctness |
| PRD | PRDs from /write-prd | + Scope Discipline, Evidence Grounding |
| Strategy | Strategy docs from /strategy | + Decision Courage, Falsifiability |
| Design Review | Design reviews from design-review skill | + Perspective Coverage, Tension Surfacing |
| Adversarial Review | Challenge Reports from adversarial-review or red-team | + Finding Specificity, Severity Calibration, Evidence Grounding, Resolution Actionability |
All dimensions use a 1-10 scale. The universal rubric defines anchor points at 3 (weak), 6 (adequate), and 9 (strong) for each dimension. Artifact-specific rubrics add their own anchors.
A practical quality bar:
- Fail any pass/fail gate: Rewrite required, do not ship
- Pass all gates and score 7 or higher on every rubric dimension: Ship-ready
- Pass all gates but score below 7 on any rubric dimension: Iterate before broad sharing
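The quality bar above can be written as a small decision function, with gates checked before scores. This is a sketch under stated assumptions rather than Shipwright's own implementation: the name `quality_verdict`, the return strings, and the dict-shaped inputs are hypothetical.

```python
from typing import Mapping

SHIP_THRESHOLD = 7  # the "7 or higher on every dimension" bar


def quality_verdict(gates: Mapping[str, bool], scores: Mapping[str, int]) -> str:
    """Return "rewrite", "iterate", or "ship-ready" for one artifact.

    gates:  hypothetical mapping of gate name -> whether it passed
    scores: hypothetical mapping of rubric dimension -> score on the 1-10 scale
    """
    for dimension, score in scores.items():
        if not 1 <= score <= 10:
            raise ValueError(f"{dimension}: score {score} is outside the 1-10 scale")

    # Gates come first: any failed gate means rewrite, whatever the scores say.
    if not all(gates.values()):
        return "rewrite"

    # Gates passed, so rubric scores are required to decide further.
    if not scores:
        raise ValueError("no rubric scores provided")

    if all(score >= SHIP_THRESHOLD for score in scores.values()):
        return "ship-ready"
    return "iterate"
```

Because a failed gate short-circuits to "rewrite", an artifact that fails a gate does not need rubric scores at all, which matches the gate-then-score order described earlier.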
To add a rubric for a new artifact type:
- Create evals/<artifact>.md
- Include the 4 universal dimensions (copy from rubric.md)
- Add 2-3 artifact-specific dimensions with anchor descriptions
- Include a scored example showing the difference between 6/10 and 9/10
- Update manifest.json to register the new eval