Commit 4f03d1d

Finalize 1.0.0. Dashboard and examples. (Upload Haiku 4.5 and rnj-1:8b) (#8)

* chore: test
* chore: Finalise benchmark page and workflow. Add codex.
* clean up
* fix orphan history
* commit results
* structure tests history
* summary and data-history rewire
* fix orphan branch
* fix publish
* fix judge
* fix prompt
* fix docs
* update bench
* update copilot
* chore: Refresh test cases and create gemini
* fix judging ollama
* Finalize dashboard, docs, versioning. copilot fixes.
1 parent b0931c9 commit 4f03d1d

126 files changed, +5891 -2483 lines changed

.github/workflows/benchmark-dashboard.yml

Lines changed: 10 additions & 58 deletions
@@ -21,88 +21,40 @@ on:
         type: string
 
 jobs:
-  run-benchmark:
+  run-benchmark-and-publish:
     runs-on: ubuntu-latest
 
     steps:
       - uses: actions/checkout@v4
         with:
           ref: main
-          path: workspace
+          path: repo
 
       - uses: astral-sh/setup-uv@v5
         with:
           enable-cache: true
 
       - name: Run benchmark
-        working-directory: workspace
+        working-directory: repo
         env:
           OLLAMA_API_KEY: ${{ secrets.OLLAMA_API_KEY }}
         run: |
-          # Install dependencies
           uv sync --project tests
-          # Run benchmark
           uv run --project tests tests/evaluator.py \
             --provider ${{ inputs.provider }} \
             --model ${{ inputs.model }} \
             --judge \
             --verbose \
-            --report
-          # Rename artifact for clarity
-          if [ -d tests/results ]; then
-            ARTIFACT_NAME="benchmark-${{ inputs.provider }}-${{ inputs.model }}-$(date +%Y%m%d-%H%M%S)"
-            mv tests/results "tests/${ARTIFACT_NAME}"
-          fi
-          # Create docs/benchmarks if it doesn't exist for publish_benchmarks.py
-          mkdir -p docs/benchmarks
+            --report \
+            --all
 
-      - name: Generate dashboard
-        working-directory: workspace
+      - name: Publish to benchmark-history branch
+        working-directory: repo
         run: |
+          # Run publish_benchmarks.py with correct paths
           uv run --project tests python3 ci/publish_benchmarks.py \
             --provider ${{ inputs.provider }} \
             --model ${{ inputs.model }} \
             --branch benchmark-history \
-            --no-benchmark
-
-      - name: Generate dashboard
-        working-directory: workspace
-        run: |
-          uv run --project tests python3 ci/publish_benchmarks.py \
-            --provider ${{ inputs.provider }} \
-            --model ${{ inputs.model }} \
-            --branch benchmark-history
-
-  deploy-pages:
-    needs: run-benchmark
-    runs-on: ubuntu-latest
-
-    steps:
-      - name: Checkout workspace
-        uses: actions/checkout@v4
-        with:
-          ref: main
-          path: workspace
-
-      - name: Checkout benchmark data
-        uses: actions/checkout@v4
-        with:
-          ref: benchmark-history
-          path: benchmark-data
-
-      - name: Copy results to docs
-        run: |
-          mkdir -p workspace/docs/benchmarks
-          cp benchmark-data/docs/benchmarks.json workspace/docs/benchmarks.json 2>/dev/null || true
-          cp benchmark-data/docs/index.html workspace/docs/index.html 2>/dev/null || true
-          # Also copy individual benchmark results if they exist
-          cp -r benchmark-data/docs/benchmarks/*.json workspace/docs/benchmarks/ 2>/dev/null || true
-
-      - name: Commit and push updates
-        working-directory: workspace
-        run: |
-          git config user.name "GitHub Actions"
-          git config user.email "actions@github.com"
-          git add docs/
-          git commit -m "Update benchmark data" || echo "No changes to commit"
-          git push origin HEAD:benchmark-history
+            --no-benchmark \
+            --output-dir "site/benchmarks"

.github/workflows/skill-validation.yml

Lines changed: 3 additions & 8 deletions
@@ -107,11 +107,6 @@ jobs:
           # Construct arguments array for safety
           ARGS=(--provider "${{ matrix.provider }}" --model "${{ matrix.model }}" --judge --verbose --report --threshold 50)
 
-          if [ -n "${{ matrix.extra_args }}" ]; then
-            # Split extra_args safely if needed, but for now assuming simple flags
-            ARGS+=(${{ matrix.extra_args }})
-          fi
-
           if [ -n "${{ matrix.skill }}" ]; then
             ARGS+=(--skill "${{ matrix.skill }}")
           else
@@ -128,7 +123,7 @@ jobs:
         uses: actions/upload-artifact@v4
         with:
           name: ${{ steps.artifact.outputs.name }}
-          path: pr-code/tests/results/
+          path: pr-code/tests/data-history/
           retention-days: 1
 
   consolidate:
@@ -145,10 +140,10 @@ jobs:
 
       - uses: actions/download-artifact@v4
         with:
-          path: pr-code/tests/results/
+          path: pr-code/tests/data-history/
 
       - name: Consolidate results
-        run: python3 trusted-scripts/ci/consolidate_results.py --results-dir pr-code/tests/results --output-file pr-code/comment.md
+        run: python3 trusted-scripts/ci/consolidate_results.py --results-dir pr-code/tests/data-history --output-file pr-code/comment.md
 
       - name: Post to PR
         uses: marocchino/sticky-pull-request-comment@v2

.gitignore

Lines changed: 3 additions & 1 deletion
@@ -46,4 +46,6 @@ scratch-*
 
 # CI/Validation Artifacts
 comment.md
-results/
+
+# Benchmark site output
+site/

README.md

Lines changed: 23 additions & 35 deletions
@@ -2,6 +2,10 @@
 
 Language-agnostic AI agent skills that enforce fundamental programming principles. This repository provides specific, granular instructions that enable AI coding assistants to produce significantly higher-quality code that adheres to robust engineering standards.
 
+| Dashboard Explorer | Code Comparison | Judge Reasoning |
+| :---: | :---: | :---: |
+| ![Dashboard](docs/img/dashboard.png) | ![Code Comparison](docs/img/compare-code.png) | ![Judge Results](docs/img/compare-judge-results.png) |
+
 Adopting these skills measurably changes the output of AI models, shifting them from generating merely functional code to producing architecturally sound solutions.
 
 ## Table of Contents
@@ -15,66 +19,50 @@ Adopting these skills measurably changes the output of AI models, shifting them
 
 ## Installation
 
-Select your platform for specific setup instructions:
+See:
 
-- [Cursor](docs/install/cursor.md)
-- [Antigravity](docs/install/antigravity.md)
-- [GitHub Copilot](docs/install/copilot.md)
-- [Claude](docs/install/claude.md)
+- [Install Instructions](docs/install-instructions.md)
 
 ## How it Works
 
 The core of this repository is the `skills/` directory. Each skill is encapsulated in its own subdirectory following the `ps-<name>` convention (e.g., `ps-composition-over-coordination`).
 
 We use this granular structure because:
+
 1. **Focus**: It allows the AI to load only the relevant context for a specific task, avoiding context window pollution.
 2. **Modularity**: Skills can be improved, versioned, and tested independently.
 3. **Composability**: Users can select the specific combination of principles they want to enforce for their project.
 
+## Skill Integration
+
+Skills should live under the `skills/` directory as `SKILL.md` files. For a full integration guide and documentation index:
+
+```
+https://agentskills.io/integrate-skills
+https://agentskills.io/llms.txt
+```
+
 ## Validation & Testing
 
 Every skill is validated against a rigorous testing suite found in the `tests/` directory.
 
 - **Automated Judging**: We use an LLM-as-a-Judge approach. The system compares the output of a "Baseline" model (without the skill) against a "Skill" model (with the skill loaded).
-- **Semantics over Syntax**: The test does not just look for passing unit tests; it analyzes the *logic* and *structure* of the code.
+- **Semantics over Syntax**: The test does not just look for passing unit tests; it analyzes the _logic_ and _structure_ of the code.
 - **Evidence-Based**: The judge identifies the specific lines of code that demonstrate adherence to or violation of the principle.
 
 [Read our Case Study on Judge Fairness](docs/judge-fairness-case-study.md) to see how the system fairly evaluates architectural quality, even when it means failing the Skill model.
 
 ## Evaluation Results
 
-Processed 24 evaluation(s).
-
-| Test Name | Model | Baseline | With Skill | Cases Pass | Winner |
-|-----------|-------|----------|------------|------------|--------|
-| [results-ollama-devstral-small-2--24b-cloud-ps-composition-over-coordination](https://github.com/Ariel-Rodriguez/programming-skills/actions/runs/21547621647/artifacts/5329534502) | devstral-small-2:24b-cloud | good | good | ✅ 2/2 | N/A |
-| [results-ollama-devstral-small-2--24b-cloud-ps-error-handling-design](https://github.com/Ariel-Rodriguez/programming-skills/actions/runs/21547621647/artifacts/5329534340) | devstral-small-2:24b-cloud | regular | outstanding | ✅ 2/2 | With Skill |
-| [results-ollama-devstral-small-2--24b-cloud-ps-explicit-boundaries-adapters](https://github.com/Ariel-Rodriguez/programming-skills/actions/runs/21547621647/artifacts/5329535854) | devstral-small-2:24b-cloud | good | outstanding | ✅ 2/2 | With Skill |
-| [results-ollama-devstral-small-2--24b-cloud-ps-explicit-ownership-lifecycle](https://github.com/Ariel-Rodriguez/programming-skills/actions/runs/21547621647/artifacts/5329535894) | devstral-small-2:24b-cloud | good | good | ✅ 2/2 | With Skill |
-| [results-ollama-devstral-small-2--24b-cloud-ps-explicit-state-invariants](https://github.com/Ariel-Rodriguez/programming-skills/actions/runs/21547621647/artifacts/5329537422) | devstral-small-2:24b-cloud | good | outstanding | ✅ 2/2 | With Skill |
-| [results-ollama-devstral-small-2--24b-cloud-ps-functional-core-imperative-shell](https://github.com/Ariel-Rodriguez/programming-skills/actions/runs/21547621647/artifacts/5329537046) | devstral-small-2:24b-cloud | regular | good | ✅ 2/2 | With Skill |
-| [results-ollama-devstral-small-2--24b-cloud-ps-illegal-states-unrepresentable](https://github.com/Ariel-Rodriguez/programming-skills/actions/runs/21547621647/artifacts/5329538523) | devstral-small-2:24b-cloud | good | outstanding | ✅ 2/2 | With Skill |
-| [results-ollama-devstral-small-2--24b-cloud-ps-local-reasoning](https://github.com/Ariel-Rodriguez/programming-skills/actions/runs/21547621647/artifacts/5329538780) | devstral-small-2:24b-cloud | good | outstanding | ✅ 2/2 | With Skill |
-| [results-ollama-devstral-small-2--24b-cloud-ps-minimize-mutation](https://github.com/Ariel-Rodriguez/programming-skills/actions/runs/21547621647/artifacts/5329540068) | devstral-small-2:24b-cloud | good | good | ✅ 2/2 | N/A |
-| [results-ollama-devstral-small-2--24b-cloud-ps-naming-as-design](https://github.com/Ariel-Rodriguez/programming-skills/actions/runs/21547621647/artifacts/5329540040) | devstral-small-2:24b-cloud | regular | good | ✅ 2/2 | With Skill |
-| [results-ollama-devstral-small-2--24b-cloud-ps-policy-mechanism-separation](https://github.com/Ariel-Rodriguez/programming-skills/actions/runs/21547621647/artifacts/5329541792) | devstral-small-2:24b-cloud | good | outstanding | ✅ 2/2 | With Skill |
-| [results-ollama-devstral-small-2--24b-cloud-ps-single-direction-data-flow](https://github.com/Ariel-Rodriguez/programming-skills/actions/runs/21547621647/artifacts/5329541535) | devstral-small-2:24b-cloud | regular | good | ✅ 2/2 | With Skill |
-| [results-ollama-rnj-1--8b-cloud-ps-composition-over-coordination](https://github.com/Ariel-Rodriguez/programming-skills/actions/runs/21547621647/artifacts/5329524580) | rnj-1:8b-cloud | outstanding | good | ❌ 2/2 | Baseline |
-| [results-ollama-rnj-1--8b-cloud-ps-error-handling-design](https://github.com/Ariel-Rodriguez/programming-skills/actions/runs/21547621647/artifacts/5329524126) | rnj-1:8b-cloud | vague | outstanding | ✅ 2/2 | With Skill |
-| [results-ollama-rnj-1--8b-cloud-ps-explicit-boundaries-adapters](https://github.com/Ariel-Rodriguez/programming-skills/actions/runs/21547621647/artifacts/5329526125) | rnj-1:8b-cloud | regular | outstanding | ✅ 2/2 | With Skill |
-| [results-ollama-rnj-1--8b-cloud-ps-explicit-ownership-lifecycle](https://github.com/Ariel-Rodriguez/programming-skills/actions/runs/21547621647/artifacts/5329526263) | rnj-1:8b-cloud | good | outstanding | ✅ 2/2 | With Skill |
-| [results-ollama-rnj-1--8b-cloud-ps-explicit-state-invariants](https://github.com/Ariel-Rodriguez/programming-skills/actions/runs/21547621647/artifacts/5329528479) | rnj-1:8b-cloud | regular | outstanding | ✅ 2/2 | With Skill |
-| [results-ollama-rnj-1--8b-cloud-ps-functional-core-imperative-shell](https://github.com/Ariel-Rodriguez/programming-skills/actions/runs/21547621647/artifacts/5329527817) | rnj-1:8b-cloud | regular | outstanding | ✅ 2/2 | With Skill |
-| [results-ollama-rnj-1--8b-cloud-ps-illegal-states-unrepresentable](https://github.com/Ariel-Rodriguez/programming-skills/actions/runs/21547621647/artifacts/5329529527) | rnj-1:8b-cloud | outstanding | outstanding | ✅ 2/2 | N/A |
-| [results-ollama-rnj-1--8b-cloud-ps-local-reasoning](https://github.com/Ariel-Rodriguez/programming-skills/actions/runs/21547621647/artifacts/5329529241) | rnj-1:8b-cloud | vague | outstanding | ✅ 2/2 | With Skill |
-| [results-ollama-rnj-1--8b-cloud-ps-minimize-mutation](https://github.com/Ariel-Rodriguez/programming-skills/actions/runs/21547621647/artifacts/5329531124) | rnj-1:8b-cloud | regular | outstanding | ✅ 2/2 | With Skill |
-| [results-ollama-rnj-1--8b-cloud-ps-naming-as-design](https://github.com/Ariel-Rodriguez/programming-skills/actions/runs/21547621647/artifacts/5329531393) | rnj-1:8b-cloud | vague | good | ✅ 2/2 | With Skill |
-| [results-ollama-rnj-1--8b-cloud-ps-policy-mechanism-separation](https://github.com/Ariel-Rodriguez/programming-skills/actions/runs/21547621647/artifacts/5329532599) | rnj-1:8b-cloud | regular | outstanding | ✅ 2/2 | With Skill |
-| [results-ollama-rnj-1--8b-cloud-ps-single-direction-data-flow](https://github.com/Ariel-Rodriguez/programming-skills/actions/runs/21547621647/artifacts/5329532551) | rnj-1:8b-cloud | vague | good | ✅ 2/2 | With Skill |
+Dashboard:
+
+```
+https://ariel-rodriguez.github.io/programming-skills/
+```
 
 ## Documentation
 
-- [Architecture](docs/architecture.md) - Repository design & structure
+- [Architecture](docs/specs/architecture.md) - Repository design & structure
 - [Contributing](docs/contributing.md) - How to add/modify skills & benchmarks
 - [AI Prompt Wrapper](docs/ai-prompt-wrapper.md) - Configure your AI assistant
 - [Changelog](CHANGELOG.md) - Version history & skill changes

ci/consolidate_results.py

Lines changed: 3 additions & 3 deletions
@@ -113,7 +113,7 @@ def main():
     parser = argparse.ArgumentParser(description="Consolidate evaluation results")
     parser.add_argument("--mode", choices=["pr-comment", "benchmark"], default="pr-comment",
                         help="Output mode: pr-comment or benchmark")
-    parser.add_argument("--results-dir", type=Path, default="tests/results",
+    parser.add_argument("--results-dir", type=Path, default="tests/data-history",
                         help="Directory containing evaluation results")
     parser.add_argument("--output-dir", type=Path, default=None,
                         help="Output directory for benchmark mode")
@@ -126,8 +126,8 @@ def main():
 
     print(f"==> Consolidating results (mode: {args.mode})")
 
-    # Find all summary.json files
-    summary_files = sorted(args.results_dir.glob("*/summary.json"))
+    # Find all summary files
+    summary_files = sorted(args.results_dir.glob("**/summary-*.json"))
 
     if not summary_files:
         print(f"No results found in {args.results_dir}")
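
Note on the `ci/consolidate_results.py` change: the search pattern moves from `*/summary.json` (exactly one directory level, fixed file name) to `**/summary-*.json` (any depth, including the results directory itself, with a `summary-` prefix). The sketch below illustrates the difference and is not the script itself. The directory and file names it creates are hypothetical, and only the `--results-dir` option is reproduced. It also relies on `argparse` applying `type=Path` to the string default, which is why `args.results_dir.glob(...)` works without an explicit conversion.

```python
import argparse
import tempfile
from pathlib import Path


def parse_results_dir(argv: list[str]) -> Path:
    """Reproduce only the --results-dir option from the diff (other options omitted)."""
    parser = argparse.ArgumentParser(description="Consolidate evaluation results")
    parser.add_argument("--results-dir", type=Path, default="tests/data-history",
                        help="Directory containing evaluation results")
    # argparse runs string defaults through `type`, so this is a Path either way.
    return parser.parse_args(argv).results_dir


def compare_patterns(root: Path) -> tuple[list[str], list[str]]:
    """Return (old-pattern matches, new-pattern matches) as sorted relative paths."""
    old = sorted(p.relative_to(root).as_posix() for p in root.glob("*/summary.json"))
    new = sorted(p.relative_to(root).as_posix() for p in root.glob("**/summary-*.json"))
    return old, new


def demo() -> tuple[list[str], list[str]]:
    """Build a throwaway tree and run both patterns against it."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Hypothetical layout, for illustration only (not the repository's real tree).
        for rel in ("old-run/summary.json",
                    "ollama/model-a/summary-ps-x.json",
                    "summary-top.json"):
            path = root / rel
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_text("{}")
        return compare_patterns(root)


if __name__ == "__main__":
    old_matches, new_matches = demo()
    print("old pattern:", old_matches)
    print("new pattern:", new_matches)
```

One consequence worth checking in review: a file named plain `summary.json` no longer matches, so any results still written under the old name would be skipped by the new pattern.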
