68 changes: 10 additions & 58 deletions .github/workflows/benchmark-dashboard.yml
@@ -21,88 +21,40 @@ on:
type: string

jobs:
run-benchmark:
run-benchmark-and-publish:
runs-on: ubuntu-latest

steps:
- uses: actions/checkout@v4
with:
ref: main
path: workspace
path: repo

- uses: astral-sh/setup-uv@v5
with:
enable-cache: true

- name: Run benchmark
working-directory: workspace
working-directory: repo
env:
OLLAMA_API_KEY: ${{ secrets.OLLAMA_API_KEY }}
run: |
# Install dependencies
uv sync --project tests
# Run benchmark
uv run --project tests tests/evaluator.py \
--provider ${{ inputs.provider }} \
--model ${{ inputs.model }} \
--judge \
--verbose \
--report
# Rename artifact for clarity
if [ -d tests/results ]; then
ARTIFACT_NAME="benchmark-${{ inputs.provider }}-${{ inputs.model }}-$(date +%Y%m%d-%H%M%S)"
mv tests/results "tests/${ARTIFACT_NAME}"
fi
# Create docs/benchmarks if it doesn't exist for publish_benchmarks.py
mkdir -p docs/benchmarks
--report \
--all

- name: Generate dashboard
working-directory: workspace
- name: Publish to benchmark-history branch
working-directory: repo
run: |
# Run publish_benchmarks.py with correct paths
uv run --project tests python3 ci/publish_benchmarks.py \
--provider ${{ inputs.provider }} \
--model ${{ inputs.model }} \
--branch benchmark-history \
--no-benchmark

- name: Generate dashboard
working-directory: workspace
run: |
uv run --project tests python3 ci/publish_benchmarks.py \
--provider ${{ inputs.provider }} \
--model ${{ inputs.model }} \
--branch benchmark-history

deploy-pages:
needs: run-benchmark
runs-on: ubuntu-latest

steps:
- name: Checkout workspace
uses: actions/checkout@v4
with:
ref: main
path: workspace

- name: Checkout benchmark data
uses: actions/checkout@v4
with:
ref: benchmark-history
path: benchmark-data

- name: Copy results to docs
run: |
mkdir -p workspace/docs/benchmarks
cp benchmark-data/docs/benchmarks.json workspace/docs/benchmarks.json 2>/dev/null || true
cp benchmark-data/docs/index.html workspace/docs/index.html 2>/dev/null || true
# Also copy individual benchmark results if they exist
cp -r benchmark-data/docs/benchmarks/*.json workspace/docs/benchmarks/ 2>/dev/null || true

- name: Commit and push updates
working-directory: workspace
run: |
git config user.name "GitHub Actions"
git config user.email "actions@github.com"
git add docs/
git commit -m "Update benchmark data" || echo "No changes to commit"
git push origin HEAD:benchmark-history
--no-benchmark \
--output-dir "site/benchmarks"
11 changes: 3 additions & 8 deletions .github/workflows/skill-validation.yml
@@ -107,11 +107,6 @@ jobs:
# Construct arguments array for safety
ARGS=(--provider "${{ matrix.provider }}" --model "${{ matrix.model }}" --judge --verbose --report --threshold 50)

if [ -n "${{ matrix.extra_args }}" ]; then
# Split extra_args safely if needed, but for now assuming simple flags
ARGS+=(${{ matrix.extra_args }})
fi

if [ -n "${{ matrix.skill }}" ]; then
ARGS+=(--skill "${{ matrix.skill }}")
else
@@ -128,7 +123,7 @@ jobs:
uses: actions/upload-artifact@v4
with:
name: ${{ steps.artifact.outputs.name }}
path: pr-code/tests/results/
path: pr-code/tests/data-history/
retention-days: 1

consolidate:
@@ -145,10 +140,10 @@ jobs:

- uses: actions/download-artifact@v4
with:
path: pr-code/tests/results/
path: pr-code/tests/data-history/

- name: Consolidate results
run: python3 trusted-scripts/ci/consolidate_results.py --results-dir pr-code/tests/results --output-file pr-code/comment.md
run: python3 trusted-scripts/ci/consolidate_results.py --results-dir pr-code/tests/data-history --output-file pr-code/comment.md

- name: Post to PR
uses: marocchino/sticky-pull-request-comment@v2
4 changes: 3 additions & 1 deletion .gitignore
@@ -46,4 +46,6 @@ scratch-*

# CI/Validation Artifacts
comment.md
results/

# Benchmark site output
site/
58 changes: 23 additions & 35 deletions README.md
@@ -2,6 +2,10 @@

Language-agnostic AI agent skills that enforce fundamental programming principles. This repository provides specific, granular instructions that enable AI coding assistants to produce significantly higher-quality code that adheres to robust engineering standards.

| Dashboard Explorer | Code Comparison | Judge Reasoning |
| :---: | :---: | :---: |
| ![Dashboard](docs/img/dashboard.png) | ![Code Comparison](docs/img/compare-code.png) | ![Judge Results](docs/img/compare-judge-results.png) |

Adopting these skills measurably changes the output of AI models, shifting them from generating merely functional code to producing architecturally sound solutions.

## Table of Contents
@@ -15,66 +19,50 @@ Adopting these skills measurably changes the output of AI models, shifting them

## Installation

Select your platform for specific setup instructions:
See:

- [Cursor](docs/install/cursor.md)
- [Antigravity](docs/install/antigravity.md)
- [GitHub Copilot](docs/install/copilot.md)
- [Claude](docs/install/claude.md)
- [Install Instructions](docs/install-instructions.md)

## How it Works

The core of this repository is the `skills/` directory. Each skill is encapsulated in its own subdirectory following the `ps-<name>` convention (e.g., `ps-composition-over-coordination`).

We use this granular structure because:

1. **Focus**: It allows the AI to load only the relevant context for a specific task, avoiding context window pollution.
2. **Modularity**: Skills can be improved, versioned, and tested independently.
3. **Composability**: Users can select the specific combination of principles they want to enforce for their project.

## Skill Integration

Skills should live under the `skills/` directory as `SKILL.md` files. For a full integration guide and documentation index:

```
https://agentskills.io/integrate-skills
https://agentskills.io/llms.txt
```

## Validation & Testing

Every skill is validated against a rigorous testing suite found in the `tests/` directory.

- **Automated Judging**: We use an LLM-as-a-Judge approach. The system compares the output of a "Baseline" model (without the skill) against a "Skill" model (with the skill loaded).
- **Semantics over Syntax**: The test does not just look for passing unit tests; it analyzes the *logic* and *structure* of the code.
- **Semantics over Syntax**: The test does not just look for passing unit tests; it analyzes the _logic_ and _structure_ of the code.
- **Evidence-Based**: The judge identifies the specific lines of code that demonstrate adherence to or violation of the principle.

[Read our Case Study on Judge Fairness](docs/judge-fairness-case-study.md) to see how the system fairly evaluates architectural quality, even when it means failing the Skill model.

## Evaluation Results

Processed 24 evaluation(s).

| Test Name | Model | Baseline | With Skill | Cases Pass | Winner |
|-----------|-------|----------|------------|------------|--------|
| [results-ollama-devstral-small-2--24b-cloud-ps-composition-over-coordination](https://github.com/Ariel-Rodriguez/programming-skills/actions/runs/21547621647/artifacts/5329534502) | devstral-small-2:24b-cloud | good | good | ✅ 2/2 | N/A |
| [results-ollama-devstral-small-2--24b-cloud-ps-error-handling-design](https://github.com/Ariel-Rodriguez/programming-skills/actions/runs/21547621647/artifacts/5329534340) | devstral-small-2:24b-cloud | regular | outstanding | ✅ 2/2 | With Skill |
| [results-ollama-devstral-small-2--24b-cloud-ps-explicit-boundaries-adapters](https://github.com/Ariel-Rodriguez/programming-skills/actions/runs/21547621647/artifacts/5329535854) | devstral-small-2:24b-cloud | good | outstanding | ✅ 2/2 | With Skill |
| [results-ollama-devstral-small-2--24b-cloud-ps-explicit-ownership-lifecycle](https://github.com/Ariel-Rodriguez/programming-skills/actions/runs/21547621647/artifacts/5329535894) | devstral-small-2:24b-cloud | good | good | ✅ 2/2 | With Skill |
| [results-ollama-devstral-small-2--24b-cloud-ps-explicit-state-invariants](https://github.com/Ariel-Rodriguez/programming-skills/actions/runs/21547621647/artifacts/5329537422) | devstral-small-2:24b-cloud | good | outstanding | ✅ 2/2 | With Skill |
| [results-ollama-devstral-small-2--24b-cloud-ps-functional-core-imperative-shell](https://github.com/Ariel-Rodriguez/programming-skills/actions/runs/21547621647/artifacts/5329537046) | devstral-small-2:24b-cloud | regular | good | ✅ 2/2 | With Skill |
| [results-ollama-devstral-small-2--24b-cloud-ps-illegal-states-unrepresentable](https://github.com/Ariel-Rodriguez/programming-skills/actions/runs/21547621647/artifacts/5329538523) | devstral-small-2:24b-cloud | good | outstanding | ✅ 2/2 | With Skill |
| [results-ollama-devstral-small-2--24b-cloud-ps-local-reasoning](https://github.com/Ariel-Rodriguez/programming-skills/actions/runs/21547621647/artifacts/5329538780) | devstral-small-2:24b-cloud | good | outstanding | ✅ 2/2 | With Skill |
| [results-ollama-devstral-small-2--24b-cloud-ps-minimize-mutation](https://github.com/Ariel-Rodriguez/programming-skills/actions/runs/21547621647/artifacts/5329540068) | devstral-small-2:24b-cloud | good | good | ✅ 2/2 | N/A |
| [results-ollama-devstral-small-2--24b-cloud-ps-naming-as-design](https://github.com/Ariel-Rodriguez/programming-skills/actions/runs/21547621647/artifacts/5329540040) | devstral-small-2:24b-cloud | regular | good | ✅ 2/2 | With Skill |
| [results-ollama-devstral-small-2--24b-cloud-ps-policy-mechanism-separation](https://github.com/Ariel-Rodriguez/programming-skills/actions/runs/21547621647/artifacts/5329541792) | devstral-small-2:24b-cloud | good | outstanding | ✅ 2/2 | With Skill |
| [results-ollama-devstral-small-2--24b-cloud-ps-single-direction-data-flow](https://github.com/Ariel-Rodriguez/programming-skills/actions/runs/21547621647/artifacts/5329541535) | devstral-small-2:24b-cloud | regular | good | ✅ 2/2 | With Skill |
| [results-ollama-rnj-1--8b-cloud-ps-composition-over-coordination](https://github.com/Ariel-Rodriguez/programming-skills/actions/runs/21547621647/artifacts/5329524580) | rnj-1:8b-cloud | outstanding | good | ❌ 2/2 | Baseline |
| [results-ollama-rnj-1--8b-cloud-ps-error-handling-design](https://github.com/Ariel-Rodriguez/programming-skills/actions/runs/21547621647/artifacts/5329524126) | rnj-1:8b-cloud | vague | outstanding | ✅ 2/2 | With Skill |
| [results-ollama-rnj-1--8b-cloud-ps-explicit-boundaries-adapters](https://github.com/Ariel-Rodriguez/programming-skills/actions/runs/21547621647/artifacts/5329526125) | rnj-1:8b-cloud | regular | outstanding | ✅ 2/2 | With Skill |
| [results-ollama-rnj-1--8b-cloud-ps-explicit-ownership-lifecycle](https://github.com/Ariel-Rodriguez/programming-skills/actions/runs/21547621647/artifacts/5329526263) | rnj-1:8b-cloud | good | outstanding | ✅ 2/2 | With Skill |
| [results-ollama-rnj-1--8b-cloud-ps-explicit-state-invariants](https://github.com/Ariel-Rodriguez/programming-skills/actions/runs/21547621647/artifacts/5329528479) | rnj-1:8b-cloud | regular | outstanding | ✅ 2/2 | With Skill |
| [results-ollama-rnj-1--8b-cloud-ps-functional-core-imperative-shell](https://github.com/Ariel-Rodriguez/programming-skills/actions/runs/21547621647/artifacts/5329527817) | rnj-1:8b-cloud | regular | outstanding | ✅ 2/2 | With Skill |
| [results-ollama-rnj-1--8b-cloud-ps-illegal-states-unrepresentable](https://github.com/Ariel-Rodriguez/programming-skills/actions/runs/21547621647/artifacts/5329529527) | rnj-1:8b-cloud | outstanding | outstanding | ✅ 2/2 | N/A |
| [results-ollama-rnj-1--8b-cloud-ps-local-reasoning](https://github.com/Ariel-Rodriguez/programming-skills/actions/runs/21547621647/artifacts/5329529241) | rnj-1:8b-cloud | vague | outstanding | ✅ 2/2 | With Skill |
| [results-ollama-rnj-1--8b-cloud-ps-minimize-mutation](https://github.com/Ariel-Rodriguez/programming-skills/actions/runs/21547621647/artifacts/5329531124) | rnj-1:8b-cloud | regular | outstanding | ✅ 2/2 | With Skill |
| [results-ollama-rnj-1--8b-cloud-ps-naming-as-design](https://github.com/Ariel-Rodriguez/programming-skills/actions/runs/21547621647/artifacts/5329531393) | rnj-1:8b-cloud | vague | good | ✅ 2/2 | With Skill |
| [results-ollama-rnj-1--8b-cloud-ps-policy-mechanism-separation](https://github.com/Ariel-Rodriguez/programming-skills/actions/runs/21547621647/artifacts/5329532599) | rnj-1:8b-cloud | regular | outstanding | ✅ 2/2 | With Skill |
| [results-ollama-rnj-1--8b-cloud-ps-single-direction-data-flow](https://github.com/Ariel-Rodriguez/programming-skills/actions/runs/21547621647/artifacts/5329532551) | rnj-1:8b-cloud | vague | good | ✅ 2/2 | With Skill |
Dashboard:

```
https://ariel-rodriguez.github.io/programming-skills/
```

## Documentation

- [Architecture](docs/architecture.md) - Repository design & structure
- [Architecture](docs/specs/architecture.md) - Repository design & structure
- [Contributing](docs/contributing.md) - How to add/modify skills & benchmarks
- [AI Prompt Wrapper](docs/ai-prompt-wrapper.md) - Configure your AI assistant
- [Changelog](CHANGELOG.md) - Version history & skill changes
6 changes: 3 additions & 3 deletions ci/consolidate_results.py
@@ -113,7 +113,7 @@ def main():
parser = argparse.ArgumentParser(description="Consolidate evaluation results")
parser.add_argument("--mode", choices=["pr-comment", "benchmark"], default="pr-comment",
help="Output mode: pr-comment or benchmark")
parser.add_argument("--results-dir", type=Path, default="tests/results",
parser.add_argument("--results-dir", type=Path, default="tests/data-history",
help="Directory containing evaluation results")
parser.add_argument("--output-dir", type=Path, default=None,
help="Output directory for benchmark mode")
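
Note on the `--results-dir` default above: the new default is still a plain string, yet the script later calls `args.results_dir.glob(...)`. That works because argparse runs the `type` callable on string defaults as well as on command-line values. A minimal sketch of that behavior, reusing only the one argument from the diff (the variable names are illustrative, not taken from the script):

```python
import argparse
from pathlib import Path

# Same argument definition as in the diff; the other options are omitted here.
parser = argparse.ArgumentParser(description="Consolidate evaluation results")
parser.add_argument("--results-dir", type=Path, default="tests/data-history",
                    help="Directory containing evaluation results")

# No flags given: argparse converts the string default through `type`.
default_args = parser.parse_args([])

# Explicit value, as passed by the workflow's "Consolidate results" step.
override_args = parser.parse_args(["--results-dir", "pr-code/tests/data-history"])

print(type(default_args.results_dir).__name__, default_args.results_dir)
print(override_args.results_dir)
```

Writing the default as `Path("tests/data-history")` would behave the same and might make the intent more obvious, but the string form is not a bug.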
@@ -126,8 +126,8 @@ def main():

print(f"==> Consolidating results (mode: {args.mode})")

# Find all summary.json files
summary_files = sorted(args.results_dir.glob("*/summary.json"))
# Find all summary files
summary_files = sorted(args.results_dir.glob("**/summary-*.json"))

if not summary_files:
print(f"No results found in {args.results_dir}")
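
Note on the glob change above: the two patterns differ in both depth and filename. `*/summary.json` matches a file named exactly `summary.json` one directory below the results directory. `**/summary-*.json` matches any file whose name starts with `summary-` at any depth, including the results directory itself, and it no longer matches a bare `summary.json`. The sketch below compares the two patterns on a throwaway tree. The directory and file names are invented for illustration and are not the evaluator's actual layout, and `find_summaries` and `build_sample_tree` are hypothetical helpers.

```python
import tempfile
from pathlib import Path


def find_summaries(results_dir: Path, pattern: str) -> list:
    """Return matches relative to results_dir, sorted for stable output."""
    return sorted(p.relative_to(results_dir).as_posix()
                  for p in results_dir.glob(pattern))


def build_sample_tree(root: Path) -> None:
    # Hypothetical layout, invented for this comparison only.
    for rel in (
        "run-a/summary.json",
        "run-b/summary-20260101.json",
        "run-c/nested/summary-20260102.json",
        "summary-top.json",
    ):
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text("{}")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    build_sample_tree(root)
    old_matches = find_summaries(root, "*/summary.json")      # pattern before this PR
    new_matches = find_summaries(root, "**/summary-*.json")   # pattern after this PR

print("old:", old_matches)
print("new:", new_matches)
```

On this sample tree the old pattern finds only `run-a/summary.json`, and the new pattern finds the three `summary-*.json` files. Any result directory still written in the old `summary.json` format would therefore be skipped by the consolidation step after this change.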