275 changes: 275 additions & 0 deletions msbench/README.md
@@ -0,0 +1,275 @@
# MSBench — Skills Evaluation Benchmark

## What is MSBench?

[MSBench](https://msbenchapp.azurewebsites.net/) is Microsoft's SWE-Bench evaluation platform for assessing AI coding assistants.
It runs agent tasks inside Docker containers on a cloud execution backend (CES),
producing standardised `eval.json` results that feed a leaderboard and historical tracking.

Key components:

| Component | Purpose |
|-----------|---------|
| **msbench-cli** | Submit runs, monitor progress, generate reports |
| **CES** (Code Execution Service) | Azure-hosted Docker execution backend |
| **ACR** | Container registry (`codeexecservice.azurecr.io`) |
| **[msbench-benchmarks](https://dev.azure.com/devdiv/OnlineServices/_git/msbench-benchmarks)** | Benchmark datasets, curation scripts, parquet database |
| **harbor-format-curation** | Converts Harbor-format tasks → MSBench Docker images |
| **Leaderboard** | <https://msbenchapp.azurewebsites.net/> |

For full platform documentation see the [MSBench wiki](https://github.com/devdiv-microsoft/MicrosoftSweBench/wiki).

## What this folder does

The `msbench/` tree contains the **`dotnetskills` benchmark** — a set of
[Harbor-format](https://github.com/devdiv-microsoft/MicrosoftSweBench/wiki/3.-Adding-a-benchmark) tasks
that evaluate whether agent skills (from `plugins/`) actually improve an
agent's ability to solve .NET tasks.

It follows an **A/B pattern** (identical to the existing
[skillsbench / skillsbenchnoskills](https://dev.azure.com/devdiv/OnlineServices/_git/msbench-benchmarks?path=/benchmarks/skillsbench)
benchmark in `msbench-benchmarks`):
tasks are run twice — once with skills loaded and once without —
and the resolve-rate delta quantifies the real-world value of each onboarded skill.

### Opting in to MSBench (`msbench_ready`)

Not every `eval.yaml` is automatically converted to an MSBench task.
Each one must explicitly opt in by setting the **top-level flag**
`msbench_ready: true`:

```yaml
msbench_ready: true # ← opt-in to MSBench onboarding

scenarios:
  - name: "My scenario"
    prompt: "..."
    assertions: [...]
```

Evals without this flag are silently skipped by the converter.
This keeps the benchmark small and focused while new evals are being
developed and validated locally via the skill-validator.
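The flag check itself is simple; a minimal sketch of what the converter's filter might look like (hypothetical helper name, assuming PyYAML is installed):

```python
import yaml
from pathlib import Path

def find_ready_evals(tests_dir):
    """Yield paths of eval.yaml files that opt in with a top-level
    msbench_ready: true. Hypothetical helper; evals without the flag
    are skipped silently, matching the converter's documented behaviour."""
    for path in sorted(Path(tests_dir).rglob("eval.yaml")):
        data = yaml.safe_load(path.read_text()) or {}
        if data.get("msbench_ready") is True:
            yield path
```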

### Currently onboarded evals

| Eval | Plugin | Scenarios | Why selected |
|------|--------|-----------|--------------|
| `csharp-scripts` | dotnet | 1 | Clear deterministic assertions, proven locally |
| `dotnet-pinvoke` | dotnet | 2 | Pure output-based with 4 strong assertions per scenario |
| `msbuild-modernization` | dotnet-msbuild | 1 | Covers msbuild plugin, uses `copy_test_files` setup |

To onboard additional evals, add `msbench_ready: true` to their
`eval.yaml` and re-run the converter.

### Excluded skills

Five skills are excluded entirely (regardless of `msbench_ready`) because they
depend on MCP servers not yet available in Docker or require binary artefacts
not suited for containerised execution:
`binlog-failure-analysis`, `binlog-generation`, `build-perf-diagnostics`,
`build-parallelism`, `dump-collect`.

## Folder structure

```
msbench/
├── README.md                     ← you are here
├── dotnetskills.toml             ← harbor-format-curation config (local + prod Docker profiles)
├── version.txt                   ← benchmark version (SemVer)
├── tasks/                        ← Harbor-format tasks (auto-generated — do not edit by hand)
│   └── <plugin>--<skill>--<slug>/
│       ├── task.toml             ← metadata, tags, difficulty, resource limits
│       ├── instruction.md        ← agent prompt (the problem statement)
│       ├── environment/          ← Docker build context
│       │   ├── Dockerfile        ← .NET SDK image + fixture files + eval helpers
│       │   ├── eval_helpers/     ← assertion_runner.sh, write_eval.py, …
│       │   ├── fixtures/         ← (if the scenario has source fixtures)
│       │   └── test_files/       ← (if the scenario uses copy_test_files)
│       ├── tests/
│       │   └── test.sh           ← evaluation script → writes /output/eval.json
│       └── solution/
│           └── solve.sh          ← stub (gold solutions are authored separately)
├── agents/                       ← agent runner packages for the A/B pattern
│   ├── with-skills/              ← Copilot CLI + native skill loading
│   │   ├── runner.sh
│   │   └── config.yaml
│   └── without-skills/           ← Copilot CLI baseline (no skills)
│       ├── runner.sh
│       └── config.yaml
├── shared/
│   └── eval_helpers/             ← reusable evaluation scripts (copied into every task)
│       ├── assertion_runner.sh   ← bash assertion framework (source in test.sh)
│       ├── write_eval.py         ← generates eval.json + custom_metrics.json
│       ├── parse_build.py        ← parse dotnet build output
│       ├── parse_trx.py          ← parse .trx test-result files
│       └── check_pattern.py      ← configurable grep-based pattern checks
└── scripts/
    ├── convert_evals.py          ← converter: eval.yaml → Harbor tasks
    ├── validate_tasks.py         ← E2E structural validation
    ├── analyze_results.py        ← post-run A/B comparison report
    ├── prepare_agent_packages.sh ← copies in-scope SKILL.md files into agent package
    └── test_convert_evals.py     ← unit tests for the converter (pytest)
```

## How tasks are generated

Tasks are **not written by hand**. They are converted from the existing
`eval.yaml` files that live alongside each evaluation scenario under `tests/`
(e.g. `tests/dotnet/csharp-scripts/eval.yaml`).
The converter reads every `eval.yaml` **that has `msbench_ready: true`**,
maps each scenario to a Harbor task directory, resolves fixture paths,
generates the Dockerfile, test.sh, etc. Evals without the flag are skipped.
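For example, scenarios map to task directories following the `<plugin>--<skill>--<slug>` pattern; a hedged sketch of what that naming step might look like (the actual slug rule lives in `convert_evals.py` and may differ):

```python
import re

def task_dir_name(plugin, skill, scenario_name):
    """Derive a <plugin>--<skill>--<slug> directory name.
    Hypothetical slug rule: lowercase, runs of non-alphanumerics become '-'."""
    slug = re.sub(r"[^a-z0-9]+", "-", scenario_name.lower()).strip("-")
    return f"{plugin}--{skill}--{slug}"

print(task_dir_name("dotnet", "csharp-scripts", "My scenario"))
# dotnet--csharp-scripts--my-scenario
```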

```powershell
# Regenerate all tasks from eval.yaml sources
python msbench/scripts/convert_evals.py `
    --skills-dir plugins/ `
    --tests-dir tests/ `
    --output-dir msbench/tasks/
```

Useful converter modes:

| Flag | Behaviour |
|------|-----------|
| *(none)* | Generate / overwrite all task directories |
| `--dry-run` | Print what *would* be generated without writing files |
| `--check` | Verify existing tasks are in sync with eval.yaml; exit non-zero on drift |
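Under the hood, `--check` only needs to regenerate into a scratch directory and diff it against what is on disk; one way such a drift gate could be sketched (assumed implementation, not the converter's actual code):

```python
import filecmp

def dirs_in_sync(generated, on_disk):
    """Recursively compare two task trees; any extra, missing, or
    differing file counts as drift (sketch of a --check style gate)."""
    cmp = filecmp.dircmp(generated, on_disk)
    if cmp.left_only or cmp.right_only or cmp.diff_files:
        return False
    return all(dirs_in_sync(sub.left, sub.right) for sub in cmp.subdirs.values())
```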

After regenerating, validate:

```powershell
python msbench/scripts/validate_tasks.py `
    --tasks-dir msbench/tasks/ `
    --tests-dir tests/ `
    --skills-dir plugins/
```

## Local usage

### Prerequisites

- Python 3.10+ with `pyyaml` installed (`pip install pyyaml`)
- Docker (for building / running images locally)
- `msbench-cli` installed ([MicrosoftSweBench](https://dev.azure.com/devdiv/InternalTools/_git/MicrosoftSweBench))

### 1. Generate and validate tasks

```powershell
# From the repo root
python msbench/scripts/convert_evals.py --skills-dir plugins/ --tests-dir tests/ --output-dir msbench/tasks/
python msbench/scripts/validate_tasks.py --tasks-dir msbench/tasks/ --tests-dir tests/ --skills-dir plugins/
```

### 2. Build Docker images locally

Use `harbor-format-curation` (from the `msbench-benchmarks` repo) pointed at
`dotnetskills.toml` with the `docker.local` profile:

```bash
# Inside the msbench-benchmarks repo (with harbor-format-curation installed)
harbor-curation build \
  --config /path/to/skills/msbench/dotnetskills.toml \
  --profile local \
  --tasks-dir /path/to/skills/msbench/tasks/
```

This builds images tagged like
`localhost:5000/dotnetskills.eval.x86_64.<task-name>:msbench-0.1.0`.

### 3. Run a single task manually

```bash
docker run --rm \
  -v /tmp/output:/output \
  localhost:5000/dotnetskills.eval.x86_64.<task-name>:msbench-0.1.0
```

The container produces `/output/eval.json` with the result.

### 4. Submit via msbench-cli (against CES)

```bash
# With skills
msbench-cli run submit \
  --benchmark dotnetskills \
  --agent-dir msbench/agents/with-skills/ \
  --tag skills=enabled

# Without skills (baseline)
msbench-cli run submit \
  --benchmark dotnetskills \
  --agent-dir msbench/agents/without-skills/ \
  --tag skills=disabled
```

### 5. Compare results

After both runs complete, download the results and diff them:

```bash
python msbench/scripts/analyze_results.py \
  --with-skills results/with-skills/ \
  --without-skills results/without-skills/
```

This prints a report showing per-task, per-skill, and overall resolve-rate
delta between the two runs.
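The underlying arithmetic is straightforward; a hedged sketch, assuming each task's result directory holds an `eval.json` with a boolean `resolved` field (a guessed schema, not documented here):

```python
import json
from pathlib import Path

def resolve_rate(results_dir):
    """Fraction of tasks marked resolved. Assumes one eval.json per task
    with a boolean 'resolved' field (hypothetical schema)."""
    evals = list(Path(results_dir).rglob("eval.json"))
    if not evals:
        return 0.0
    hits = sum(1 for p in evals if json.loads(p.read_text()).get("resolved"))
    return hits / len(evals)

delta = resolve_rate("results/with-skills") - resolve_rate("results/without-skills")
print(f"Resolve-rate delta: {delta:+.1%}")
```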

### Running unit tests

```powershell
python -m pytest msbench/scripts/test_convert_evals.py -v
```

## Pipeline usage (CI/CD)

The benchmark is designed to run in Azure Pipelines. A typical pipeline does:

1. **Generate & validate** — run the converter in `--check` mode to ensure
   tasks are in sync; fail the build if they drift.
2. **Build images** — invoke `harbor-format-curation` with the `docker.prod`
   profile to build and push images to ACR (`codeexecservice.azurecr.io`).
3. **Submit A/B runs** — use `msbench-cli run submit` twice (with and
   without skills), blocking until both complete.
4. **Analyse** — run `analyze_results.py` on the two result sets and
   publish the summary as a pipeline artefact.

### Pipeline-specific environment

| Variable | Purpose |
|----------|---------|
| `BENCHMARK_PARQUET_PATH` | Path to the benchmark parquet (set by the msbench-benchmarks package) |
| `CES_ENVIRONMENT` | CES backend to target (`ces-dev1`, `ces-staging`, `ces-ame`) |

### Sync-check gate

Add this as an early pipeline step to fail fast if someone edits an
`eval.yaml` without regenerating the Harbor tasks:

```yaml
- script: |
    python msbench/scripts/convert_evals.py \
      --skills-dir plugins/ \
      --tests-dir tests/ \
      --output-dir msbench/tasks/ \
      --check
  displayName: "Verify Harbor tasks are in sync with eval.yaml"
```

## Versioning

The benchmark version lives in `version.txt` and follows SemVer:

- **Patch** — refresh task content (re-run converter after eval.yaml edits)
- **Minor** — add new skills or tasks
- **Major** — breaking changes to the evaluation schema

Image tags follow the pattern
`dotnetskills.eval.x86_64.<task-name>:msbench-<version>`.
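Composing a tag from `version.txt` is mechanical; a small sketch of the pattern (helper name and example task name are illustrative):

```python
def image_tag(task_name, version, benchmark="dotnetskills", arch="x86_64"):
    """Compose an image tag following the documented pattern."""
    return f"{benchmark}.eval.{arch}.{task_name}:msbench-{version}"

print(image_tag("dotnet--csharp-scripts--basic", "0.1.0"))
# dotnetskills.eval.x86_64.dotnet--csharp-scripts--basic:msbench-0.1.0
```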
15 changes: 15 additions & 0 deletions msbench/agents/with-skills/config.yaml
@@ -0,0 +1,15 @@
# Agent configuration for Copilot CLI with skills enabled
agent:
  name: "github-copilot-cli"
  description: "GitHub Copilot CLI with dotnet skills loaded via native plugin discovery"
  tags:
    skills: "enabled"

skills:
  enabled: true
  directories:
    - "/agent/skills/dotnet"
    - "/agent/skills/dotnet-msbuild"

resources:
  timeout_sec: 600
25 changes: 25 additions & 0 deletions msbench/agents/with-skills/runner.sh
@@ -0,0 +1,25 @@
#!/bin/bash
set -euo pipefail

# Read instance metadata
METADATA_PATH="${METADATA_PATH:-/drop/metadata.json}"
INSTANCE_ID=$(python3 -c "import json; print(json.load(open('$METADATA_PATH'))['instance_id'])")

# The Copilot CLI special agent is configured to load skills from /agent/skills/
# via its native SessionConfig.SkillDirectories mechanism.

# Write skill metadata for custom_metrics tracking.
# instance_id format: <plugin>--<skill>--<slug>
PLUGIN_NAME=$(echo "$INSTANCE_ID" | awk -F'--' '{print $1}')
SKILL_NAME=$(echo "$INSTANCE_ID" | awk -F'--' '{print $2}')
SKILL_DIR="/agent/skills/${PLUGIN_NAME}/skills/${SKILL_NAME}"

echo "{\"skill_dir\": \"$SKILL_DIR\", \"skill_injected\": $([ -d \"$SKILL_DIR\" ] && echo true || echo false)}" > /agent/skill_metadata.json

# Copilot CLI invocation with native skill loading:
# The --skill-dirs flag points to the plugin directories:
ghcs run \
  --skill-dirs /agent/skills/dotnet,/agent/skills/dotnet-msbuild \
  --workspace /testbed \
  --prompt-file /drop/metadata.json \
  --output-dir /output \
  2>&1 | tee /output/trajectory.txt
12 changes: 12 additions & 0 deletions msbench/agents/without-skills/config.yaml
@@ -0,0 +1,12 @@
# Agent configuration for Copilot CLI without skills (baseline)
agent:
  name: "github-copilot-cli"
  description: "GitHub Copilot CLI baseline — no skills loaded"
  tags:
    skills: "disabled"

skills:
  enabled: false

resources:
  timeout_sec: 600
13 changes: 13 additions & 0 deletions msbench/agents/without-skills/runner.sh
@@ -0,0 +1,13 @@
#!/bin/bash
set -euo pipefail

METADATA_PATH="${METADATA_PATH:-/drop/metadata.json}"

# No skill directories — baseline Copilot CLI run
echo "{\"skill_injected\": false}" > /agent/skill_metadata.json

ghcs run \
  --workspace /testbed \
  --prompt-file /drop/metadata.json \
  --output-dir /output \
  2>&1 | tee /output/trajectory.txt
38 changes: 38 additions & 0 deletions msbench/dotnetskills.toml
@@ -0,0 +1,38 @@
[maintainer]
name = "dotnet/skills team"
team = ".NET Developer Experience"

[import]
from = "path"
path = "c:/src/skills/msbench/tasks"
staging-folder = "${TEMP}/harborcuration/dotnetskills"
dataset = "dotnetskills"

# Original remote config (for CI/prod):
# from = "path-clone"
# repository = "https://dev.azure.com/devdiv/_git/skills"
# path = "msbench/tasks"

[docker.local]
benchmark = "dotnetskills"
type = "eval"
architecture = "x86_64"
version = "0.1.0"
registry = "localhost:5000"
provider = "Docker localhost registry"

[docker.prod]
benchmark = "dotnetskills"
type = "eval"
architecture = "x86_64"
version = "0.1.0"
registry = "codeexecservice.azurecr.io"
provider = "Azure container registry"

[tasks.exclude]
# Tasks that fail to build (maintain as needed)

[tasks.skip-verification]
# Tasks where oracle verification is skipped (analysis-style prompts)
# Analysis tasks don't have deterministic gold solutions
# dotnet--analyzing-dotnet-performance--compiled-regex-startup-budget-regex-chain = [{skip = "oracle", reason = "Analysis task, no deterministic gold solution"}]