Commit e54a692

feat: migrate benchmark quality pipeline to policy + OSS tooling (#4) (#29)
* feat: add benchmark aggregation fields and stats validation (#14)
* feat: add variance and outlier policy checks (#15)
* ci: enforce benchmark quality thresholds and publish summary artifact (#16)
* docs: add OSS benchmark statistics migration design spec
* feat: migrate benchmark quality pipeline to policy + OSS tooling
* chore: generalize ignore rules for generated benchmark outputs
* fix: restrict benchmark summary output path to results/latest
1 parent b4f43c2 commit e54a692

14 files changed: 1059 additions & 137 deletions

.github/workflows/ci.yml

Lines changed: 13 additions & 0 deletions
@@ -40,7 +40,20 @@ jobs:
       - uses: actions/setup-go@v5
         with:
           go-version-file: go.mod
+      - name: Install benchmark quality tools
+        run: |
+          sudo apt-get update
+          sudo apt-get install -y hyperfine
+          go install golang.org/x/perf/cmd/benchstat@latest
+          echo "$(go env GOPATH)/bin" >> "$GITHUB_PATH"
       - name: Run benchmark script smoke
         run: bash scripts/run-all.sh
       - name: Generate report from raw results
         run: python3 scripts/generate-report.py
+      - name: Run statistical quality gate
+        run: make ci-benchmark-quality-check
+      - name: Upload benchmark quality summary
+        uses: actions/upload-artifact@v4
+        with:
+          name: benchmark-quality-summary
+          path: results/latest/benchmark-quality-summary.json

.gitignore

Lines changed: 11 additions & 3 deletions
@@ -1,4 +1,12 @@
 # Generated benchmark artifacts
-results/latest/raw/*.json
-results/latest/summary.json
-results/latest/report.md
+results/latest/**
+!results/latest/.gitkeep
+
+# Local coverage outputs
+coverage.out
+.coverage*
+*.coverprofile
+
+# Python cache/bytecode
+**/__pycache__/
+*.py[cod]

METHODOLOGY.md

Lines changed: 15 additions & 5 deletions
@@ -20,24 +20,34 @@
 - Docker + Docker Compose for service orchestration
 - Go parity runner (`cmd/parity-test`)
 - shell scripts in `scripts/` for orchestration
-- Python 3 report generator (`scripts/generate-report.py`)
+- `hyperfine` benchmark engine (optional via `BENCH_ENGINE=hyperfine`)
+- `benchstat` statistical comparison for quality gates
+- policy file: `stats-policy.yaml`
+- Python 3 report and normalization tooling in `scripts/`

 ## Baseline benchmark profile

-- warmup requests: 1000
-- request threads: 12
-- concurrent connections: 400
-- run duration: 30s
+- warmup requests: 100 (legacy engine path)
+- benchmark requests per run: 300
 - runs per target: 3 (median reported)

+## Quality policy
+
+- thresholds and required metrics are defined in `stats-policy.yaml`
+- `make ci-benchmark-quality-check` enforces policy locally and in CI
+- benchstat comparisons are evaluated against the policy's baseline framework (`baseline` by default)
+
 ## Reporting

 - raw run outputs: `results/latest/raw/`
 - normalized summary: `results/latest/summary.json`
 - markdown report: `results/latest/report.md`
+- quality summary: `results/latest/benchmark-quality-summary.json`
+- optional tool artifacts: `results/latest/tooling/benchstat/*.txt`

 ## Interpretation guidance

 - treat parity failures as correctness blockers, not performance regressions
 - compare medians first, then inspect distribution variance
+- use benchstat deltas and policy thresholds for pass/fail interpretation
 - annotate environment drift (host type, CPU, memory, Docker version) in report notes
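For orientation (editor's note, not part of this diff): `benchstat` compares two sets of Go benchmark-format samples and reports deltas with significance at the configured alpha. Assuming the pipeline writes per-framework samples under `results/latest/tooling/benchstat/` (the file names below are hypothetical), a comparison looks like:

```bash
# Compare baseline samples against a candidate framework's samples.
benchstat results/latest/tooling/benchstat/baseline.txt \
          results/latest/tooling/benchstat/modkit.txt
```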

Makefile

Lines changed: 13 additions & 1 deletion
@@ -1,4 +1,4 @@
-.PHONY: benchmark benchmark-modkit benchmark-nestjs benchmark-baseline benchmark-wire benchmark-fx benchmark-do report test parity-check parity-check-modkit parity-check-nestjs benchmark-fingerprint-check benchmark-limits-check benchmark-manifest-check
+.PHONY: benchmark benchmark-modkit benchmark-nestjs benchmark-baseline benchmark-wire benchmark-fx benchmark-do report test parity-check parity-check-modkit parity-check-nestjs benchmark-fingerprint-check benchmark-limits-check benchmark-manifest-check benchmark-stats-check benchmark-variance-check benchmark-benchstat-check ci-benchmark-quality-check

 benchmark:
 	bash scripts/run-all.sh
@@ -44,3 +44,15 @@ benchmark-limits-check:

 benchmark-manifest-check:
 	python3 scripts/environment-manifest.py check-manifest --file results/latest/environment.manifest.json
+
+benchmark-stats-check:
+	python3 scripts/benchmark-quality-check.py stats-check
+
+benchmark-variance-check:
+	python3 scripts/benchmark-quality-check.py variance-check
+
+benchmark-benchstat-check:
+	python3 scripts/benchmark-quality-check.py benchstat-check
+
+ci-benchmark-quality-check:
+	python3 scripts/benchmark-quality-check.py ci-check
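The script behind these targets is part of the commit but not shown in this view. A minimal sketch of the subcommand contract the Makefile implies (hypothetical internals; only the four subcommand names come from the targets above):

```python
#!/usr/bin/env python3
"""Sketch of the CLI surface assumed for scripts/benchmark-quality-check.py."""
import argparse
import sys

GATES = ["stats-check", "variance-check", "benchstat-check"]

def run_gate(name: str) -> bool:
    # Placeholder: each gate would load stats-policy.yaml and the normalized
    # artifacts under results/latest/, then return pass/fail.
    print(f"[sketch] running {name}")
    return True

def main() -> int:
    parser = argparse.ArgumentParser(prog="benchmark-quality-check")
    parser.add_argument("command", choices=GATES + ["ci-check"])
    args = parser.parse_args()
    gates = GATES if args.command == "ci-check" else [args.command]
    results = [run_gate(g) for g in gates]  # ci-check runs every gate
    return 0 if all(results) else 1

if __name__ == "__main__":
    sys.exit(main())
```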

README.md

Lines changed: 14 additions & 0 deletions
@@ -21,8 +21,22 @@ Run benchmark orchestration and generate a report:
 ```bash
 make benchmark
 make report
+make ci-benchmark-quality-check
 ```

+Use OSS measurement engine (optional):
+
+```bash
+BENCH_ENGINE=hyperfine make benchmark
+```
+
+## Tooling prerequisites
+
+- Go (for `go test` and `benchstat`)
+- Python 3
+- hyperfine (optional benchmark engine)
+- benchstat (`go install golang.org/x/perf/cmd/benchstat@latest`)
+
 ## Repository layout

 ```text

docs/architecture.md

Lines changed: 6 additions & 4 deletions
@@ -36,13 +36,15 @@ results/latest/ benchmark outputs and generated report

 1. Launch target services
 2. Run parity checks per target
-3. Run load benchmarks for parity-passing targets
-4. Save raw outputs
-5. Build `summary.json`
-6. Generate `report.md`
+3. Run load benchmarks for parity-passing targets (`legacy` engine or `hyperfine`)
+4. Normalize and save raw outputs
+5. Run policy quality gates (`stats-policy.yaml` + benchstat)
+6. Build `summary.json`
+7. Generate `report.md`

 ## Failure model

 - parity failures do not stop fixture file iteration; they aggregate and fail at the end
 - benchmark runs should short-circuit per target if parity fails
 - report generation should tolerate partial target results and mark skipped targets
+- quality gate summary should always be emitted, including on all-skipped smoke runs
docs/design/003-benchmark-statistics-oss-migration.md (new file)

Lines changed: 138 additions & 0 deletions

# Design 003: OSS-Based Benchmark Statistics and CI Quality Gates

## 1. Goal
Replace custom statistical processing logic with OSS benchmark/statistics tooling while preserving repository-specific benchmark orchestration, parity-first gating, and artifact contracts.

**Why:**
- OSS tools reduce maintenance burden and improve methodological confidence.
- Statistical quality rules should be explicit, versioned, and enforced consistently in local runs and CI.
- Existing result/report paths should stay stable to avoid breaking downstream workflows.

## 2. Scope
In:
1. Integrate `hyperfine` as the benchmark measurement engine.
2. Integrate `benchstat` for statistical pass/fail checks.
3. Add a versioned policy file (`stats-policy.yaml`) for thresholds and rules.
4. Keep `results/latest/*` artifacts stable for summary/report consumers.
5. Update docs to reflect the new measurement and quality gate model.

Out:
1. Replacing parity contract behavior or matcher semantics.
2. Dashboard or visualization redesign.
3. Framework-specific performance tuning unrelated to methodology.

## 3. Non-Goals
1. Rewriting the benchmark orchestrator from scratch.
2. Removing all custom scripts (thin adapters/orchestration remain expected).
3. Changing issue-driven CI policy outside benchmark quality gates.

## 4. Architecture

### 4.1. Target Flow
1. **Parity Gate (existing behavior):** health check -> parity check per target.
2. **Measurement:** run benchmark samples using `hyperfine`.
3. **Normalization:** transform tool-native output into the repo raw schema.
4. **Quality Gate:** run `benchstat`-based policy checks.
5. **Reporting:** generate `summary.json` and `report.md` from normalized artifacts.
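An illustrative normalization sketch (editor's example, not part of this commit): hyperfine's `--export-json` output nests per-command stats under a `results` array, and a thin adapter can map one entry onto a raw-schema record. The record field names below are assumptions; only the hyperfine JSON keys are standard.

```python
#!/usr/bin/env python3
"""Sketch only: normalize hyperfine --export-json output into a raw-schema
record. The actual raw schema in this commit is not shown here."""
import json
import statistics
import sys

def normalize(hyperfine_json_path: str, target: str) -> dict:
    with open(hyperfine_json_path) as f:
        result = json.load(f)["results"][0]  # hyperfine emits a "results" array
    times = result["times"]  # per-run wall-clock seconds
    return {
        "target": target,                    # assumed raw-schema field names
        "engine": "hyperfine",
        "runs": len(times),
        "median_seconds": statistics.median(times),
        "mean_seconds": result["mean"],
        "stddev_seconds": result["stddev"],
        "min_seconds": result["min"],
        "max_seconds": result["max"],
    }

if __name__ == "__main__":
    json.dump(normalize(sys.argv[1], sys.argv[2]), sys.stdout, indent=2)
```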
### 4.2. What Remains Custom
1. Framework matrix orchestration and target routing.
2. Parity-first skip behavior and skip reason recording.
3. Artifact shaping for `results/latest/raw/*.json` and report pipeline compatibility.

### 4.3. What Moves to OSS
1. Run scheduling/statistical sampling mechanics -> `hyperfine`.
2. Statistical comparison/significance logic -> `benchstat`.

## 5. Policy Design

### 5.1. `stats-policy.yaml` (single source of truth)
Policy fields:
- significance (`alpha`), default `0.05`
- minimum run count per target
- regression thresholds by metric (percent-based)
- required metrics (must exist in normalized artifact)
- skip handling rules (`skipped` targets do not fail the run by themselves)
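An illustrative policy sketch (editor's example; the actual `stats-policy.yaml` is not shown in this diff view, so every key name below is an assumption — only the field categories come from the list above):

```yaml
# Hypothetical sketch of stats-policy.yaml; key names are assumptions.
alpha: 0.05                  # significance level for benchstat comparisons
min_runs_per_target: 3
regression_thresholds:       # allowed regression vs baseline, percent by metric
  median_seconds: 5.0
  mean_seconds: 5.0
required_metrics:
  - median_seconds
  - mean_seconds
  - stddev_seconds
skip_handling:
  skipped_targets_fail_run: false  # skipped targets do not fail the run alone
baseline_framework: baseline
```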
### 5.2. Policy Enforcement Rules
1. No pass/fail decision without a policy file.
2. Local and CI commands must use the same policy file.
3. Violations must emit actionable diagnostics per framework/metric.
4. Quality summary output is mandatory even when all targets are skipped.

## 6. Artifact Contract

### 6.1. Stable Artifacts (must remain)
- `results/latest/raw/*.json`
- `results/latest/summary.json`
- `results/latest/report.md`
- `results/latest/benchmark-quality-summary.json`

### 6.2. Optional Tool-Native Artifacts
- `results/latest/tooling/hyperfine/*.json`
- `results/latest/tooling/benchstat/*.txt`
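A hypothetical shape for `benchmark-quality-summary.json` (editor's example; the commit does not show the schema, so all field names are assumptions — the `mode` marker is suggested by the rollback strategy in section 12 and the skip semantics by section 4.2):

```json
{
  "mode": "oss",
  "policy": "stats-policy.yaml",
  "targets": {
    "modkit": { "status": "pass", "violations": [] },
    "nestjs": { "status": "skipped", "reason": "parity-failed" }
  },
  "overall": "pass"
}
```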
## 7. CI Design
CI keeps `make ci-benchmark-quality-check` as the primary gate and:
1. runs the benchmark pipeline,
2. runs the quality policy check,
3. uploads the benchmark quality summary and tool-native outputs as artifacts,
4. fails only on policy violations (not on expected parity/health skips).

## 8. Migration Plan

### Phase A: Policy + Interfaces
1. Add `stats-policy.yaml`.
2. Define the normalized-schema compatibility contract.
3. Add adapter interfaces without changing the default execution path.

### Phase B: OSS Integration in Parallel
1. Add the `BENCH_ENGINE=hyperfine` execution path.
2. Normalize tool output to the existing raw schema.
3. Add the `benchstat` gate in report-only mode.

### Phase C: Gate Cutover
1. Switch `ci-benchmark-quality-check` to policy-enforcing mode.
2. Keep artifact outputs and names stable.

### Phase D: Cleanup
1. Remove superseded custom variance/outlier math.
2. Retain only thin orchestration and normalization glue.

## 9. Risks and Mitigations
1. **Risk:** command-level timing differs from request-loop timing.
   **Mitigation:** add a benchmark runner wrapper so each invocation is semantically consistent (see the sketch after this section).
2. **Risk:** an overly strict policy causes unstable CI.
   **Mitigation:** ramp from report-only to enforced mode after a calibration window.
3. **Risk:** artifact drift breaks report tooling.
   **Mitigation:** keep the normalized schema contract stable and versioned.
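A sketch of the wrapper idea from risk 1 (editor's example; `scripts/run-one.sh` is a hypothetical name — the point is that each timed command executes one fixed-size request loop, so hyperfine's command-level timing matches the legacy request-loop semantics):

```bash
# Time one full, fixed-size request loop per run; runs-per-target matches
# the documented profile (3 runs, median reported).
hyperfine \
  --warmup 1 \
  --runs 3 \
  --export-json results/latest/tooling/hyperfine/modkit.json \
  'bash scripts/run-one.sh modkit'  # hypothetical per-target runner script
```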
## 10. Verification Plan
Required local verification before merge:
```bash
go test ./... -coverprofile=coverage.out -covermode=atomic
make benchmark
make report
make ci-benchmark-quality-check
```

Quality acceptance:
1. the policy file is loaded and applied in local and CI runs,
2. the quality summary artifact is generated on every run,
3. parity-first skip semantics remain unchanged,
4. report generation remains deterministic from normalized artifacts.

## 11. Documentation Update Plan
The implementation for this design must include synchronized doc updates:
1. `METHODOLOGY.md` - replace the custom-stat narrative with the OSS toolchain and policy model.
2. `docs/guides/benchmark-workflow.md` - add required tools, execution flow, and artifact references.
3. `docs/architecture.md` - update performance plane and quality gate stage boundaries.
4. `README.md` - refresh quickstart/validation commands and tool prerequisites.
5. `docs/design/003-benchmark-statistics-oss-migration.md` - keep as the design source of truth.

## 12. Rollback Strategy
If the OSS migration introduces instability:
1. toggle back to the legacy engine path,
2. keep parity and artifact generation operational,
3. continue emitting the quality summary with an explicit `mode: legacy` marker,
4. re-enable the OSS path after threshold recalibration.

docs/guides/benchmark-workflow.md

Lines changed: 22 additions & 0 deletions
@@ -4,6 +4,9 @@

 - targets available locally or via Docker Compose
 - parity contract fixtures up to date
+- benchmark quality tools installed locally:
+  - `hyperfine` (for `BENCH_ENGINE=hyperfine`)
+  - `benchstat` (`go install golang.org/x/perf/cmd/benchstat@latest`)

 ## Standard run

@@ -21,6 +24,12 @@ make benchmark-nestjs

 Per-target runs also emit `results/latest/environment.fingerprint.json` and `results/latest/environment.manifest.json`.

+Optional OSS measurement engine:
+
+```bash
+BENCH_ENGINE=hyperfine make benchmark
+```
+
 ## Docker resource limits

 Framework services use shared default limits from `docker-compose.yml`:

@@ -45,6 +54,19 @@ Benchmark scripts must run parity first for each target. If parity fails, skip b

 - `results/latest/environment.manifest.json` - timestamped runner metadata and result index
 - `results/latest/summary.json` - normalized summary
 - `results/latest/report.md` - markdown report
+- `results/latest/benchmark-quality-summary.json` - policy quality gate output
+- `results/latest/tooling/benchstat/*.txt` - benchstat comparison outputs
+
+## Quality checks
+
+```bash
+make benchmark-stats-check
+make benchmark-variance-check
+make benchmark-benchstat-check
+make ci-benchmark-quality-check
+```
+
+Quality thresholds and required metrics are versioned in `stats-policy.yaml`.

 ## Reproducibility notes

