Merged
13 changes: 13 additions & 0 deletions .github/workflows/ci.yml
Original file line number Diff line number Diff line change
@@ -40,7 +40,20 @@ jobs:
- uses: actions/setup-go@v5
with:
go-version-file: go.mod
- name: Install benchmark quality tools
run: |
sudo apt-get update
sudo apt-get install -y hyperfine
go install golang.org/x/perf/cmd/benchstat@latest
echo "$(go env GOPATH)/bin" >> "$GITHUB_PATH"
- name: Run benchmark script smoke
run: bash scripts/run-all.sh
- name: Generate report from raw results
run: python3 scripts/generate-report.py
- name: Run statistical quality gate
run: make ci-benchmark-quality-check
- name: Upload benchmark quality summary
uses: actions/upload-artifact@v4
with:
name: benchmark-quality-summary
path: results/latest/benchmark-quality-summary.json
14 changes: 11 additions & 3 deletions .gitignore
@@ -1,4 +1,12 @@
# Generated benchmark artifacts
results/latest/raw/*.json
results/latest/summary.json
results/latest/report.md
results/latest/**
!results/latest/.gitkeep

# Local coverage outputs
coverage.out
.coverage*
*.coverprofile

# Python cache/bytecode
**/__pycache__/
*.py[cod]
20 changes: 15 additions & 5 deletions METHODOLOGY.md
@@ -20,24 +20,34 @@
- Docker + Docker Compose for service orchestration
- Go parity runner (`cmd/parity-test`)
- shell scripts in `scripts/` for orchestration
- Python 3 report generator (`scripts/generate-report.py`)
- `hyperfine` benchmark engine (optional via `BENCH_ENGINE=hyperfine`)
- `benchstat` statistical comparison for quality gates
- policy file: `stats-policy.yaml`
- Python 3 report and normalization tooling in `scripts/`

## Baseline benchmark profile

- warmup requests: 1000
- request threads: 12
- concurrent connections: 400
- run duration: 30s
- warmup requests: 100 (legacy engine path)
- benchmark requests per run: 300
- runs per target: 3 (median reported)

## Quality policy

- thresholds and required metrics are defined in `stats-policy.yaml`
- `make ci-benchmark-quality-check` enforces policy locally and in CI
- benchstat comparisons are evaluated against the policy's baseline framework (`baseline` by default)

## Reporting

- raw run outputs: `results/latest/raw/`
- normalized summary: `results/latest/summary.json`
- markdown report: `results/latest/report.md`
- quality summary: `results/latest/benchmark-quality-summary.json`
- optional tool artifacts: `results/latest/tooling/benchstat/*.txt`

## Interpretation guidance

- treat parity failures as correctness blockers, not performance regressions
- compare medians first, then inspect distribution variance
- use benchstat deltas and policy thresholds for pass/fail interpretation
- annotate environment drift (host type, CPU, memory, Docker version) in report notes
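
The medians-first guidance above can be sketched with Python's stdlib `statistics` module. The sample values below are illustrative, not real benchmark output:

```python
import statistics

# Hypothetical per-run latency medians (ms) for a baseline and a candidate.
baseline_runs = [12.1, 12.4, 12.2]
candidate_runs = [12.9, 13.1, 13.0]

# Compare medians first...
base_median = statistics.median(baseline_runs)
cand_median = statistics.median(candidate_runs)
delta_pct = (cand_median - base_median) / base_median * 100

# ...then inspect distribution variance before judging the delta.
base_spread = statistics.stdev(baseline_runs)
cand_spread = statistics.stdev(candidate_runs)

print(f"median delta: {delta_pct:+.1f}%")
print(f"spread: baseline={base_spread:.2f} candidate={cand_spread:.2f}")
```

A large median delta with low spread on both sides is a real signal; a similar delta with high spread warrants more runs before applying policy thresholds.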
14 changes: 13 additions & 1 deletion Makefile
@@ -1,4 +1,4 @@
.PHONY: benchmark benchmark-modkit benchmark-nestjs benchmark-baseline benchmark-wire benchmark-fx benchmark-do report test parity-check parity-check-modkit parity-check-nestjs benchmark-fingerprint-check benchmark-limits-check benchmark-manifest-check
.PHONY: benchmark benchmark-modkit benchmark-nestjs benchmark-baseline benchmark-wire benchmark-fx benchmark-do report test parity-check parity-check-modkit parity-check-nestjs benchmark-fingerprint-check benchmark-limits-check benchmark-manifest-check benchmark-stats-check benchmark-variance-check benchmark-benchstat-check ci-benchmark-quality-check

benchmark:
bash scripts/run-all.sh
@@ -44,3 +44,15 @@ benchmark-limits-check:

benchmark-manifest-check:
python3 scripts/environment-manifest.py check-manifest --file results/latest/environment.manifest.json

benchmark-stats-check:
python3 scripts/benchmark-quality-check.py stats-check

benchmark-variance-check:
python3 scripts/benchmark-quality-check.py variance-check

benchmark-benchstat-check:
python3 scripts/benchmark-quality-check.py benchstat-check

ci-benchmark-quality-check:
python3 scripts/benchmark-quality-check.py ci-check
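
The four Makefile targets above all dispatch into one script. The internals of `benchmark-quality-check.py` are not shown in this diff; a minimal sketch of how such a subcommand dispatch could be structured (all function bodies below are placeholder assumptions):

```python
import argparse

CHECKS = {}

def check(name):
    """Register a named quality-check subcommand."""
    def register(fn):
        CHECKS[name] = fn
        return fn
    return register

@check("stats-check")
def stats_check():
    return 0  # placeholder: verify run counts and required metrics exist

@check("variance-check")
def variance_check():
    return 0  # placeholder: compare spread against policy thresholds

@check("benchstat-check")
def benchstat_check():
    return 0  # placeholder: evaluate benchstat deltas against policy

@check("ci-check")
def ci_check():
    # ci-check composes the individual gates; nonzero means a violation.
    return max(fn() for name, fn in CHECKS.items() if name != "ci-check")

def main(argv):
    parser = argparse.ArgumentParser(prog="benchmark-quality-check.py")
    parser.add_argument("command", choices=sorted(CHECKS))
    args = parser.parse_args(argv)
    return CHECKS[args.command]()

exit_code = main(["ci-check"])
print(f"ci-check exit code: {exit_code}")
```

Keeping one registry means the Makefile targets stay thin wrappers and new gates only need a decorator.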
14 changes: 14 additions & 0 deletions README.md
@@ -21,8 +21,22 @@ Run benchmark orchestration and generate a report:
```bash
make benchmark
make report
make ci-benchmark-quality-check
```

Use the OSS measurement engine (optional):

```bash
BENCH_ENGINE=hyperfine make benchmark
```

## Tooling prerequisites

- Go (for `go test` and `benchstat`)
- Python 3
- hyperfine (optional benchmark engine)
- benchstat (`go install golang.org/x/perf/cmd/benchstat@latest`)

## Repository layout

```text
10 changes: 6 additions & 4 deletions docs/architecture.md
@@ -36,13 +36,15 @@ results/latest/ benchmark outputs and generated report

1. Launch target services
2. Run parity checks per target
3. Run load benchmarks for parity-passing targets
4. Save raw outputs
5. Build `summary.json`
6. Generate `report.md`
3. Run load benchmarks for parity-passing targets (`legacy` engine or `hyperfine`)
4. Normalize and save raw outputs
5. Run policy quality gates (`stats-policy.yaml` + benchstat)
6. Build `summary.json`
7. Generate `report.md`

## Failure model

- parity failures do not stop fixture file iteration; they aggregate and fail at the end
- benchmark runs should short-circuit per target if parity fails
- report generation should tolerate partial target results and mark skipped targets
- quality gate summary should always be emitted, including all-skipped smoke runs
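
The aggregate-then-fail behavior in the first bullet can be sketched as follows (a hypothetical structure, not the repository's actual parity runner):

```python
def run_parity_suite(fixtures, check):
    """Run every fixture, collect failures, and fail once at the end."""
    failures = []
    for fixture in fixtures:
        ok, reason = check(fixture)
        if not ok:
            # Record and continue so one bad fixture cannot mask the rest.
            failures.append((fixture, reason))
    return failures

fixtures = ["users.json", "orders.json", "health.json"]
# Hypothetical checker: pretend orders.json has a contract mismatch.
results = {
    "users.json": (True, ""),
    "orders.json": (False, "field mismatch"),
    "health.json": (True, ""),
}
failures = run_parity_suite(fixtures, results.get)
if failures:
    print(f"{len(failures)} parity failure(s):")
    for fixture, reason in failures:
        print(f"  {fixture}: {reason}")
```

The same shape supports the other bullets: a caller can short-circuit benchmarks per target when `failures` is non-empty, while still emitting a summary for skipped targets.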
138 changes: 138 additions & 0 deletions docs/design/003-benchmark-statistics-oss-migration.md
@@ -0,0 +1,138 @@
# Design 003: OSS-Based Benchmark Statistics and CI Quality Gates

## 1. Goal
Replace custom statistical processing logic with OSS benchmark/statistics tooling while preserving repository-specific benchmark orchestration, parity-first gating, and artifact contracts.

**Why:**
- OSS tools reduce maintenance burden and improve methodological confidence.
- Statistical quality rules should be explicit, versioned, and enforced consistently in local runs and CI.
- Existing result/report paths should stay stable to avoid breaking downstream workflows.

## 2. Scope
In:
1. Integrate `hyperfine` as the benchmark measurement engine.
2. Integrate `benchstat` for statistical pass/fail checks.
3. Add a versioned policy file (`stats-policy.yaml`) for thresholds and rules.
4. Keep `results/latest/*` artifacts stable for summary/report consumers.
5. Update docs to reflect the new measurement and quality gate model.

Out:
1. Replacing parity contract behavior or matcher semantics.
2. Dashboard or visualization redesign.
3. Framework-specific performance tuning unrelated to methodology.

## 3. Non-Goals
1. Rewriting the benchmark orchestrator from scratch.
2. Removing all custom scripts (thin adapters/orchestration remain expected).
3. Changing issue-driven CI policy outside benchmark quality gates.

## 4. Architecture

### 4.1. Target Flow
1. **Parity Gate (existing behavior):** health check -> parity check per target.
2. **Measurement:** run benchmark samples using `hyperfine`.
3. **Normalization:** transform tool-native output into repo raw schema.
4. **Quality Gate:** run `benchstat`-based policy checks.
5. **Reporting:** generate `summary.json` and `report.md` from normalized artifacts.

### 4.2. What Remains Custom
1. Framework matrix orchestration and target routing.
2. Parity-first skip behavior and skip reason recording.
3. Artifact shaping for `results/latest/raw/*.json` and report pipeline compatibility.

### 4.3. What Moves to OSS
1. Run scheduling/statistical sampling mechanics -> `hyperfine`.
2. Statistical comparison/significance logic -> `benchstat`.

## 5. Policy Design

### 5.1. `stats-policy.yaml` (single source of truth)
Policy fields:
- significance (`alpha`), default `0.05`
- minimum run count per target
- regression thresholds by metric (percent-based)
- required metrics (must exist in normalized artifact)
- skip handling rules (`skipped` targets do not fail run by themselves)
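
A policy file matching these fields might look like the following. The key names are illustrative assumptions; the real schema is defined by `stats-policy.yaml` itself:

```yaml
# Illustrative shape only; actual keys may differ.
alpha: 0.05
min_runs_per_target: 3
regression_thresholds:
  latency_p50_ms: 5.0     # max allowed regression, percent
  throughput_rps: -3.0    # max allowed drop, percent
required_metrics:
  - latency_p50_ms
  - throughput_rps
skip_handling:
  skipped_targets_fail_run: false
```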

### 5.2. Policy Enforcement Rules
1. No pass/fail decision without policy file.
2. Local and CI commands must use the same policy file.
3. Violations must emit actionable diagnostics per framework/metric.
4. Quality summary output is mandatory even when all targets are skipped.

## 6. Artifact Contract

### 6.1. Stable Artifacts (must remain)
- `results/latest/raw/*.json`
- `results/latest/summary.json`
- `results/latest/report.md`
- `results/latest/benchmark-quality-summary.json`

### 6.2. Optional Tool-Native Artifacts
- `results/latest/tooling/hyperfine/*.json`
- `results/latest/tooling/benchstat/*.txt`

## 7. CI Design
CI keeps `make ci-benchmark-quality-check` as the primary gate and:
1. runs benchmark pipeline,
2. runs quality policy check,
3. uploads benchmark quality summary and tool-native outputs as artifacts,
4. fails only on policy violations (not on expected parity/health skips).

## 8. Migration Plan

### Phase A: Policy + Interfaces
1. Add `stats-policy.yaml`.
2. Define normalized schema compatibility contract.
3. Add adapter interfaces without changing default execution path.

### Phase B: OSS Integration in Parallel
1. Add `BENCH_ENGINE=hyperfine` execution path.
2. Normalize tool output to existing raw schema.
3. Add `benchstat` gate in report-only mode.

### Phase C: Gate Cutover
1. Switch `ci-benchmark-quality-check` to policy-enforcing mode.
2. Keep artifact outputs and names stable.

### Phase D: Cleanup
1. Remove superseded custom variance/outlier math.
2. Retain thin orchestration and normalization glue only.

## 9. Risks and Mitigations
1. **Risk:** command-level timing differs from request-loop timing.
**Mitigation:** add a benchmark runner wrapper so each invocation is semantically consistent.
2. **Risk:** policy too strict causes unstable CI.
**Mitigation:** ramp from report-only to enforced mode after calibration window.
3. **Risk:** artifact drift breaks report tooling.
**Mitigation:** keep normalized schema contract stable and versioned.
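
The wrapper mitigation for risk 1 can be sketched as a fixed-size request loop per process invocation, so that `hyperfine` (which times whole commands) measures something comparable to the legacy per-request timing. The workload below is a stand-in, not a real HTTP client:

```python
import time

def request_loop(n, do_request):
    """One benchmark invocation: a fixed-size loop of requests.

    hyperfine times the entire process, so keeping n constant makes every
    invocation semantically consistent across engines and targets.
    """
    for _ in range(n):
        do_request()

start = time.perf_counter()
request_loop(300, lambda: None)  # stand-in for an HTTP call to the target
elapsed = time.perf_counter() - start
print(f"300 simulated requests in {elapsed:.4f}s")
```

With this shape, command-level timing divided by `n` approximates mean per-request latency, keeping the two engines' numbers on the same axis.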

## 10. Verification Plan
Required local verification before merge:
```bash
go test ./... -coverprofile=coverage.out -covermode=atomic
make benchmark
make report
make ci-benchmark-quality-check
```

Quality acceptance:
1. policy file is loaded and applied in local + CI runs,
2. quality summary artifact is generated every run,
3. parity-first skip semantics remain unchanged,
4. report generation remains deterministic from normalized artifacts.

## 11. Documentation Update Plan
The implementation for this design must include synchronized doc updates:
1. `METHODOLOGY.md` - replace custom-stat narrative with OSS toolchain and policy model.
2. `docs/guides/benchmark-workflow.md` - add required tools, execution flow, and artifact references.
3. `docs/architecture.md` - update performance plane and quality gate stage boundaries.
4. `README.md` - refresh quickstart/validation commands and tool prerequisites.
5. `docs/design/003-benchmark-statistics-oss-migration.md` - keep as design source of truth.

## 12. Rollback Strategy
If OSS migration introduces instability:
1. toggle back to legacy engine path,
2. keep parity and artifact generation operational,
3. continue emitting quality summary with explicit `mode: legacy` marker,
4. re-enable OSS path after threshold recalibration.
22 changes: 22 additions & 0 deletions docs/guides/benchmark-workflow.md
@@ -4,6 +4,9 @@

- targets available locally or via Docker Compose
- parity contract fixtures up to date
- benchmark quality tools installed locally:
- `hyperfine` (for `BENCH_ENGINE=hyperfine`)
- `benchstat` (`go install golang.org/x/perf/cmd/benchstat@latest`)

## Standard run

@@ -21,6 +24,12 @@ make benchmark-nestjs

Per-target runs also emit `results/latest/environment.fingerprint.json` and `results/latest/environment.manifest.json`.

Optional OSS measurement engine:

```bash
BENCH_ENGINE=hyperfine make benchmark
```

## Docker resource limits

Framework services use shared default limits from `docker-compose.yml`:
@@ -45,6 +54,19 @@ Benchmark scripts must run parity first for each target. If parity fails, skip b
- `results/latest/environment.manifest.json` - timestamped runner metadata and result index
- `results/latest/summary.json` - normalized summary
- `results/latest/report.md` - markdown report
- `results/latest/benchmark-quality-summary.json` - policy quality gate output
- `results/latest/tooling/benchstat/*.txt` - benchstat comparison outputs

## Quality checks

```bash
make benchmark-stats-check
make benchmark-variance-check
make benchmark-benchstat-check
make ci-benchmark-quality-check
```

Quality thresholds and required metrics are versioned in `stats-policy.yaml`.
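
After a quality check run, `results/latest/benchmark-quality-summary.json` can be inspected programmatically. A minimal sketch — the summary's exact keys are an assumption, and the literal below stands in for reading the real file:

```python
import json

# Hypothetical benchmark-quality-summary.json content; real keys may differ.
summary_text = """
{
  "mode": "hyperfine",
  "targets": {
    "modkit": {"status": "pass"},
    "nestjs": {"status": "skipped", "reason": "parity failure"}
  }
}
"""
summary = json.loads(summary_text)

# Skipped targets do not fail the run by themselves; only explicit failures do.
failed = [t for t, r in summary["targets"].items() if r["status"] == "fail"]
skipped = [t for t, r in summary["targets"].items() if r["status"] == "skipped"]
print(f"failed={failed} skipped={skipped}")
```

This mirrors the policy's skip-handling rule: a run with only `skipped` targets still emits a summary and exits clean.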

## Reproducibility notes
