| 1 | +# Design 003: OSS-Based Benchmark Statistics and CI Quality Gates |
| 2 | + |
| 3 | +## 1. Goal |
| 4 | +Replace custom statistical processing logic with OSS benchmark/statistics tooling while preserving repository-specific benchmark orchestration, parity-first gating, and artifact contracts. |
| 5 | + |
| 6 | +**Why:** |
| 7 | +- OSS tools reduce maintenance burden and improve methodological confidence. |
| 8 | +- Statistical quality rules should be explicit, versioned, and enforced consistently in local runs and CI. |
| 9 | +- Existing result/report paths should stay stable to avoid breaking downstream workflows. |
| 10 | + |
| 11 | +## 2. Scope |
| 12 | +In: |
| 13 | +1. Integrate `hyperfine` as the benchmark measurement engine. |
| 14 | +2. Integrate `benchstat` for statistical pass/fail checks. |
| 15 | +3. Add a versioned policy file (`stats-policy.yaml`) for thresholds and rules. |
| 16 | +4. Keep `results/latest/*` artifacts stable for summary/report consumers. |
| 17 | +5. Update docs to reflect the new measurement and quality gate model. |
| 18 | + |
| 19 | +Out: |
| 20 | +1. Replacing parity contract behavior or matcher semantics. |
| 21 | +2. Dashboard or visualization redesign. |
| 22 | +3. Framework-specific performance tuning unrelated to methodology. |
| 23 | + |
| 24 | +## 3. Non-Goals |
| 25 | +1. Rewriting the benchmark orchestrator from scratch. |
| 26 | +2. Removing all custom scripts (thin adapters/orchestration remain expected). |
| 27 | +3. Changing issue-driven CI policy outside benchmark quality gates. |
| 28 | + |
| 29 | +## 4. Architecture |
| 30 | + |
| 31 | +### 4.1. Target Flow |
| 32 | +1. **Parity Gate (existing behavior):** health check -> parity check per target. |
| 33 | +2. **Measurement:** run benchmark samples using `hyperfine`. |
| 34 | +3. **Normalization:** transform tool-native output into repo raw schema. |
| 35 | +4. **Quality Gate:** run `benchstat`-based policy checks. |
| 36 | +5. **Reporting:** generate `summary.json` and `report.md` from normalized artifacts. |
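
The measurement step (step 2 above) could be driven by a thin adapter that only assembles the `hyperfine` command line. The following is a minimal sketch, not the repository's actual adapter: the function name, the default run and warmup counts, and the `./bench-runner --target gin` command are hypothetical. The `--warmup`, `--runs`, and `--export-json` flags are standard `hyperfine` options.

```python
import shlex


def build_hyperfine_argv(target_cmd, export_path, runs=20, warmup=3):
    """Return the argv list for one hyperfine measurement of a target.

    target_cmd is a list of argument strings; hyperfine expects the
    benchmarked command as a single shell string, so it is joined here.
    """
    if runs < 2:
        # A single sample gives no spread estimate, so this sketch rejects it.
        raise ValueError("runs must be >= 2")
    return [
        "hyperfine",
        "--warmup", str(warmup),            # untimed warmup invocations
        "--runs", str(runs),                # exact number of timed runs
        "--export-json", str(export_path),  # tool-native artifact
        shlex.join(target_cmd),
    ]


if __name__ == "__main__":
    # Hypothetical target command and export path, for illustration only.
    argv = build_hyperfine_argv(
        ["./bench-runner", "--target", "gin"],
        "results/latest/tooling/hyperfine/gin.json",
    )
    print(argv)
```

An orchestrator would pass the returned list to `subprocess.run(argv, check=True)` only after the parity gate has passed for that target.
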
| 37 | + |
| 38 | +### 4.2. What Remains Custom |
| 39 | +1. Framework matrix orchestration and target routing. |
| 40 | +2. Parity-first skip behavior and skip reason recording. |
| 41 | +3. Artifact shaping for `results/latest/raw/*.json` and report pipeline compatibility. |
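
Artifact shaping can be illustrated with a small normalizer that maps a `hyperfine` JSON export onto a raw record. This is a sketch under assumptions: the output field names (`framework`, `status`, `samples`, `skip_reason`, and so on) are hypothetical stand-ins for the real raw schema, which this design does not spell out. The input keys (`results`, `times`, `mean`, `stddev`, `min`, `max`) are the ones `hyperfine --export-json` writes.

```python
def normalize_hyperfine(export, framework):
    """Map one hyperfine JSON export to a hypothetical repo raw record."""
    result = export["results"][0]  # this sketch assumes one command per export
    return {
        "framework": framework,
        "engine": "hyperfine",
        "status": "ok",
        "unit": "s",  # hyperfine reports wall-clock times in seconds
        "samples": list(result["times"]),
        "mean": result["mean"],
        "stddev": result["stddev"],
        "min": result["min"],
        "max": result["max"],
    }


def skipped_record(framework, reason):
    """Record a parity/health skip so the skip reason is preserved."""
    return {"framework": framework, "status": "skipped", "skip_reason": reason}


if __name__ == "__main__":
    # Hand-written sample export, not real measurement data.
    sample = {"results": [{"command": "demo", "mean": 0.2, "stddev": 0.01,
                           "min": 0.19, "max": 0.21,
                           "times": [0.19, 0.2, 0.21]}]}
    print(normalize_hyperfine(sample, "gin"))
    print(skipped_record("echo", "parity_failed"))
```
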
| 42 | + |
| 43 | +### 4.3. What Moves to OSS |
| 44 | +1. Run scheduling/statistical sampling mechanics -> `hyperfine`. |
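| 45 | +2. Statistical comparison/significance logic -> `benchstat` (which reads Go benchmark-format text, so normalized samples must be rendered in that format before comparison). |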
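
Because `benchstat` consumes Go benchmark-format text rather than JSON, handing samples to it needs a small rendering step. The sketch below shows one possible shape; the function name and the choice to emit one line per sample with an iteration count of 1 are assumptions for illustration.

```python
def to_benchstat_lines(name, samples_s):
    """Render wall-clock samples (in seconds) as Go benchmark-format lines.

    Each line has the form "Benchmark<Name> <iterations> <value> ns/op".
    """
    bench = "Benchmark" + name[:1].upper() + name[1:]
    # One line per sample; repeated lines for the same benchmark name are
    # read by benchstat as separate samples of that benchmark.
    return [f"{bench} 1 {round(s * 1e9)} ns/op" for s in samples_s]


if __name__ == "__main__":
    # Hypothetical framework name and hand-written sample values.
    for line in to_benchstat_lines("gin", [0.25, 0.5]):
        print(line)
```

Writing the baseline and candidate lines to two text files and running `benchstat baseline.txt candidate.txt` would then produce the comparison; the file names here are placeholders.
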
| 46 | + |
| 47 | +## 5. Policy Design |
| 48 | + |
| 49 | +### 5.1. `stats-policy.yaml` (single source of truth) |
| 50 | +Policy fields: |
| 51 | +- significance (`alpha`), default `0.05` |
| 52 | +- minimum run count per target |
| 53 | +- regression thresholds by metric (percent-based) |
| 54 | +- required metrics (must exist in normalized artifact) |
| 55 | +- skip handling rules (`skipped` targets do not fail the run by themselves) |
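
The policy fields above could map onto a small validated structure once `stats-policy.yaml` has been parsed into a mapping. This is a sketch only: the key names (`min_runs`, `regression_thresholds_pct`, `required_metrics`, `skipped_targets_fail_run`) and the validation bounds are assumptions, since the design lists the fields but not their spelling. Only `alpha` and its `0.05` default come from the text.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class StatsPolicy:
    alpha: float = 0.05  # significance level, default from the design
    min_runs: int = 10   # assumed default, not specified by the design
    regression_thresholds_pct: dict = field(default_factory=dict)
    required_metrics: tuple = ()
    skipped_targets_fail_run: bool = False


def policy_from_mapping(raw):
    """Build and validate a policy from an already-parsed YAML mapping."""
    policy = StatsPolicy(
        alpha=float(raw.get("alpha", 0.05)),
        min_runs=int(raw["min_runs"]),  # required key in this sketch
        regression_thresholds_pct=dict(raw.get("regression_thresholds_pct", {})),
        required_metrics=tuple(raw.get("required_metrics", ())),
        skipped_targets_fail_run=bool(raw.get("skipped_targets_fail_run", False)),
    )
    if not 0.0 < policy.alpha < 1.0:
        raise ValueError("alpha must be in (0, 1)")
    if policy.min_runs < 2:
        raise ValueError("min_runs must be >= 2")
    for metric, pct in policy.regression_thresholds_pct.items():
        if pct <= 0:
            raise ValueError(f"threshold for {metric} must be positive")
    return policy


if __name__ == "__main__":
    # Hypothetical policy content, for illustration only.
    print(policy_from_mapping({
        "min_runs": 20,
        "regression_thresholds_pct": {"mean": 10.0},
        "required_metrics": ["mean", "stddev"],
    }))
```

Rejecting a malformed policy at load time keeps a bad threshold from silently turning into a permissive gate.
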
| 56 | + |
| 57 | +### 5.2. Policy Enforcement Rules |
| 58 | +1. No pass/fail decision is made without a policy file. |
| 59 | +2. Local and CI commands must use the same policy file. |
| 60 | +3. Violations must emit actionable diagnostics per framework/metric. |
| 61 | +4. Quality summary output is mandatory even when all targets are skipped. |
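
The rules above can be sketched as a single evaluation function. The version below covers only the percent-threshold and run-count checks (significance testing is left to `benchstat`), treats higher values as worse, and uses hypothetical record and policy keys. It is an illustration of the rules, not the gate's actual implementation.

```python
def evaluate(records, baselines, policy):
    """Apply percent-threshold rules and always return a quality summary."""
    if policy is None:
        # Rule 1: never decide pass/fail without a policy.
        raise RuntimeError("stats policy not loaded; refusing to gate")
    violations, checked, skipped = [], 0, 0
    for rec in records:
        name = rec["framework"]
        if rec.get("status") == "skipped":
            skipped += 1  # skips are counted but are not violations
            continue
        checked += 1
        runs = len(rec.get("samples", []))
        if runs < policy["min_runs"]:
            violations.append(
                f"{name}: {runs} runs, policy requires {policy['min_runs']}"
            )
        for metric, limit_pct in policy["regression_thresholds_pct"].items():
            if metric not in rec:
                violations.append(f"{name}/{metric}: metric missing from record")
                continue
            base = baselines.get(name, {}).get(metric)
            if not base:
                continue  # no baseline to compare against
            delta_pct = (rec[metric] - base) / base * 100.0
            if delta_pct > limit_pct:
                # Rule 3: diagnostics name the framework and the metric.
                violations.append(
                    f"{name}/{metric}: +{delta_pct:.1f}% exceeds {limit_pct}% limit"
                )
    # Rule 4: a summary is returned even when every target was skipped.
    return {"passed": not violations, "checked": checked,
            "skipped": skipped, "violations": violations}


if __name__ == "__main__":
    # Hand-written records and baseline, for illustration only.
    demo_policy = {"min_runs": 3, "regression_thresholds_pct": {"mean": 10.0}}
    demo_records = [
        {"framework": "gin", "status": "ok",
         "samples": [0.25, 0.25, 0.25], "mean": 0.25},
        {"framework": "echo", "status": "skipped"},
    ]
    print(evaluate(demo_records, {"gin": {"mean": 0.2}}, demo_policy))
```
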
| 62 | + |
| 63 | +## 6. Artifact Contract |
| 64 | + |
| 65 | +### 6.1. Stable Artifacts (must remain) |
| 66 | +- `results/latest/raw/*.json` |
| 67 | +- `results/latest/summary.json` |
| 68 | +- `results/latest/report.md` |
| 69 | +- `results/latest/benchmark-quality-summary.json` |
| 70 | + |
| 71 | +### 6.2. Optional Tool-Native Artifacts |
| 72 | +- `results/latest/tooling/hyperfine/*.json` |
| 73 | +- `results/latest/tooling/benchstat/*.txt` |
| 74 | + |
| 75 | +## 7. CI Design |
| 76 | +CI keeps `make ci-benchmark-quality-check` as the primary gate and: |
| 77 | +1. runs benchmark pipeline, |
| 78 | +2. runs quality policy check, |
| 79 | +3. uploads benchmark quality summary and tool-native outputs as artifacts, |
| 80 | +4. fails only on policy violations (not on expected parity/health skips). |
| 81 | + |
| 82 | +## 8. Migration Plan |
| 83 | + |
| 84 | +### Phase A: Policy + Interfaces |
| 85 | +1. Add `stats-policy.yaml`. |
| 86 | +2. Define normalized schema compatibility contract. |
| 87 | +3. Add adapter interfaces without changing default execution path. |
| 88 | + |
| 89 | +### Phase B: OSS Integration in Parallel |
| 90 | +1. Add `BENCH_ENGINE=hyperfine` execution path. |
| 91 | +2. Normalize tool output to existing raw schema. |
| 92 | +3. Add `benchstat` gate in report-only mode. |
| 93 | + |
| 94 | +### Phase C: Gate Cutover |
| 95 | +1. Switch `ci-benchmark-quality-check` to policy-enforcing mode. |
| 96 | +2. Keep artifact outputs and names stable. |
| 97 | + |
| 98 | +### Phase D: Cleanup |
| 99 | +1. Remove superseded custom variance/outlier math. |
| 100 | +2. Retain thin orchestration and normalization glue only. |
| 101 | + |
| 102 | +## 9. Risks and Mitigations |
| 103 | +1. **Risk:** `hyperfine` times whole command invocations (including process startup), which differs from request-loop timing. |
| 104 | + **Mitigation:** add a benchmark runner wrapper so each invocation is semantically consistent. |
| 105 | +2. **Risk:** an overly strict policy makes CI flaky. |
| 106 | + **Mitigation:** ramp from report-only to enforced mode after a calibration window. |
| 107 | +3. **Risk:** artifact drift breaks report tooling. |
| 108 | + **Mitigation:** keep normalized schema contract stable and versioned. |
| 109 | + |
| 110 | +## 10. Verification Plan |
| 111 | +Required local verification before merge: |
| 112 | +```bash |
| 113 | +go test ./... -coverprofile=coverage.out -covermode=atomic |
| 114 | +make benchmark |
| 115 | +make report |
| 116 | +make ci-benchmark-quality-check |
| 117 | +``` |
| 118 | + |
| 119 | +Quality acceptance: |
| 120 | +1. policy file is loaded and applied in local + CI runs, |
| 121 | +2. quality summary artifact is generated every run, |
| 122 | +3. parity-first skip semantics remain unchanged, |
| 123 | +4. report generation remains deterministic from normalized artifacts. |
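
Deterministic report generation (acceptance item 4) mostly comes down to fixing record order and key order before serializing. The sketch below shows one way to do that; the function name and the `targets` wrapper key are hypothetical.

```python
import json


def render_summary(records):
    """Serialize normalized records so equal inputs give byte-equal output."""
    # Sort targets by name so input order does not affect the output.
    ordered = sorted(records, key=lambda r: r["framework"])
    # sort_keys fixes key order inside every object.
    return json.dumps({"targets": ordered}, sort_keys=True, indent=2) + "\n"


if __name__ == "__main__":
    # Hand-written records, for illustration only.
    a = [{"framework": "gin", "mean": 0.2}, {"framework": "echo", "mean": 0.3}]
    print(render_summary(a) == render_summary(list(reversed(a))))
```

Values such as timestamps would need to come from the normalized artifacts rather than the clock for the output to stay reproducible.
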
| 124 | + |
| 125 | +## 11. Documentation Update Plan |
| 126 | +The implementation for this design must include synchronized doc updates: |
| 127 | +1. `METHODOLOGY.md` - replace custom-stat narrative with OSS toolchain and policy model. |
| 128 | +2. `docs/guides/benchmark-workflow.md` - add required tools, execution flow, and artifact references. |
| 129 | +3. `docs/architecture.md` - update performance plane and quality gate stage boundaries. |
| 130 | +4. `README.md` - refresh quickstart/validation commands and tool prerequisites. |
| 131 | +5. `docs/design/003-benchmark-statistics-oss-migration.md` - keep as design source of truth. |
| 132 | + |
| 133 | +## 12. Rollback Strategy |
| 134 | +If OSS migration introduces instability: |
| 135 | +1. toggle back to legacy engine path, |
| 136 | +2. keep parity and artifact generation operational, |
| 137 | +3. continue emitting quality summary with explicit `mode: legacy` marker, |
| 138 | +4. re-enable OSS path after threshold recalibration. |
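
The rollback toggle could be as small as an environment-variable switch. In the sketch below, `BENCH_ENGINE=hyperfine` and the `mode: legacy` marker come from this design; the `legacy` engine value, the `oss` mode value, the default, and the function names are assumptions.

```python
import os

KNOWN_ENGINES = ("hyperfine", "legacy")


def select_engine(env=None):
    """Pick the measurement engine; defaults to the legacy path if unset."""
    env = os.environ if env is None else env
    engine = env.get("BENCH_ENGINE", "legacy")
    if engine not in KNOWN_ENGINES:
        raise ValueError(f"unknown BENCH_ENGINE: {engine!r}")
    return engine


def summary_header(engine):
    """Header fields for the quality summary, with an explicit mode marker."""
    mode = "legacy" if engine == "legacy" else "oss"  # "oss" is hypothetical
    return {"mode": mode, "engine": engine}


if __name__ == "__main__":
    print(summary_header(select_engine({"BENCH_ENGINE": "hyperfine"})))
    print(summary_header(select_engine({})))
```

Keeping the marker in the summary lets report consumers tell legacy-mode runs apart from OSS-mode runs without inspecting tool-native artifacts.
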