Merged
13 changes: 13 additions & 0 deletions .github/workflows/ci.yml
Original file line number Diff line number Diff line change
@@ -40,7 +40,20 @@ jobs:
- uses: actions/setup-go@v5
with:
go-version-file: go.mod
- name: Install benchmark quality tools
run: |
sudo apt-get update
sudo apt-get install -y hyperfine
go install golang.org/x/perf/cmd/benchstat@latest
echo "$(go env GOPATH)/bin" >> "$GITHUB_PATH"
- name: Run benchmark script smoke
run: bash scripts/run-all.sh
- name: Generate report from raw results
run: python3 scripts/generate-report.py
- name: Run statistical quality gate
run: make ci-benchmark-quality-check
- name: Upload benchmark quality summary
uses: actions/upload-artifact@v4
with:
name: benchmark-quality-summary
path: results/latest/benchmark-quality-summary.json
14 changes: 11 additions & 3 deletions .gitignore
@@ -1,4 +1,12 @@
# Generated benchmark artifacts
results/latest/raw/*.json
results/latest/summary.json
results/latest/report.md
results/latest/**
!results/latest/.gitkeep

# Local coverage outputs
coverage.out
.coverage*
*.coverprofile

# Python cache/bytecode
**/__pycache__/
*.py[cod]
20 changes: 15 additions & 5 deletions METHODOLOGY.md
@@ -20,24 +20,34 @@
- Docker + Docker Compose for service orchestration
- Go parity runner (`cmd/parity-test`)
- shell scripts in `scripts/` for orchestration
- Python 3 report generator (`scripts/generate-report.py`)
- `hyperfine` benchmark engine (optional via `BENCH_ENGINE=hyperfine`)
- `benchstat` statistical comparison for quality gates
- policy file: `stats-policy.yaml`
- Python 3 report and normalization tooling in `scripts/`

## Baseline benchmark profile

- warmup requests: 1000
- request threads: 12
- concurrent connections: 400
- run duration: 30s
- warmup requests: 100 (legacy engine path)
- benchmark requests per run: 300
- runs per target: 3 (median reported)

## Quality policy

- thresholds and required metrics are defined in `stats-policy.yaml`
- `make ci-benchmark-quality-check` enforces policy locally and in CI
- benchstat comparisons are evaluated against the policy's baseline framework (`baseline` by default)

## Reporting

- raw run outputs: `results/latest/raw/`
- normalized summary: `results/latest/summary.json`
- markdown report: `results/latest/report.md`
- quality summary: `results/latest/benchmark-quality-summary.json`
- optional tool artifacts: `results/latest/tooling/benchstat/*.txt`

## Interpretation guidance

- treat parity failures as correctness blockers, not performance regressions
- compare medians first, then inspect distribution variance
- use benchstat deltas and policy thresholds for pass/fail interpretation
- annotate environment drift (host type, CPU, memory, Docker version) in report notes
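
The medians-first guidance above can be sketched with Python's stdlib `statistics` module. The sample values below are illustrative, not real benchmark output:

```python
import statistics

# Hypothetical per-run latency medians (ms) for a baseline and a candidate.
baseline_runs = [12.1, 12.4, 12.2]
candidate_runs = [12.9, 13.1, 13.0]

# Compare medians first...
base_median = statistics.median(baseline_runs)
cand_median = statistics.median(candidate_runs)
delta_pct = (cand_median - base_median) / base_median * 100

# ...then inspect distribution variance before judging the delta.
base_spread = statistics.stdev(baseline_runs)
cand_spread = statistics.stdev(candidate_runs)

print(f"median delta: {delta_pct:+.1f}%")
print(f"spread: baseline={base_spread:.2f} candidate={cand_spread:.2f}")
```

A large median delta with low spread on both sides is a real signal; a similar delta with high spread warrants more runs before applying policy thresholds.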
14 changes: 13 additions & 1 deletion Makefile
@@ -1,4 +1,4 @@
.PHONY: benchmark benchmark-modkit benchmark-nestjs benchmark-baseline benchmark-wire benchmark-fx benchmark-do report test parity-check parity-check-modkit parity-check-nestjs benchmark-fingerprint-check benchmark-limits-check benchmark-manifest-check
.PHONY: benchmark benchmark-modkit benchmark-nestjs benchmark-baseline benchmark-wire benchmark-fx benchmark-do report test parity-check parity-check-modkit parity-check-nestjs benchmark-fingerprint-check benchmark-limits-check benchmark-manifest-check benchmark-stats-check benchmark-variance-check benchmark-benchstat-check ci-benchmark-quality-check

benchmark:
bash scripts/run-all.sh
@@ -44,3 +44,15 @@ benchmark-limits-check:

benchmark-manifest-check:
python3 scripts/environment-manifest.py check-manifest --file results/latest/environment.manifest.json

benchmark-stats-check:
python3 scripts/benchmark-quality-check.py stats-check

benchmark-variance-check:
python3 scripts/benchmark-quality-check.py variance-check

benchmark-benchstat-check:
python3 scripts/benchmark-quality-check.py benchstat-check

ci-benchmark-quality-check:
python3 scripts/benchmark-quality-check.py ci-check
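
The four Makefile targets above all dispatch into one script. The internals of `benchmark-quality-check.py` are not shown in this diff; a minimal sketch of how such a subcommand dispatch could be structured (all function bodies below are placeholder assumptions):

```python
import argparse

CHECKS = {}

def check(name):
    """Register a named quality-check subcommand."""
    def register(fn):
        CHECKS[name] = fn
        return fn
    return register

@check("stats-check")
def stats_check():
    return 0  # placeholder: verify run counts and required metrics exist

@check("variance-check")
def variance_check():
    return 0  # placeholder: compare spread against policy thresholds

@check("benchstat-check")
def benchstat_check():
    return 0  # placeholder: evaluate benchstat deltas against policy

@check("ci-check")
def ci_check():
    # ci-check composes the individual gates; nonzero means a violation.
    return max(fn() for name, fn in CHECKS.items() if name != "ci-check")

def main(argv):
    parser = argparse.ArgumentParser(prog="benchmark-quality-check.py")
    parser.add_argument("command", choices=sorted(CHECKS))
    args = parser.parse_args(argv)
    return CHECKS[args.command]()

exit_code = main(["ci-check"])
print(f"ci-check exit code: {exit_code}")
```

Keeping one registry means the Makefile targets stay thin wrappers and new gates only need a decorator.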
14 changes: 14 additions & 0 deletions README.md
@@ -21,8 +21,22 @@ Run benchmark orchestration and generate a report:
```bash
make benchmark
make report
make ci-benchmark-quality-check
```

Use the OSS measurement engine (optional):

```bash
BENCH_ENGINE=hyperfine make benchmark
```

## Tooling prerequisites

- Go (for `go test` and `benchstat`)
- Python 3
- hyperfine (optional benchmark engine)
- benchstat (`go install golang.org/x/perf/cmd/benchstat@latest`)

## Repository layout

```text
10 changes: 6 additions & 4 deletions docs/architecture.md
@@ -36,13 +36,15 @@ results/latest/ benchmark outputs and generated report

1. Launch target services
2. Run parity checks per target
3. Run load benchmarks for parity-passing targets
4. Save raw outputs
5. Build `summary.json`
6. Generate `report.md`
3. Run load benchmarks for parity-passing targets (`legacy` engine or `hyperfine`)
4. Normalize and save raw outputs
5. Run policy quality gates (`stats-policy.yaml` + benchstat)
6. Build `summary.json`
7. Generate `report.md`

## Failure model

- parity failures do not stop fixture file iteration; they aggregate and fail at the end
- benchmark runs should short-circuit per target if parity fails
- report generation should tolerate partial target results and mark skipped targets
- quality gate summary should always be emitted, including all-skipped smoke runs
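
The aggregate-then-fail behavior in the first bullet can be sketched as follows (a hypothetical structure, not the repository's actual parity runner):

```python
def run_parity_suite(fixtures, check):
    """Run every fixture, collect failures, and fail once at the end."""
    failures = []
    for fixture in fixtures:
        ok, reason = check(fixture)
        if not ok:
            # Record and continue so one bad fixture cannot mask the rest.
            failures.append((fixture, reason))
    return failures

fixtures = ["users.json", "orders.json", "health.json"]
# Hypothetical checker: pretend orders.json has a contract mismatch.
results = {
    "users.json": (True, ""),
    "orders.json": (False, "field mismatch"),
    "health.json": (True, ""),
}
failures = run_parity_suite(fixtures, results.get)
if failures:
    print(f"{len(failures)} parity failure(s):")
    for fixture, reason in failures:
        print(f"  {fixture}: {reason}")
```

The same shape supports the other bullets: a caller can short-circuit benchmarks per target when `failures` is non-empty, while still emitting a summary for skipped targets.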
138 changes: 138 additions & 0 deletions docs/design/003-benchmark-statistics-oss-migration.md
@@ -0,0 +1,138 @@
# Design 003: OSS-Based Benchmark Statistics and CI Quality Gates

## 1. Goal
Replace custom statistical processing logic with OSS benchmark/statistics tooling while preserving repository-specific benchmark orchestration, parity-first gating, and artifact contracts.

**Why:**
- OSS tools reduce maintenance burden and improve methodological confidence.
- Statistical quality rules should be explicit, versioned, and enforced consistently in local runs and CI.
- Existing result/report paths should stay stable to avoid breaking downstream workflows.

## 2. Scope
In:
1. Integrate `hyperfine` as the benchmark measurement engine.
2. Integrate `benchstat` for statistical pass/fail checks.
3. Add a versioned policy file (`stats-policy.yaml`) for thresholds and rules.
4. Keep `results/latest/*` artifacts stable for summary/report consumers.
5. Update docs to reflect the new measurement and quality gate model.

Out:
1. Replacing parity contract behavior or matcher semantics.
2. Dashboard or visualization redesign.
3. Framework-specific performance tuning unrelated to methodology.

## 3. Non-Goals
1. Rewriting the benchmark orchestrator from scratch.
2. Removing all custom scripts (thin adapters/orchestration remain expected).
3. Changing issue-driven CI policy outside benchmark quality gates.

## 4. Architecture

### 4.1. Target Flow
1. **Parity Gate (existing behavior):** health check -> parity check per target.
2. **Measurement:** run benchmark samples using `hyperfine`.
3. **Normalization:** transform tool-native output into repo raw schema.
4. **Quality Gate:** run `benchstat`-based policy checks.
5. **Reporting:** generate `summary.json` and `report.md` from normalized artifacts.

### 4.2. What Remains Custom
1. Framework matrix orchestration and target routing.
2. Parity-first skip behavior and skip reason recording.
3. Artifact shaping for `results/latest/raw/*.json` and report pipeline compatibility.

### 4.3. What Moves to OSS
1. Run scheduling/statistical sampling mechanics -> `hyperfine`.
2. Statistical comparison/significance logic -> `benchstat`.

## 5. Policy Design

### 5.1. `stats-policy.yaml` (single source of truth)
Policy fields:
- significance (`alpha`), default `0.05`
- minimum run count per target
- regression thresholds by metric (percent-based)
- required metrics (must exist in normalized artifact)
- skip handling rules (`skipped` targets do not fail run by themselves)
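
A policy file matching these fields might look like the following. The key names are illustrative assumptions; the real schema is defined by `stats-policy.yaml` itself:

```yaml
# Illustrative shape only; actual keys may differ.
alpha: 0.05
min_runs_per_target: 3
regression_thresholds:
  latency_p50_ms: 5.0     # max allowed regression, percent
  throughput_rps: -3.0    # max allowed drop, percent
required_metrics:
  - latency_p50_ms
  - throughput_rps
skip_handling:
  skipped_targets_fail_run: false
```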

### 5.2. Policy Enforcement Rules
1. No pass/fail decision without policy file.
2. Local and CI commands must use the same policy file.
3. Violations must emit actionable diagnostics per framework/metric.
4. Quality summary output is mandatory even when all targets are skipped.

## 6. Artifact Contract

### 6.1. Stable Artifacts (must remain)
- `results/latest/raw/*.json`
- `results/latest/summary.json`
- `results/latest/report.md`
- `results/latest/benchmark-quality-summary.json`

### 6.2. Optional Tool-Native Artifacts
- `results/latest/tooling/hyperfine/*.json`
- `results/latest/tooling/benchstat/*.txt`

## 7. CI Design
CI keeps `make ci-benchmark-quality-check` as the primary gate and:
1. runs benchmark pipeline,
2. runs quality policy check,
3. uploads benchmark quality summary and tool-native outputs as artifacts,
4. fails only on policy violations (not on expected parity/health skips).

## 8. Migration Plan

### Phase A: Policy + Interfaces
1. Add `stats-policy.yaml`.
2. Define normalized schema compatibility contract.
3. Add adapter interfaces without changing default execution path.

### Phase B: OSS Integration in Parallel
1. Add `BENCH_ENGINE=hyperfine` execution path.
2. Normalize tool output to existing raw schema.
3. Add `benchstat` gate in report-only mode.

### Phase C: Gate Cutover
1. Switch `ci-benchmark-quality-check` to policy-enforcing mode.
2. Keep artifact outputs and names stable.

### Phase D: Cleanup
1. Remove superseded custom variance/outlier math.
2. Retain thin orchestration and normalization glue only.

## 9. Risks and Mitigations
1. **Risk:** command-level timing differs from request-loop timing.
**Mitigation:** add a benchmark runner wrapper so each invocation is semantically consistent.
2. **Risk:** policy too strict causes unstable CI.
**Mitigation:** ramp from report-only to enforced mode after calibration window.
3. **Risk:** artifact drift breaks report tooling.
**Mitigation:** keep normalized schema contract stable and versioned.
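
The wrapper mitigation for risk 1 can be sketched as a fixed-size request loop per process invocation, so that `hyperfine` (which times whole commands) measures something comparable to the legacy per-request timing. The workload below is a stand-in, not a real HTTP client:

```python
import time

def request_loop(n, do_request):
    """One benchmark invocation: a fixed-size loop of requests.

    hyperfine times the entire process, so keeping n constant makes every
    invocation semantically consistent across engines and targets.
    """
    for _ in range(n):
        do_request()

start = time.perf_counter()
request_loop(300, lambda: None)  # stand-in for an HTTP call to the target
elapsed = time.perf_counter() - start
print(f"300 simulated requests in {elapsed:.4f}s")
```

With this shape, command-level timing divided by `n` approximates mean per-request latency, keeping the two engines' numbers on the same axis.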

## 10. Verification Plan
Required local verification before merge:
```bash
go test ./... -coverprofile=coverage.out -covermode=atomic
make benchmark
make report
make ci-benchmark-quality-check
```

Quality acceptance:
1. policy file is loaded and applied in local + CI runs,
2. quality summary artifact is generated every run,
3. parity-first skip semantics remain unchanged,
4. report generation remains deterministic from normalized artifacts.

## 11. Documentation Update Plan
The implementation for this design must include synchronized doc updates:
1. `METHODOLOGY.md` - replace custom-stat narrative with OSS toolchain and policy model.
2. `docs/guides/benchmark-workflow.md` - add required tools, execution flow, and artifact references.
3. `docs/architecture.md` - update performance plane and quality gate stage boundaries.
4. `README.md` - refresh quickstart/validation commands and tool prerequisites.
5. `docs/design/003-benchmark-statistics-oss-migration.md` - keep as design source of truth.

## 12. Rollback Strategy
If OSS migration introduces instability:
1. toggle back to legacy engine path,
2. keep parity and artifact generation operational,
3. continue emitting quality summary with explicit `mode: legacy` marker,
4. re-enable OSS path after threshold recalibration.
22 changes: 22 additions & 0 deletions docs/guides/benchmark-workflow.md
@@ -4,6 +4,9 @@

- targets available locally or via Docker Compose
- parity contract fixtures up to date
- benchmark quality tools installed locally:
- `hyperfine` (for `BENCH_ENGINE=hyperfine`)
- `benchstat` (`go install golang.org/x/perf/cmd/benchstat@latest`)

## Standard run

@@ -21,6 +24,12 @@ make benchmark-nestjs

Per-target runs also emit `results/latest/environment.fingerprint.json` and `results/latest/environment.manifest.json`.

Optional OSS measurement engine:

```bash
BENCH_ENGINE=hyperfine make benchmark
```

## Docker resource limits

Framework services use shared default limits from `docker-compose.yml`:
@@ -45,6 +54,19 @@ Benchmark scripts must run parity first for each target. If parity fails, skip b
- `results/latest/environment.manifest.json` - timestamped runner metadata and result index
- `results/latest/summary.json` - normalized summary
- `results/latest/report.md` - markdown report
- `results/latest/benchmark-quality-summary.json` - policy quality gate output
- `results/latest/tooling/benchstat/*.txt` - benchstat comparison outputs

## Quality checks

```bash
make benchmark-stats-check
make benchmark-variance-check
make benchmark-benchstat-check
make ci-benchmark-quality-check
```

Quality thresholds and required metrics are versioned in `stats-policy.yaml`.
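
After a quality check run, `results/latest/benchmark-quality-summary.json` can be inspected programmatically. A minimal sketch — the summary's exact keys are an assumption, and the literal below stands in for reading the real file:

```python
import json

# Hypothetical benchmark-quality-summary.json content; real keys may differ.
summary_text = """
{
  "mode": "hyperfine",
  "targets": {
    "modkit": {"status": "pass"},
    "nestjs": {"status": "skipped", "reason": "parity failure"}
  }
}
"""
summary = json.loads(summary_text)

# Skipped targets do not fail the run by themselves; only explicit failures do.
failed = [t for t, r in summary["targets"].items() if r["status"] == "fail"]
skipped = [t for t, r in summary["targets"].items() if r["status"] == "skipped"]
print(f"failed={failed} skipped={skipped}")
```

This mirrors the policy's skip-handling rule: a run with only `skipped` targets still emits a summary and exits clean.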

## Reproducibility notes
