Context
parakeet-stt-daemon/check_model.py has grown into a large mixed-responsibility file. We should defer structural refactor until current worktree efforts are complete, but capture this as planned technical debt.
Problem
check_model.py currently mixes CLI parsing, benchmark case loading, runtime execution (offline + stream-seal), metrics, gating, baseline IO, and reporting.
- The mixed responsibilities raise maintenance risk and make safe changes slower to land.
- There is currently no repo-wide check to catch other files crossing practical size/complexity thresholds.
Goal
Refactor the benchmark harness into smaller modules with no behavior regressions, and add repo-wide checks to surface oversized files early.
Scope (later, not now)
- Split check_model.py into focused modules (suggested):
  - benchmark/io.py (manifest/transcripts loading)
  - benchmark/metrics.py (WER/token/punctuation/thresholds)
  - benchmark/runtime.py (offline + stream-seal execution)
  - benchmark/reporting.py (JSON payload/baseline IO)
- Keep check_model.py as a thin CLI entrypoint.
- Add compatibility coverage so existing just eval and check_model.py CLI usage remain unchanged.
- Add broader checks to detect large/complex files across repo:
- Size threshold check (lines per file), initially warn-only.
- Optional complexity indicators (function length/cyclomatic) where practical.
- Integrate into the local quality flow (prek) and/or CI as non-blocking first.
- Document policy and thresholds in harness docs.
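To make the "thin CLI entrypoint" target concrete, here is a minimal sketch of what check_model.py could shrink to after the split. The benchmark module names follow the suggested layout above; the flags, function names, and delegation targets named in the comments are illustrative assumptions, not the current CLI surface.

```python
"""Hypothetical thin entrypoint: parsing stays here, everything else is delegated."""
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Flags shown are placeholders; the real CLI must stay unchanged per the
    # compatibility requirement above.
    parser = argparse.ArgumentParser(description="Benchmark harness entrypoint")
    parser.add_argument("--mode", choices=["offline", "stream-seal"], default="offline")
    parser.add_argument("--baseline", help="Path to baseline JSON for gating")
    return parser


def main(argv=None) -> int:
    args = build_parser().parse_args(argv)
    # Delegation targets would follow the suggested split (names illustrative):
    #   benchmark.io.load_cases, benchmark.runtime.run_offline / run_stream_seal,
    #   benchmark.metrics.score, benchmark.reporting.write_report
    print(f"mode={args.mode}")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
```

Keeping argument parsing as the entrypoint's only responsibility makes each benchmark module independently testable.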
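The warn-only size threshold check could start as small as the sketch below. The 500-line threshold and the .py-only filter are placeholder assumptions; the agreed policy values belong in the harness docs per the last bullet above.

```python
"""Warn-only repo size check: report files exceeding a line threshold."""
from pathlib import Path

MAX_LINES = 500  # illustrative threshold, not an agreed policy value


def oversized_files(root: str, max_lines: int = MAX_LINES) -> list[tuple[str, int]]:
    """Return (path, line_count) for files over the threshold, largest first."""
    offenders = []
    for path in Path(root).rglob("*.py"):  # .py-only filter is an assumption
        try:
            count = sum(1 for _ in path.open("rb"))
        except OSError:
            continue  # unreadable files are skipped, never fatal
        if count > max_lines:
            offenders.append((str(path), count))
    return sorted(offenders, key=lambda item: -item[1])


if __name__ == "__main__":
    for name, count in oversized_files("."):
        print(f"WARN {name}: {count} lines (> {MAX_LINES})")
    # Always exits 0: non-blocking by design until the policy is promoted.
```

Because the script never exits non-zero, wiring it into prek or CI cannot break anyone's flow while thresholds are being calibrated.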
Acceptance Criteria
- Behavior parity verified by existing benchmark harness tests plus added regression tests for refactor boundaries.
- just eval flows remain stable (offline/stream compare + baseline calibrations).
- Repo-wide size check exists and reports offenders beyond configured thresholds.
- Documentation updated with rationale and maintenance path.
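One way to verify behavior parity across the refactor boundary is a golden-report comparison: capture a benchmark report before the split, then assert the post-split harness reproduces it. The helper below is a sketch under the assumption that the harness can emit its report as JSON; the tolerance value is illustrative.

```python
"""Parity check: compare a fresh report against a pre-refactor golden copy."""
import json
from pathlib import Path


def assert_report_parity(new_report: dict, golden_path: str, float_tol: float = 1e-9) -> None:
    """Keys and nesting must match exactly; floats within float_tol pass,
    so harmless serialization differences do not mask real metric drift."""
    golden = json.loads(Path(golden_path).read_text())
    _compare(new_report, golden, path="$", tol=float_tol)


def _compare(new, old, path, tol):
    if isinstance(old, dict):
        assert isinstance(new, dict) and new.keys() == old.keys(), f"{path}: keys differ"
        for key in old:
            _compare(new[key], old[key], f"{path}.{key}", tol)
    elif isinstance(old, list):
        assert isinstance(new, list) and len(new) == len(old), f"{path}: length differs"
        for i, (a, b) in enumerate(zip(new, old)):
            _compare(a, b, f"{path}[{i}]", tol)
    elif isinstance(old, float):
        assert abs(new - old) <= tol, f"{path}: {new} != {old}"
    else:
        assert new == old, f"{path}: {new} != {old}"
```

Running this once per refactor boundary (io, metrics, runtime, reporting) gives a regression test that fails loudly on any change in report shape or metric values.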
Notes
This issue is intentionally deferred until active worktree changes settle.