docs(c04): honest reconciliation — Q1+Q2 not measured, Q3 affirmed#59

Open: epappas wants to merge 1 commit into main from doc-c04-outcome-honest-reconciliation

epappas (Owner) commented May 27, 2026

Summary

  • Fills in the C04 pre-registration's Outcome section with honest reconciliation: status INCOMPLETE, Q1+Q2 not measured, Q3 affirmed at 100% kept-rate, Q4 partial. Six of eight predicted outcomes marked NOT EVALUABLE because workload mismatch made the comparison structurally invalid through all eleven attempts.
  • Closes T-38, T-39, T-40, T-41 in TODO.md and opens P0 ticket T-42 (C04a attempt 12 — clean relaunch after T-41).
  • 437 CPU tests still passing — no code change in this PR, docs and backlog only.

What this PR is and isn't

This is the prediction-vs-reality reconciliation that the pre-registration discipline requires when a campaign doesn't reach its planned verdict. Eleven C04a launches between 2026-05-26 and 2026-05-27 produced ten integration-layer fixes (PRs #49-#58) but never produced a comparable goodput dataset against T-37's auto_tune baseline. The Outcome section now records that honestly, including the 3.05x cost overrun (~$9.77 actual vs $3.20 pre-reg estimate).
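The overrun figure follows directly from the two dollar amounts quoted above. A minimal sketch of the arithmetic, assuming a hypothetical helper named `overrun_ratio` (not part of this repo):

```python
def overrun_ratio(actual_usd: float, estimate_usd: float) -> float:
    """Return actual spend as a multiple of the pre-registered estimate."""
    if estimate_usd <= 0:
        raise ValueError("estimate must be positive")
    return actual_usd / estimate_usd


# Figures quoted in this PR: ~$9.77 actual vs $3.20 pre-reg estimate.
ratio = overrun_ratio(9.77, 3.20)
print(f"{ratio:.2f}x")  # 3.05x
```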

This is not a campaign re-launch and not an analysis writeup. The analysis writeup at docs/research/references/12-c04-outcome.md is gated on attempt 12 producing usable data — that is T-42's job.

What was actually achieved (citable)

What's still open

T-42 (P0, in `TODO.md`): C04a attempt 12 — clean relaunch on 2x A100 spot now that the workload matches T-37 Baseline B. Requires user GPU-spend authorization. Estimated $1.50-2.50.
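Since the earlier attempts were invalidated by a workload mismatch (128/64 in the driver vs 256/20 in T-37 Baseline B), a pre-launch guard could fail fast before any GPU spend. The sketch below is illustrative only: `Workload`, its field names, and `check_workload` are hypothetical, and this PR does not say what each of the two numbers means.

```python
from typing import NamedTuple


class Workload(NamedTuple):
    # Hypothetical field names: the PR gives only the number pairs.
    first: int
    second: int


# T-37 Baseline B workload as quoted in this PR (256/20).
BASELINE_B = Workload(256, 20)


def check_workload(driver: Workload, baseline: Workload = BASELINE_B) -> None:
    """Raise before launch if the driver workload differs from the baseline."""
    if driver != baseline:
        raise ValueError(
            f"workload mismatch: driver={tuple(driver)} baseline={tuple(baseline)}"
        )


check_workload(Workload(256, 20))  # matches the baseline, returns None
```

Calling `check_workload(Workload(128, 64))` would raise `ValueError`, which is the condition that kept attempts 1-11 from producing comparable data.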

Test plan

  • `uv run pytest -q -m "not gpu and not basilica"` — 437 passed in 8.69s
  • Pre-reg Outcome section now has Status, headline numbers, reconciliation table, per-Q analysis, bugs+fixes table, what's still open, cost actually spent, artifacts, next-session restart point, and closing
  • TODO.md: T-38/T-39/T-40/T-41 rows added to Closed; C04a attempts 1-11 row added to Closed; T-42 row added to P0
  • No code changes; no GPU spend
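The `-m "not gpu and not basilica"` expression in the first bullet deselects tests carrying either marker. A minimal sketch of how such markers could be registered in a `conftest.py`, assuming the repo uses pytest's standard marker mechanism; the marker descriptions are assumptions, not taken from this repo:

```python
# conftest.py (sketch)
def pytest_configure(config):
    """Register custom markers so `-m` expressions can select on them."""
    # Descriptions are assumed for illustration.
    config.addinivalue_line("markers", "gpu: test requires a GPU")
    config.addinivalue_line("markers", "basilica: test requires the basilica backend")
```

With markers registered this way, a test decorated with `@pytest.mark.gpu` is skipped by the CPU-only command above and picked up by `-m gpu`.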

…irmed

Eleven C04a launch attempts between 2026-05-26 and 2026-05-27 produced
zero comparable goodput datasets but landed ten integration-layer fixes
(PRs #49-#58). The pre-reg's Outcome section is filled in with honest
reconciliation:

- Status: INCOMPLETE
- Q1 (C04a 2-knob surrogate vs grid): NOT MEASURED. The workload mismatch
  (driver hardcoded 128/64 vs T-37 Baseline B's 256/20) persisted until
  T-41 landed within attempt 11's budget
- Q2 (C04b full surface): NOT LAUNCHED
- Q3 (T-26c kept-rate >=30%): AFFIRMED at 100% (20/20) at attempt 7 —
  T-26c surrogate validated under real serving (caveat: gate KL ceiling
  auto-calibrated wider than configured 2.0 under shared-GPU contention)
- Q4 (goodput-axis wiring): partial — wiring correct, values 0.0 because
  workload blocked any rate from meeting SLO
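The Q3 verdict reduces to a threshold check on the kept-rate. A minimal sketch using the numbers above (20 kept of 20, threshold 30%); `kept_rate` and `q3_affirmed` are hypothetical names, not functions from this repo:

```python
def kept_rate(kept: int, total: int) -> float:
    """Fraction of candidates kept."""
    if total <= 0:
        raise ValueError("total must be positive")
    return kept / total


def q3_affirmed(kept: int, total: int, threshold: float = 0.30) -> bool:
    """True when the kept-rate meets the pre-registered threshold (>=30%)."""
    return kept_rate(kept, total) >= threshold


print(kept_rate(20, 20))    # 1.0
print(q3_affirmed(20, 20))  # True
```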

Six of eight predicted outcomes marked NOT EVALUABLE — the workload
mismatch made the comparison structurally invalid through all eleven
attempts.

Total spend ~$9.77 vs $3.20 pre-reg estimate (3.05x overrun); concentrated
in fixes that turned out necessary for ANY C04-shape comparison.

TODO.md updated:
- New P0 ticket T-42 — C04a attempt 12 (clean relaunch after T-41)
- Closed rows for T-38, T-39, T-40, T-41, and the C04a attempts 1-11
  chain referencing PRs #49-#58

437 CPU tests still passing.

Implements: P10 (pre-registration discipline; honest reconciliation
            when predictions don't match reality)
Evidence:   C04 pre-reg's outcome buckets and probability statements