docs(c04): honest reconciliation - Q1+Q2 not measured, Q3 affirmed #59
Open
epappas wants to merge 1 commit into
…irmed

Eleven C04a launch attempts between 2026-05-26 and 2026-05-27 produced zero comparable goodput datasets but landed ten integration-layer fixes (PRs #49-#58). The pre-reg's Outcome section is filled in with honest reconciliation:

- Status: INCOMPLETE
- Q1 (C04a 2-knob surrogate vs grid): NOT MEASURED. Workload mismatch (driver hardcoded 128/64 vs T-37 Baseline B's 256/20) persisted until T-41 landed in attempt 11 budget
- Q2 (C04b full surface): NOT LAUNCHED
- Q3 (T-26c kept-rate >=30%): AFFIRMED at 100% (20/20) at attempt 7. T-26c surrogate validated under real serving (caveat: gate KL ceiling auto-calibrated wider than configured 2.0 under shared-GPU contention)
- Q4 (goodput-axis wiring): partial. Wiring correct, values 0.0 because workload blocked any rate from meeting SLO

Six of eight predicted outcomes marked NOT EVALUABLE: the workload mismatch made the comparison structurally invalid through all eleven attempts. Total spend ~$9.77 vs $3.20 pre-reg estimate (3.05x overrun); concentrated in fixes that turned out necessary for ANY C04-shape comparison.

TODO.md updated:

- New P0 ticket T-42: C04a attempt 12 (clean relaunch after T-41)
- Closed rows for T-38, T-39, T-40, T-41, and the C04a attempts 1-11 chain referencing PRs #49-#58

437 CPU tests still passing.

Implements: P10 (pre-registration discipline; honest reconciliation when predictions don't match reality)
Evidence: C04 pre-reg's outcome buckets and probability statements
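The Q3 verdict above is a simple threshold check, which can be sketched as follows. This is only an illustration under stated assumptions: the function name `kept_rate` and the constant `KEPT_RATE_THRESHOLD` are hypothetical and do not come from the repository; only the figures (20 kept out of 20, threshold 30%) are taken from the commit message.

```python
# Hypothetical sketch of the Q3 kept-rate check; names are illustrative,
# not the project's actual API.
KEPT_RATE_THRESHOLD = 0.30  # pre-registered threshold for Q3 (>=30%)


def kept_rate(kept: int, total: int) -> float:
    """Fraction of candidates kept by the gate."""
    if total <= 0:
        raise ValueError("total must be positive")
    return kept / total


# Figures reported for attempt 7: 20 of 20 kept.
rate = kept_rate(20, 20)
q3_affirmed = rate >= KEPT_RATE_THRESHOLD
print(f"kept-rate = {rate:.0%}, Q3 affirmed: {q3_affirmed}")
```

Note that this check says nothing about the KL-ceiling caveat recorded alongside Q3; it only restates the threshold comparison.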
Summary
… `TODO.md` and opens P0 ticket T-42 (C04a attempt 12: clean relaunch after T-41).

What this PR is and isn't
This is the prediction-vs-reality reconciliation that the pre-registration discipline requires when a campaign doesn't reach its planned verdict. Eleven C04a launches between 2026-05-26 and 2026-05-27 produced ten integration-layer fixes (PRs #49-#58) but never produced a comparable goodput dataset against T-37's auto_tune baseline. The Outcome section now records that honestly, including the 3.05x cost overrun (~$9.77 actual vs $3.20 pre-reg estimate).
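The overrun figure quoted above can be reproduced from the two dollar amounts. The snippet below is a minimal sketch of that arithmetic; the helper name `overrun_ratio` is an assumption made for illustration, and the actual spend is itself reported as approximate (~$9.77).

```python
# Hypothetical helper reproducing the overrun arithmetic; not project code.
def overrun_ratio(actual_usd: float, estimate_usd: float) -> float:
    """Ratio of actual spend to the pre-registered estimate."""
    if estimate_usd <= 0:
        raise ValueError("estimate must be positive")
    return actual_usd / estimate_usd


ratio = overrun_ratio(actual_usd=9.77, estimate_usd=3.20)
print(f"{ratio:.2f}x overrun")
```

With the reported amounts, the ratio rounds to 3.05, matching the figure recorded in the Outcome section.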
This is not a campaign re-launch and not an analysis writeup. The analysis writeup at `docs/research/references/12-c04-outcome.md` is gated on attempt 12 producing usable data; that is T-42's job.

What was actually achieved (citable)
What's still open
T-42 (P0, in `TODO.md`): C04a attempt 12 — clean relaunch on 2x A100 spot now that the workload matches T-37 Baseline B. Requires user GPU-spend authorization. Estimated $1.50-2.50.
Test plan