docs(c04): honest reconciliation — Q1+Q2 not measured, Q3 affirmed by epappas · Pull Request #59 · epappas/autoinfer

epappas · 2026-05-27T15:54:00Z

Summary

Fills in the C04 pre-registration's Outcome section with honest reconciliation: status INCOMPLETE, Q1+Q2 not measured, Q3 affirmed at 100% kept-rate, Q4 partial. Six of eight predicted outcomes marked NOT EVALUABLE because workload mismatch made the comparison structurally invalid through all eleven attempts.
Closes T-38, T-39, T-40, T-41 in TODO.md and opens P0 ticket T-42 (C04a attempt 12 — clean relaunch after T-41).
437 CPU tests still passing — no code change in this PR, docs and backlog only.

What this PR is and isn't

This is the prediction-vs-reality reconciliation that the pre-registration discipline requires when a campaign doesn't reach its planned verdict. Eleven C04a launches between 2026-05-26 and 2026-05-27 produced ten integration-layer fixes (PRs #49-#58) but never produced a comparable goodput dataset against T-37's auto_tune baseline. The Outcome section now records that honestly, including the 3.05x cost overrun (~$9.77 actual vs $3.20 pre-reg estimate).

This is not a campaign re-launch and not an analysis writeup. The analysis writeup at docs/research/references/12-c04-outcome.md is gated on attempt 12 producing usable data — that is T-42's job.

What was actually achieved (citable)

T-26c L1 surrogate validated at 100% kept-rate (20/20) at attempt 7 — far above the pre-reg's 30% Outcome A3 threshold. The per-FailureKind sub-classifier is doing its job under real serving. Caveat: under shared-GPU contention the gate's KL ceiling auto-calibrated wider than configured (effective ~16 vs configured 2.0); the tight-gate kept-rate awaits a 2-GPU re-run.
Ten integration-layer fixes landed, each with regression tests:
- PR #49: `--image` flag in `launch_joint_campaign.sh`
- PR #50: `--model` flag in `launch_joint_campaign.sh`
- PR #51: reference replica `vllm serve --max-model-len` capped to 4096
- PR #52: candidate `max_model_len` env-var cap in 1-GPU mode + per-trial stderr archival
- PR #53: `--gpu-memory-utilization` (GMU) injected when the env cap is set and the catalog omits GMU, parallel to the existing clamp
- PR #54: `--goodput` lowercase metric names for vLLM (TTFT to ttft, E2E to e2el)
- PR #55 (T-38): driver rate-search mirroring the `auto_tune.sh` rate-down algorithm
- PR #56 (T-39): `os.killpg` process-group kill of the candidate vllm tree, with SIGTERM to SIGKILL escalation
- PR #57 (T-40): explicit `--percentile-metrics ttft,tpot,itl,e2el` + per-bench timeout tightened from 1800s to 600s
- PR #58 (T-41): workload params (`random_input_len`/`random_output_len`) threaded through DriverConfig
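
For readers unfamiliar with the T-39 fix, here is a minimal sketch of a process-group kill with SIGTERM to SIGKILL escalation. It is not the code from PR #56: the function name `kill_process_group` and the `grace_s` parameter are hypothetical, and it assumes the child was started with `start_new_session=True` on a POSIX system so that its PID doubles as its process-group ID.

```python
import os
import signal
import subprocess


def kill_process_group(proc: subprocess.Popen, grace_s: float = 10.0) -> int:
    """Stop `proc` and every process in its group; return the leader's exit code.

    Assumes `proc` was launched with start_new_session=True, so the group ID
    equals proc.pid and signalling the group cannot hit the parent.
    """
    pgid = proc.pid
    try:
        # Polite request first: lets the server tree shut down cleanly.
        os.killpg(pgid, signal.SIGTERM)
    except ProcessLookupError:
        # Group is already gone; just collect the exit status.
        return proc.wait()

    try:
        proc.wait(timeout=grace_s)
    except subprocess.TimeoutExpired:
        pass  # Leader ignored SIGTERM; fall through to SIGKILL.

    try:
        # Sweep: kills a stuck leader and any workers that outlived it.
        os.killpg(pgid, signal.SIGKILL)
    except ProcessLookupError:
        pass  # Nothing left in the group.
    return proc.wait()
```

Signalling the group rather than the single PID is what matters here: a plain `proc.terminate()` reaches only the leader, so worker processes it spawned can survive and keep holding GPU memory.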

What's still open

T-42 (P0, in `TODO.md`): C04a attempt 12 — clean relaunch on 2x A100 spot now that the workload matches T-37 Baseline B. Requires user GPU-spend authorization. Estimated $1.50-2.50.
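
To make "the workload matches" concrete, the sketch below shows one way workload parameters and lowercase goodput metric names could flow from a config object into benchmark arguments. It is an illustration under assumptions, not the repository's code: the `DriverConfig` fields other than `random_input_len`/`random_output_len`, the helper `build_bench_args`, and the SLO values are hypothetical, and the 256/20 figures are read from this PR's commit message as input/output lengths.

```python
from dataclasses import dataclass, field

# Canonical lowercase names for goodput SLOs (e.g. TTFT to ttft, E2E to e2el).
_METRIC_ALIASES = {"ttft": "ttft", "tpot": "tpot", "e2e": "e2el", "e2el": "e2el"}


@dataclass(frozen=True)
class DriverConfig:
    # Workload shape; previously hardcoded in the driver per this PR.
    random_input_len: int
    random_output_len: int
    # Hypothetical field: SLO thresholds in milliseconds, keyed by metric name.
    goodput_slo_ms: dict[str, float] = field(default_factory=dict)
    percentile_metrics: tuple[str, ...] = ("ttft", "tpot", "itl", "e2el")


def build_bench_args(cfg: DriverConfig, request_rate: float) -> list[str]:
    """Build benchmark CLI arguments from the config (hypothetical helper)."""
    args = [
        "--dataset-name", "random",
        "--random-input-len", str(cfg.random_input_len),
        "--random-output-len", str(cfg.random_output_len),
        "--request-rate", str(request_rate),
        "--percentile-metrics", ",".join(cfg.percentile_metrics),
    ]
    if cfg.goodput_slo_ms:
        args.append("--goodput")
        for name, ms in cfg.goodput_slo_ms.items():
            # Raises KeyError on an unknown metric instead of passing it through.
            args.append(f"{_METRIC_ALIASES[name.lower()]}:{ms}")
    return args


# Example: a Baseline-B-shaped workload with illustrative (made-up) SLOs.
cfg = DriverConfig(256, 20, goodput_slo_ms={"TTFT": 500, "E2E": 2000})
print(build_bench_args(cfg, request_rate=4.0))
```

Keeping the workload in the config rather than in the driver means a baseline and a candidate run can be checked for the same input/output lengths before any GPU time is spent.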

Test plan

`uv run pytest -q -m "not gpu and not basilica"` — 437 passed in 8.69s
Pre-reg Outcome section now has Status, headline numbers, reconciliation table, per-Q analysis, bugs+fixes table, what's still open, cost actually spent, artifacts, next-session restart point, and closing
TODO.md: T-38/T-39/T-40/T-41 rows added to Closed; C04a attempts 1-11 row added to Closed; T-42 row added to P0
No code changes; no GPU spend

…irmed

Eleven C04a launch attempts between 2026-05-26 and 2026-05-27 produced zero comparable goodput datasets but landed ten integration-layer fixes (PRs #49-#58).

The pre-reg's Outcome section is filled in with honest reconciliation:
- Status: INCOMPLETE
- Q1 (C04a 2-knob surrogate vs grid): NOT MEASURED. Workload mismatch (driver hardcoded 128/64 vs T-37 Baseline B's 256/20) persisted until T-41 landed in attempt 11 budget
- Q2 (C04b full surface): NOT LAUNCHED
- Q3 (T-26c kept-rate >=30%): AFFIRMED at 100% (20/20) at attempt 7. T-26c surrogate validated under real serving (caveat: gate KL ceiling auto-calibrated wider than configured 2.0 under shared-GPU contention)
- Q4 (goodput-axis wiring): partial. Wiring correct, values 0.0 because workload blocked any rate from meeting SLO

Six of eight predicted outcomes marked NOT EVALUABLE: the workload mismatch made the comparison structurally invalid through all eleven attempts.

Total spend ~$9.77 vs $3.20 pre-reg estimate (3.05x overrun); concentrated in fixes that turned out necessary for ANY C04-shape comparison.

TODO.md updated:
- New P0 ticket T-42: C04a attempt 12 (clean relaunch after T-41)
- Closed rows for T-38, T-39, T-40, T-41, and the C04a attempts 1-11 chain referencing PRs #49-#58

437 CPU tests still passing.

Implements: P10 (pre-registration discipline; honest reconciliation when predictions don't match reality)
Evidence: C04 pre-reg's outcome buckets and probability statements

epappas wants to merge 1 commit into `main` from `doc-c04-outcome-honest-reconciliation`
