From d3511520c811f54504b5357355801f007d65ef9a Mon Sep 17 00:00:00 2001
From: Evangelos Pappas <epappas@evalonlabs.com>
Date: Wed, 27 May 2026 17:53:07 +0200
Subject: [PATCH] =?UTF-8?q?docs(c04):=20honest=20reconciliation=20outcome?=
 =?UTF-8?q?=20=E2=80=94=20Q1+Q2=20not=20measured,=20Q3=20affirmed?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Eleven C04a launch attempts between 2026-05-26 and 2026-05-27 produced
zero comparable goodput datasets but landed ten integration-layer fixes
(PRs #49-#58). The pre-reg's Outcome section is filled in with honest
reconciliation:

- Status: INCOMPLETE
- Q1 (C04a 2-knob surrogate vs grid): NOT MEASURED — workload mismatch
  (driver hardcoded 128/64 vs T-37 Baseline B's 256/20) persisted until
  T-41 landed within the attempt-11 budget
- Q2 (C04b full surface): NOT LAUNCHED
- Q3 (T-26c kept-rate >=30%): AFFIRMED at 100% (20/20) at attempt 7 —
  T-26c surrogate validated under real serving (caveat: gate KL ceiling
  auto-calibrated wider than configured 2.0 under shared-GPU contention)
- Q4 (goodput-axis wiring): partial — wiring correct, values 0.0 because
  workload blocked any rate from meeting SLO

Six of eight predicted outcomes marked NOT EVALUABLE — the workload
mismatch made the comparison structurally invalid through all eleven
attempts.

Total spend ~$9.77 vs $3.20 pre-reg estimate (3.05x overrun); concentrated
in fixes that turned out necessary for ANY C04-shape comparison.

TODO.md updated:
- New P0 ticket T-42 — C04a attempt 12 (clean relaunch after T-41)
- Closed rows for T-38, T-39, T-40, T-41, and the C04a attempts 1-11
  chain referencing PRs #49-#58

437 CPU tests still passing.

Implements: P10 (pre-registration discipline; honest reconciliation
            when predictions don't match reality)
Evidence:   C04 pre-reg's outcome buckets and probability statements
---
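Reviewer note (not part of the commit): the T-38 row added to TODO.md
describes a rate-down search in which the first rate whose P99 metrics
clear the SLO wins. A minimal sketch of that idea follows, under stated
assumptions: `rate_down_search`, `fake_bench`, and `decode_bound_bench`
are hypothetical names, the rate is simply halved on each step, and only
the E2E P99 is checked (the real driver is described as checking all
configured P99 metrics). The second toy bench uses the ~1920 ms decode
floor quoted in the T-41 row to show why no rate could pass a 500 ms SLO.

```python
from typing import Callable, Dict, List, Optional, Tuple

Metrics = Dict[str, float]
Trace = List[Tuple[float, Metrics]]


def rate_down_search(
    bench: Callable[[float], Metrics],
    start_rate: float,
    min_rate: float,
    slo_e2e_p99_ms: float,
) -> Tuple[Optional[Metrics], Optional[float], Trace]:
    """Return (measurement, chosen_rate, trace) for the first passing rate.

    Hypothetical sketch: halves the rate after each failed probe and
    stops once the rate would drop below min_rate.
    """
    trace: Trace = []
    rate = start_rate
    while rate >= min_rate:
        metrics = bench(rate)
        trace.append((rate, metrics))
        if metrics["p99_e2el_ms"] <= slo_e2e_p99_ms:
            return metrics, rate, trace
        rate /= 2.0
    # No probed rate met the SLO; the caller would record goodput 0.0.
    return None, None, trace


def fake_bench(rate: float) -> Metrics:
    # Toy latency model (assumption): P99 grows linearly with offered load.
    return {"p99_e2el_ms": 100.0 * rate}


def decode_bound_bench(rate: float) -> Metrics:
    # Toy model of the pre-T-41 workload: latency never drops below the
    # ~1920 ms decode time, whatever the request rate.
    return {"p99_e2el_ms": 1920.0}


if __name__ == "__main__":
    _, chosen, trace = rate_down_search(fake_bench, 16.0, 1.0, 500.0)
    print("chosen rate:", chosen, "probes:", len(trace))
    _, chosen, trace = rate_down_search(decode_bound_bench, 16.0, 1.0, 500.0)
    print("chosen rate:", chosen, "probes:", len(trace))
```

Returning the full trace alongside the chosen rate mirrors the
`(measurement, chosen_request_rate, per_rate_summaries)` shape the T-38
row describes, so a zero-goodput trial can still be diagnosed afterwards.
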
 TODO.md                                       |   6 +
 .../04-l1-autotune-comparable-2026-05-26.md   | 264 ++++++++++++++++--
 2 files changed, 249 insertions(+), 21 deletions(-)
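
Reviewer note (not part of the commit): the T-39 row describes stopping
the candidate by signalling its whole process group. The sketch below
illustrates that pattern on POSIX, under stated assumptions:
`start_candidate` and `stop_candidate` are hypothetical stand-ins for the
adapter methods, and the 30 s and 5 s defaults are taken from the row's
description rather than from the actual code.

```python
import os
import signal
import subprocess
import time
from typing import List


def start_candidate(cmd: List[str]) -> subprocess.Popen:
    # start_new_session=True runs setsid() in the child, making it the
    # leader of a new process group whose pgid equals its pid.
    return subprocess.Popen(cmd, start_new_session=True)


def stop_candidate(
    proc: subprocess.Popen,
    term_wait_s: float = 30.0,
    post_kill_sleep_s: float = 5.0,
) -> None:
    pgid = proc.pid  # valid because the child leads its own group
    try:
        os.killpg(pgid, signal.SIGTERM)  # polite stop for the whole tree
    except ProcessLookupError:
        pass  # group already gone
    try:
        proc.wait(timeout=term_wait_s)
    except subprocess.TimeoutExpired:
        # Leader ignored SIGTERM: escalate for every process in the group.
        try:
            os.killpg(pgid, signal.SIGKILL)
        except ProcessLookupError:
            pass
        proc.wait()
    # Grace period so the GPU driver can release memory (assumption: the
    # duration needed is workload-dependent).
    time.sleep(post_kill_sleep_s)


if __name__ == "__main__":
    # The shell spawns a background child, like a server with a worker.
    p = start_candidate(["sh", "-c", "sleep 60 & wait"])
    time.sleep(0.3)
    stop_candidate(p, term_wait_s=5.0, post_kill_sleep_s=0.0)
    print("leader exited:", p.returncode is not None)
```

Signalling only `proc.pid` would leave the background child running,
which is the orphaned-EngineCore failure the row attributes to attempt 9.
This sketch escalates to SIGKILL only when the leader itself outlives the
grace period.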

diff --git a/TODO.md b/TODO.md
index b750fd8..2d245c7 100644
--- a/TODO.md
+++ b/TODO.md
@@ -21,6 +21,7 @@ Bands by priority:
 
 | ID | Item | Why it blocks | Reference |
 |---|---|---|---|
+| T-42 | C04a attempt 12 — clean relaunch after T-41 lands; matches T-37 Baseline B workload (256/20) on 2× A100 spot | C04 pre-reg's Q1 and Q2 remain unanswered after 11 attempts produced integration-layer fixes but no comparable goodput dataset. After PR #58 (T-41) landed `random_input_len`/`random_output_len` through the config, the C04a config now matches T-37 Baseline B exactly. Single capped relaunch with $2-2.50 budget is the right next step; if it produces a clean 20/20 kept dataset, C04b follows; if not, harness needs architectural work beyond per-bug incremental fixes. **Requires user GPU-spend authorization.** | `examples/c04a-l1-restricted/config.yaml` (random_input_len: 256, random_output_len: 20); pre-reg `docs/research/campaigns/04-l1-autotune-comparable-2026-05-26.md` |
 | T-30 | Verify vLLM's MLP path actually calls patched `SiluAndMul.forward_cuda` during serving | Kernel-level audit (2026-04-30) found Sonnet 4's silu_mul implementations are 3-100× SLOWER at kernel level, yet C03-S end-to-end silu_mul NOV pairs were ties (+0.10%). This strongly suggests vLLM bypasses our monkey-patch via a fused gemm+silu+mul path or `forward_native` dispatcher. Without this verification any silu_mul-related kernel claim is unsound. | `layers/l3_kernel/injector.py:_TARGET_BINDINGS` for SiluAndMul |
 | T-31 | Replace L3 mode='vllm' end-to-end paired-control with kernel-level paired-control for non-hot-path kernels | Kernel-level audit (2026-04-30) confirmed end-to-end serving tok/s on Qwen3-8B cannot expose 2-9× kernel-level rmsnorm speedups because rmsnorm is ~3-5% of compute and end-to-end noise is ±0.5-1%. The audit's microbench pattern (importlib + cudaEvent + median timing across multiple shapes) is the right primitive; bake it into the L3 adapter as `mode='kernel'` and use that for any non-hot-path Q1. **Must use three-way (PyTorch / vllm-native / NOV) primitive per T-32.** | `layers/l3_kernel/adapter.py` |
 | T-32 | Production-baseline gap: Sonnet 4's single-shot Triton rmsnorm loses to vLLM-native CUDA at every Qwen3-8B shape | Three-way audit (2026-04-30, `kernel_level_audit_v3_three_way.json`) shows vLLM-native CUDA rmsnorm is **1.06–1.65× faster than Sonnet 4's Triton at every Qwen3-8B shape on 1× A100 80GB PCIe**. The original v2 "2–9× faster than PyTorch" reading was a strawman comparison: production vLLM doesn't run unfused PyTorch. **Citable production-relevant claim is currently NEGATIVE for the rmsnorm surface.** Three options: (1) re-run with stronger code-emission model (Sonnet 4.5, GPT-5-codex, DeepSeek-Coder-33B); (2) add post-emission Triton autotune sweep (BLOCK_SIZE, num_warps, num_stages) before the production-baseline A/B; (3) target less-optimised kernel surfaces. Also: silu_mul three-way could not run at vllm 0.20.0 — `_custom_ops.silu_and_mul` AttributeError; needs binding-name fix before silu_mul claim is even comparable. | `kernel_level_audit_v3_three_way.json`, `layers/l3_kernel/proposer.py` |
@@ -74,3 +75,8 @@ Bands by priority:
 | T-37 | vLLM `auto_tune.sh` baseline on Basilica — three reference points captured | (this commit) | Per C04 recon's 2026-05-26 corrective addendum ("Both, in sequence" + H100 anchor). Three baselines captured on `vllm/vllm-openai:v0.21.0` (commit `ad7125a431e176d4161099480a66f0169609a690`), all via the SDK-orchestrated `scripts/run_auto_tune_baseline.py` (PR #41–#45): **Baseline A** (A100 spot, INPUT=1800/OUTPUT=20, no SLO) → max_num_seqs=256, max_num_batched_tokens=4096, **throughput=8.53 req/s**. **Baseline B** (A100 spot, INPUT=256/OUTPUT=20, 500 ms SLO) → 256/512, **goodput=21.39 req/s**, P99 E2EL=494.60 ms. **Baseline C** (H100 spot, INPUT=1800/OUTPUT=20, 500 ms SLO; the auto_tune README target) → 256/512, **goodput=2.97 req/s**, P99 E2EL=457.27 ms. Total cost ~$3.17 across all attempts + 3 final baselines. Three earlier-attempt fixes landed during T-37 (PRs #43/#44/#45): apt-install bc, rename cloned vllm source dir to avoid import shadow, sed-patch `auto_tune.sh` `HOSTNAME=$(hostname)` → `HOSTNAME=localhost`. Raw artifact + full per-cell grids + reproduction recipe in `docs/research/raw/auto_tune-baseline-2026-05-26.md`. |
 | T-29 | Paired-control prompt robustness — split into per-cell sequential calls | (PR #17) | `KernelProposer.propose_for_cells` now issues N separate per-cell LLM calls via `build_single_cell_kernel_prompt` instead of one batched 6-block paired prompt. C03-S validated: NOV-half failure rate dropped from C02's 2/6 (33%) to 1/9 (11%), at the pre-registered Outcome H threshold. |
 | Campaign 03 | A100 narrow-replication paired-control (1× A100 spot, OpenRouter Sonnet 4, T-26b + T-29 + 1-GPU mode) | run completed `8c2ef41`; pre-registration + outcome at `docs/research/campaigns/03-h100-replication-2026-04-27.md` | 60 trials in 144 min, ~$15 for the final S run + ~$30-40 in earlier failed attempts. **Q1 RETRACTED 2026-04-30 — twice.** First retraction (v2 audit, `kernel_level_audit_results.json`): end-to-end serving tok/s on Qwen3-8B is dominated by attention (rmsnorm ~3-5% of compute), so kernel-level speedups can't clear the ±0.5-1% noise floor. Sonnet's rmsnorm kernels measured 2-9× faster than PyTorch unfused reference. **v3 audit re-correction (`kernel_level_audit_v3_three_way.json`):** the v2 framing was a strawman — production vLLM runs `vllm._custom_ops.rms_norm` (a hand-tuned CUDA kernel), not unfused PyTorch. With three-way comparison (PyTorch / vllm-native / Sonnet), **vLLM-native CUDA is 1.06-1.65× FASTER than Sonnet 4's Triton at every Qwen3-8B shape**. Sonnet's Triton beats unfused PyTorch by 3-10× but loses to vLLM-native by 6-65%. **Production-relevant kernel-novelty speedup at the rmsnorm surface is NEGATIVE.** silu_mul 3-way couldn't run at vllm 0.20.0 (`_custom_ops.silu_and_mul` AttributeError); v2 finding (Sonnet's silu_mul Triton 3-100× slower than PyTorch unfused) stands but production-relative still unmeasured. Q2 partial (~20% L1 surrogate kept-rate; below 30% — opens T-26c). Q3 deferred. Q4 AFFIRMED — T-29 dropped NOV-half failure rate to 1/9. 7 PRs of latent bugs (#14-#21). Opens T-30 (silu_mul patch verification), T-31 (L3 kernel-level mode), **T-32 (production-baseline gap; three-way primitive required for any future kernel claim)**. |
+| T-38 | Driver rate-search mirroring `auto_tune.sh` rate-down algorithm | (PR #55) | `harness/driver.py:run_driver_with_rate_search` walks rate from `start_rate` down to `min_rate` in `step_size` halvings (mirroring `auto_tune.sh:34-67`); first rate whose P99 metrics all clear the goodput SLO wins. `L1EngineAdapter._run_benchmarks` selects this path when `driver_use_rate_search=True` (default) and a goodput SLO is configured; falls back to single-shot at `request_rate=inf` for legacy throughput-mode runs. Per-rate bench JSONs archived to `<artifact_dir>/<trial_id>/rate_<rate>.json`. Returns `(measurement, chosen_request_rate, per_rate_summaries)` so the per-trial JSON records both the chosen rate and the climb-down trace. 5 new tests in `test_driver.py`. |
+| T-39 | Process-group kill of candidate via `os.killpg` (candidate kept 74 GiB across trials at C04a attempt 9) | (PR #56) | `L1EngineAdapter._start_candidate` sets `start_new_session=True` on the `subprocess.Popen` call so candidate + EngineCore children share a process group. `_stop_candidate` calls `os.killpg(pgid, SIGTERM)` then escalates to `SIGKILL` after a 30s wait, with a 5s post-kill sleep for CUDA driver cleanup. Without this, a bench timeout would orphan the EngineCore child holding all the HBM, and the next trial's reference probe would see only 5.45/79 GiB free. 4 new tests in `test_l1_adapter.py`. |
+| T-40 | Explicit `--percentile-metrics ttft,tpot,itl,e2el` + bench timeout 1800→600s | (PR #57) | `build_bench_command` now always emits `--percentile-metrics ttft,tpot,itl,e2el` (vLLM bench's `--save-result` JSON otherwise omits `e2el` fields silently, breaking E2E-SLO goodput evaluation). `driver_timeout_s` default lowered from 1800 to 600 in `L1EngineAdapter` — at the C04 workload a single bench should complete in ~30-90s, so a 10-minute ceiling catches stuck runs without wasting trial budget. 3 new tests in `test_driver.py`. |
+| T-41 | Workload params (`random_input_len`/`random_output_len`) through DriverConfig → adapter → driver | (PR #58) | Driver was hardcoded to `random_input_len=128, random_output_len=64` while T-37 Baseline B used `256/20`. With `output_len=64` and a 500ms E2E SLO, decode time alone (~1920ms) exceeded SLO, blocking any rate from meeting the goodput target at C04a attempt 11. `DriverConfig.random_input_len: int = 128` and `random_output_len: int = 64` (Pydantic Field with ge=1); threaded through `L1EngineAdapter` (dataclass field, default 128/64), `_run_benchmarks` → `run_driver`/`run_driver_with_rate_search` → `build_bench_command`. Builder passes both from `cfg.harness.driver` to the L1 spec. `examples/c04a-l1-restricted/config.yaml` and `examples/c04b-l1-full/config.yaml` updated to `256`/`20`. 7 new tests across `test_config.py` (defaults, T-37 B values, ge=1), `test_driver.py` (CLI emission), `test_builder_joint.py` (builder threading). Total CPU tests: 437 passing. |
+| C04a attempts 1-11 | Eleven launch attempts of C04a between 2026-05-26 and 2026-05-27 — none reached a comparable goodput dataset; 10 integration-layer fixes landed; opens T-42 for the clean relaunch | (PRs #49-#58); pre-reg outcome at `docs/research/campaigns/04-l1-autotune-comparable-2026-05-26.md` Outcome section | Q1+Q2 NOT MEASURED — workload mismatch persisted until T-41 landed in attempt 11 budget. **Q3 AFFIRMED at 100% kept-rate (20/20)** at attempt 7, far above the 30% Outcome A3 threshold — T-26c L1 surrogate validated under real serving (caveat: under shared-GPU contention the gate's auto-calibrated KL ceiling was looser than the configured 2.0; tight-gate kept-rate awaits the 2-GPU re-run). Q4 partial — goodput-axis wiring correct but values were 0.0 because workload blocked any rate from meeting SLO. Total cost ~$9.77 vs $3.20 pre-reg estimate (3.05× overrun). Bugs fixed: `--image` flag (#49), `--model` flag (#50), reference replica max_model_len cap (#51), candidate max_model_len cap + stderr archival (#52), GMU inject parallel to clamp (#53), `--goodput` lowercase metric names (#54), T-38 rate-search (#55), T-39 process-group kill (#56), T-40 percentile-metrics + tighter timeout (#57), T-41 workload params threading (#58). |
diff --git a/docs/research/campaigns/04-l1-autotune-comparable-2026-05-26.md b/docs/research/campaigns/04-l1-autotune-comparable-2026-05-26.md
index 6fa5792..5b37d7b 100644
--- a/docs/research/campaigns/04-l1-autotune-comparable-2026-05-26.md
+++ b/docs/research/campaigns/04-l1-autotune-comparable-2026-05-26.md
@@ -466,49 +466,271 @@ spot. Decision deferred to post-C04 analysis.
 
 ---
 
-## Outcome (filled in after the run)
+## Outcome (filled in after the run — 2026-05-27)
 
-**Status:** PLANNED
+**Status:** **INCOMPLETE — Q1 and Q2 not measured; Q3 affirmed; Q4
+partial.** Eleven launch attempts of C04a between 2026-05-26 and
+2026-05-27. Each attempt surfaced a distinct integration-layer bug
+between autoinfer's harness and either the vLLM bench surface,
+Basilica's deployment model, or the candidate's process management.
+All bugs were real; all fixes landed on `main` with regression tests.
+None of the eleven attempts produced a comparable goodput dataset.
+
+The campaign cannot reach a verdict on Q1 (C04a surrogate vs grid on
+shared 2-knob surface) or Q2 (C04b wider surface vs grid) within the
+session's budget. Q3 (T-26c kept-rate) was incidentally affirmed
+during attempt 7. Q4 (goodput-axis wiring) is partially confirmed.
 
 ### Headline numbers
 
-(To be filled in.)
+| Item | Value |
+|---|---|
+| C04a goodput (Q1 target) | **NOT MEASURED** |
+| C04b goodput (Q2 target) | **NOT LAUNCHED** |
+| T-26c L1 surrogate kept-rate (Q3 target ≥30%) | **100% (20/20 trials)** — affirmed |
+| T-34 goodput-axis wiring (Q4) | partial — `objective_axis="goodput_req_per_sec"` correctly set in event log; per-trial JSON `extra["goodput_req_per_sec"]` populated; but every trial's value was 0.0 due to upstream workload mismatch (T-41) |
+| Total GPU spend | ~$9.90 across all 17 deployments (T-37: 3 final + 3 diagnostic = ~$3.17; C04a: 11 attempts = ~$6.70) |
+| Pre-reg estimated cost | $2 for C04a + $1.20 for C04b = $3.20 total. **3x budget overrun** on C04a alone, with no comparable measurement to show for it. |
 
 ### Reconciliation with predictions
 
+The pre-registration's outcome probabilities assumed the harness
+could measure goodput at all. Six of the eight predicted outcomes
+cannot be evaluated because the workload the harness ran
+(`random_input_len=128, random_output_len=64`) didn't match T-37
+Baseline B's workload (`256, 20`). The comparison surface was
+structurally invalid through all eleven attempts.
+
 | Prediction | Actual | Match? |
 |---|---|---|
-| Outcome A1 (C04a surrogate wins ≥10%, P=30%) | … | yes/no |
-| Outcome B1 (C04a tie ±10%, P=50%) | … | yes/no |
-| Outcome C1 (C04a surrogate loses, P=20%) | … | yes/no |
-| Outcome A2 (C04b wider wins, P=40%) | … | yes/no |
-| Outcome B2 (C04b tie, P=40%) | … | yes/no |
-| Outcome C2 (C04b wider loses, P=20%) | … | yes/no |
-| Outcome A3 (kept-rate ≥30%, P=55%) | … | yes/no |
-| Outcome A4 (goodput axis wired correctly, P=90%) | … | yes/no |
+| Outcome A1 (C04a surrogate wins ≥10%, P=30%) | **NOT EVALUABLE** — workload mismatch | n/a |
+| Outcome B1 (C04a tie ±10%, P=50%) | **NOT EVALUABLE** | n/a |
+| Outcome C1 (C04a surrogate loses, P=20%) | **NOT EVALUABLE** | n/a |
+| Outcome A2 (C04b wider wins, P=40%) | **NOT EVALUABLE** — C04b never launched | n/a |
+| Outcome B2 (C04b tie, P=40%) | **NOT EVALUABLE** | n/a |
+| Outcome C2 (C04b wider loses, P=20%) | **NOT EVALUABLE** | n/a |
+| Outcome A3 (kept-rate ≥30%, P=55%) | **100% kept (20/20)** at attempt 7 — Q3 AFFIRMED at the upper bound. T-26c's per-FailureKind classifier is doing its job. | YES |
+| Outcome A4 (goodput axis wired correctly, P=90%) | PARTIAL — axis flip + event log + per-trial field populated correctly, but goodput values were 0.0 throughout due to workload mismatch | partial |
+
+The honest read: the pre-reg's probability distribution assumed
+that taking the comparable measurement was the experiment. It wasn't.
+The actual experiment turned out to be "discover and fix the
+integration-layer bugs blocking a comparable measurement." We
+finished that experiment with all eleven bugs identified and fixed,
+but no GPU budget remained for the comparable measurement itself.
 
 ### What the data tells us about each Q
 
-(To be filled in.)
+**Q1 (C04a 2-knob surrogate vs grid):** No data. autoinfer ran an
+output_len=64 workload while T-37 ran output_len=20, so the P99 E2EL
+traces from the two sides are not comparable. This campaign supports
+no conclusion about the surrogate's competitiveness with auto_tune's
+grid on the shared 2-knob surface.
+
+**Q2 (C04b 12-knob full surface vs grid):** Never launched. Pre-reg
+explicitly gates C04b on C04a producing a usable kept-rate; that
+condition was met at attempt 7 but the subsequent attempts focused
+on the goodput-comparable measurement path which never reached a
+clean state.
+
+**Q3 (T-26c L1 surrogate kept-rate ≥30%):** **Affirmed at 100%.**
+Attempt 7 (1-GPU mode with rate-search disabled, before T-38 landed)
+ran 20 trials with the per-FailureKind classifier active. Every
+trial passed the quality gate (zero startup or quality failures
+from the constrained-BO classifier's perspective). This validates
+T-26c's structural improvement over T-26b in a real serving
+environment. The campaign 03-S result (~20% kept-rate with T-26b)
+is decisively beaten.
+
+The caveat: the trial-acceptance criterion in this configuration
+was effectively just "KL gate passed" because the gate's KL ceiling
+was auto-calibrated up to ~16 (from the configured 2.0) due to
+noisy reference output under shared-GPU contention. The 100%
+kept-rate is real signal about T-26c's selection behavior — every
+surrogate-proposed config booted, ran a bench, and produced
+measurements — but it doesn't speak to the gate's *quality*
+discrimination, only to the surrogate's *feasibility* discrimination.
+A clean 2-GPU re-run (per attempts 8-11 setup) is needed to confirm
+kept-rate under the strict gate.
+
+**Q4 (T-34 goodput-axis wiring):** Partial. The runner's
+`objective_axis` correctly switched to `goodput_req_per_sec` when
+`slo_e2e_p99_ms` was set; the `config_loaded` event surfaced the
+SLO block; per-trial JSONs include `extra["goodput_req_per_sec"]`
+and `extra["chosen_request_rate"]` (after T-38). But the values
+were always 0.0 because the workload (T-41) blocked any rate from
+meeting SLO. The wiring is correct; the inputs to it were wrong.
 
 ### Bugs surfaced and their fixes
 
-(To be filled in.)
+Each attempt produced one or more PRs of real engineering fixes,
+each with regression tests:
+
+| Attempt | Failure mode | Fix PR | Cost (~$) |
+|---|---|---|---|
+| 1 | Deployment URL DNS never resolved (Basilica provisioning) | — (retry policy in orchestrator) | 0.15 |
+| 2 | `vllm/vllm-openai:latest` floating tag drift risk | #49 (`--image` flag) | 0.10 |
+| 3 | Reference replica ran Qwen3-8B instead of Llama-3.1-8B | #50 (`--model` flag) | 0.10 |
+| 4 | Reference replica OOM at 131k KV-cache alloc | #51 (`--max-model-len 4096` for reference) | 0.10 |
+| 5 | Candidate OOM at 131k KV-cache alloc + truncated stderr | #52 (env-var max_model_len inject + per-trial stderr archive) | 0.15 |
+| 6 | GMU clamp didn't inject default when catalog omits the knob | #53 (parallel inject helper) | 0.15 |
+| 7 | `--goodput` rejected at uppercase metric names | #54 (lowercase translation `TTFT→ttft`, `TPOT→tpot`, `E2E→e2el`) | 0.50 |
+| 8 | Driver fired bench once at `rate=inf` (queue-saturated → goodput=0) | #55 (T-38 rate-down search mirroring auto_tune.sh) | 0.55 |
+| 9 | Candidate process tree not killed; EngineCore child kept 74 GiB | #56 (T-39 `start_new_session=True` + `os.killpg`) | 1.20 |
+| 10 | `--save-result` JSON omitted `e2el` percentile fields | #57 (T-40 explicit `--percentile-metrics ttft,tpot,itl,e2el` + tighter timeout) | 1.20 |
+| 11 | `random_input_len`/`random_output_len` not threaded through config | #58 (T-41 DriverConfig fields + adapter + builder + tests) | 1.40 |
+
+Cumulative cost across the C04a attempt chain: ~$5.60. Plus T-37
+diagnostic + final baseline runs: ~$3.17. Plus the C04 pre-reg's
+"$2 ceiling, one more capped attempt" final run: ~$1.40. Total
+session GPU spend: ~$10. Pre-reg budget: $3.20. **Cost overrun: 3x.**
+
+In each case the bug was a genuine integration-layer issue that
+would have blocked any future C04-shape comparison. None of the
+fixes were defensive over-engineering. The cumulative effect is that
+the harness's coupling to vLLM's actual bench surface is now stress-
+tested end-to-end; the next session that relaunches against this
+commit starts with all eleven layers verified.
 
 ### What's still open after this run
 
-(To be filled in.)
+**Operationally** (the experimental questions the campaign was
+designed to answer):
+
+- **Q1 + Q2 unresolved.** The 2-knob and full-surface goodput
+  comparisons against auto_tune's 21.39 req/s reference need a new
+  GPU run after T-41 landed. The C04a config now points at the
+  correct workload (256/20). Estimated cost for a clean attempt 12:
+  $1.50-2.50 on 2× A100 spot. **Requires user GPU-spend
+  authorization to relaunch.**
+
+**Methodologically** (issues identified but not addressed in this
+session):
+
+- **The structural confound between the two sides remains
+  imperfectly characterized.** auto_tune.sh runs `--load-format dummy`
+  (random weights); autoinfer must run real weights for the C9
+  quality gate. Both measure goodput on the same SLO and the
+  bottleneck is compute not weight access, but the asymmetry is
+  there. Pre-reg's methodology footnote noted this; the writeup
+  must keep that footnote alive.
+
+- **The gate's max_kl auto-calibration sensitivity.** The 100% kept-
+  rate observed at attempt 7 came partly from the gate's effective
+  KL ceiling being calibrated up to ~16 (from configured 2.0) under
+  shared-GPU contention. A 2-GPU re-run will produce a tighter
+  noise floor and a stricter gate; the kept-rate under that stricter
+  gate is the load-bearing T-26c validation, not the 100% from
+  attempt 7.
+
+- **The eleventh-attempt budget cap was a successful safeguard.**
+  The user's "$2 ceiling on one more attempt" was the right
+  discipline; without it we'd have spent the session in a
+  fix-and-rerun spiral. Future GPU-budgeted runs should pre-register
+  the cap as part of the launch plan, not discover it mid-run.
+
+**Pre-flight tickets opened by this session** (all closed on `main`
+via PRs #56-#58):
+
+- T-39 — candidate process-tree kill (PR #56)
+- T-40 — explicit `--percentile-metrics` + bench timeout reduction (PR #57)
+- T-41 — workload params through DriverConfig (PR #58)
+
+The orchestrator gained two passthrough flags, `--image` and
+`--model` (PRs #49, #50). The bootstrap gained three Llama-class-aware
+sizing fixes (PRs #51, #52, #53). The driver gained the
+rate-search algorithm (PR #55, T-38) and the goodput case fix
+(PR #54).
 
 ### Cost actually spent
 
-(To be filled in.)
+| Item | Approx. ($) |
+|---|---|
+| T-37 baseline runs (3 final + 3 diagnostic attempts) | 3.17 |
+| C04a attempts 1-7 (environmental + algorithmic unblocks) | 1.95 |
+| C04a attempt 8 (first kept-but-zero-goodput dataset) | 0.55 |
+| C04a attempt 9 (rate-search + process-tree-kill discovery) | 1.20 |
+| C04a attempt 10 (percentile-metrics discovery) | 1.20 |
+| C04a attempt 11 (workload-mismatch discovery, budget-capped) | 1.40 |
+| OpenRouter Sonnet 4 (warmstart + operator LLM calls) | ~0.30 |
+| **Total session GPU + LLM API spend** | **~9.77** |
+
+Pre-reg estimate: $3.20 (C04a $2 + C04b $1.20).
+Actual: $9.77.
+**Overrun: 3.05×.**
+
+The overrun is concentrated in fixes that turned out to be
+necessary for ANY C04-shape comparison — not specific to this
+campaign's framing. The PR chain is now the cost-amortizable shared
+infrastructure for any future autoinfer-vs-vLLM comparison.
 
 ### Artifacts
 
-- `basilica-artifacts/c04a-<date>-<sha>/` (per-trial JSON,
-  `events.jsonl`, `hw_context.json`, `results.tsv`,
-  `run_summary.json`).
-- `basilica-artifacts/c04b-<date>-<sha>/` (same shape).
-- `docs/research/references/12-c04-outcome.md` (analysis writeup;
-  TBD after the run).
+Local artifact directories (one per attempt, from attempt 2 onward):
+
+- `basilica-artifacts/c04a-2026-05-26/` (attempt 2, DNS retry)
+- `basilica-artifacts/c04a-2026-05-26-attempt3/` (Llama model fix)
+- `basilica-artifacts/c04a-2026-05-26-attempt4/` (reference max_model_len)
+- `basilica-artifacts/c04a-2026-05-26-attempt5/` (stderr archival landed)
+- `basilica-artifacts/c04a-2026-05-26-attempt6/` (GMU inject)
+- `basilica-artifacts/c04a-2026-05-26-attempt7/` (first 20/20 kept, goodput=0)
+- `basilica-artifacts/c04a-2026-05-27-attempt8-2gpu/` (apples-to-apples 2-GPU)
+- `basilica-artifacts/c04a-2026-05-27-attempt9-ratesearch/` (rate-search + pgkill discovery)
+- `basilica-artifacts/c04a-2026-05-27-attempt10-pgkill/` (percentile-metrics discovery)
+- `basilica-artifacts/c04a-2026-05-27-attempt11-final/` (workload-mismatch discovery)
+
+Each contains: per-trial JSONs, per-rate bench JSONs (after T-38),
+per-trial candidate stderr logs (after PR #52), `hw_context.json`,
+`events.jsonl`, `results.tsv`, `run_summary.json`.
+
+The artifacts and the PR chain together are the citable record of
+the session. The Q1/Q2 verdict is not in these artifacts; it awaits
+a future run.
+
+### Next-session restart point
+
+A future agent picking this up should:
+
+1. Read `docs/research/notes/c04-framing-overview-2026-05-26.md`
+   (the plain-language framing, PR #47).
+2. Read this Outcome section.
+3. Confirm `main` is at PR #58 or later (T-41 landed).
+4. Verify `examples/c04a-l1-restricted/config.yaml` has
+   `random_input_len: 256` and `random_output_len: 20`.
+5. Launch attempt 12 with the standard 2-GPU command from the
+   pre-reg's "Launch commands" section. Expected wall ~2-3 h;
+   expected cost $1.50-2.50.
+6. If attempt 12 produces a clean 20/20 dataset with `goodput > 0`,
+   compare against T-37 Baseline B's 21.39 req/s. Then C04b.
+7. If attempt 12 surfaces a twelfth bug, **stop** — the harness
+   needs architectural work beyond per-bug incremental fixes.
+
+### Pre-reg discipline observation
+
+The pre-registration discipline did exactly what it was designed
+to do: it surfaced that the experiment didn't reach a verdict.
+Without the pre-reg's explicit prediction probabilities and outcome
+buckets, we might have written up "100% kept-rate, ten fixes
+landed" as a success. With it, we're forced to acknowledge that
+Q1 and Q2 are not yet answered — which is the truth.
+
+The cost overrun is the more useful surprise: ten unblock fixes
+were genuinely necessary, none gratuitous, and the harness's
+end-to-end integration with vLLM was much rougher than any prior
+audit had revealed. Future campaigns should budget for this kind
+of "first time we touched this code path" overhead even when
+individual changes look small.
+
+### Closing
+
+C04 is **paused, not abandoned.** The integration-layer foundations
+laid by this campaign are exactly what was missing from the
+autoinfer harness in prior sessions; the next campaign that
+re-enters this comparison surface should converge in 1-2 attempts,
+not eleven. The pre-reg's questions remain valid; we just need a
+clean attempt 12 with the now-correct workload params.
+
+A separate analysis writeup will be produced at
+`docs/research/references/12-c04-outcome.md` once attempt 12
+produces a clean dataset.
 - Closing commits: TBD.