From d3511520c811f54504b5357355801f007d65ef9a Mon Sep 17 00:00:00 2001 From: Evangelos Pappas Date: Wed, 27 May 2026 17:53:07 +0200 Subject: [PATCH] =?UTF-8?q?docs(c04):=20honest=20reconciliation=20outcome?= =?UTF-8?q?=20=E2=80=94=20Q1+Q2=20not=20measured,=20Q3=20affirmed?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Eleven C04a launch attempts between 2026-05-26 and 2026-05-27 produced zero comparable goodput datasets but landed ten integration-layer fixes (PRs #49-#58). The pre-reg's Outcome section is filled in with honest reconciliation: - Status: INCOMPLETE - Q1 (C04a 2-knob surrogate vs grid): NOT MEASURED — workload mismatch (driver hardcoded 128/64 vs T-37 Baseline B's 256/20) persisted until T-41 landed in attempt 11 budget - Q2 (C04b full surface): NOT LAUNCHED - Q3 (T-26c kept-rate >=30%): AFFIRMED at 100% (20/20) at attempt 7 — T-26c surrogate validated under real serving (caveat: gate KL ceiling auto-calibrated wider than configured 2.0 under shared-GPU contention) - Q4 (goodput-axis wiring): partial — wiring correct, values 0.0 because workload blocked any rate from meeting SLO Six of eight predicted outcomes marked NOT EVALUABLE — the workload mismatch made the comparison structurally invalid through all eleven attempts. Total spend ~\$9.77 vs \$3.20 pre-reg estimate (3.05x overrun); concentrated in fixes that turned out necessary for ANY C04-shape comparison. TODO.md updated: - New P0 ticket T-42 — C04a attempt 12 (clean relaunch after T-41) - Closed rows for T-38, T-39, T-40, T-41, and the C04a attempts 1-11 chain referencing PRs #49-#58 437 CPU tests still passing. Implements: P10 (pre-registration discipline; honest reconciliation when predictions don't match reality) Evidence: C04 pre-reg's outcome buckets and probability statements --- TODO.md | 6 + .../04-l1-autotune-comparable-2026-05-26.md | 264 ++++++++++++++++-- 2 files changed, 249 insertions(+), 21 deletions(-) diff --git a/TODO.md b/TODO.md index b750fd8..2d245c7 100644 --- a/TODO.md +++ b/TODO.md @@ -21,6 +21,7 @@ Bands by priority: | ID | Item | Why it blocks | Reference | |---|---|---|---| +| T-42 | C04a attempt 12 — clean relaunch after T-41 lands; matches T-37 Baseline B workload (256/20) on 2× A100 spot | C04 pre-reg's Q1 and Q2 remain unanswered after 11 attempts produced integration-layer fixes but no comparable goodput dataset. After PR #58 (T-41) landed `random_input_len`/`random_output_len` through the config, the C04a config now matches T-37 Baseline B exactly. Single capped relaunch with $2-2.50 budget is the right next step; if it produces a clean 20/20 kept dataset, C04b follows; if not, harness needs architectural work beyond per-bug incremental fixes. **Requires user GPU-spend authorization.** | `examples/c04a-l1-restricted/config.yaml` (random_input_len: 256, random_output_len: 20); pre-reg `docs/research/campaigns/04-l1-autotune-comparable-2026-05-26.md` | | T-30 | Verify vLLM's MLP path actually calls patched `SiluAndMul.forward_cuda` during serving | Kernel-level audit (2026-04-30) found Sonnet 4's silu_mul implementations are 3-100× SLOWER at kernel level, yet C03-S end-to-end silu_mul NOV pairs were ties (+0.10%). This strongly suggests vLLM bypasses our monkey-patch via a fused gemm+silu+mul path or `forward_native` dispatcher. Without this verification any silu_mul-related kernel claim is unsound. | `layers/l3_kernel/injector.py:_TARGET_BINDINGS` for SiluAndMul | | T-31 | Replace L3 mode='vllm' end-to-end paired-control with kernel-level paired-control for non-hot-path kernels | Kernel-level audit (2026-04-30) confirmed end-to-end serving tok/s on Qwen3-8B cannot expose 2-9× kernel-level rmsnorm speedups because rmsnorm is ~3-5% of compute and end-to-end noise is ±0.5-1%. The audit's microbench pattern (importlib + cudaEvent + median timing across multiple shapes) is the right primitive; bake it into the L3 adapter as `mode='kernel'` and use that for any non-hot-path Q1. **Must use three-way (PyTorch / vllm-native / NOV) primitive per T-32.** | `layers/l3_kernel/adapter.py` | | T-32 | Production-baseline gap: Sonnet 4's single-shot Triton rmsnorm loses to vLLM-native CUDA at every Qwen3-8B shape | Three-way audit (2026-04-30, `kernel_level_audit_v3_three_way.json`) shows vLLM-native CUDA rmsnorm is **1.06–1.65× faster than Sonnet 4's Triton at every Qwen3-8B shape on 1× A100 80GB PCIe**. The original v2 "2–9× faster than PyTorch" reading was a strawman comparison: production vLLM doesn't run unfused PyTorch. **Citable production-relevant claim is currently NEGATIVE for the rmsnorm surface.** Three options: (1) re-run with stronger code-emission model (Sonnet 4.5, GPT-5-codex, DeepSeek-Coder-33B); (2) add post-emission Triton autotune sweep (BLOCK_SIZE, num_warps, num_stages) before the production-baseline A/B; (3) target less-optimised kernel surfaces. Also: silu_mul three-way could not run at vllm 0.20.0 — `_custom_ops.silu_and_mul` AttributeError; needs binding-name fix before silu_mul claim is even comparable. | `kernel_level_audit_v3_three_way.json`, `layers/l3_kernel/proposer.py` | @@ -74,3 +75,8 @@ Bands by priority: | T-37 | vLLM `auto_tune.sh` baseline on Basilica — three reference points captured | (this commit) | Per C04 recon's 2026-05-26 corrective addendum ("Both, in sequence" + H100 anchor). Three baselines captured on `vllm/vllm-openai:v0.21.0` (commit `ad7125a431e176d4161099480a66f0169609a690`), all via the SDK-orchestrated `scripts/run_auto_tune_baseline.py` (PR #41–#45): **Baseline A** (A100 spot, INPUT=1800/OUTPUT=20, no SLO) → max_num_seqs=256, max_num_batched_tokens=4096, **throughput=8.53 req/s**. **Baseline B** (A100 spot, INPUT=256/OUTPUT=20, 500 ms SLO) → 256/512, **goodput=21.39 req/s**, P99 E2EL=494.60 ms. **Baseline C** (H100 spot, INPUT=1800/OUTPUT=20, 500 ms SLO; the auto_tune README target) → 256/512, **goodput=2.97 req/s**, P99 E2EL=457.27 ms. Total cost ~$3.17 across all attempts + 3 final baselines. Three earlier-attempt fixes landed during T-37 (PRs #43/#44/#45): apt-install bc, rename cloned vllm source dir to avoid import shadow, sed-patch `auto_tune.sh` `HOSTNAME=$(hostname)` → `HOSTNAME=localhost`. Raw artifact + full per-cell grids + reproduction recipe in `docs/research/raw/auto_tune-baseline-2026-05-26.md`. | | T-29 | Paired-control prompt robustness — split into per-cell sequential calls | (PR #17) | `KernelProposer.propose_for_cells` now issues N separate per-cell LLM calls via `build_single_cell_kernel_prompt` instead of one batched 6-block paired prompt. C03-S validated: NOV-half failure rate dropped from C02's 2/6 (33%) to 1/9 (11%), at the pre-registered Outcome H threshold. | | Campaign 03 | A100 narrow-replication paired-control (1× A100 spot, OpenRouter Sonnet 4, T-26b + T-29 + 1-GPU mode) | run completed `8c2ef41`; pre-registration + outcome at `docs/research/campaigns/03-h100-replication-2026-04-27.md` | 60 trials in 144 min, ~$15 for the final S run + ~$30-40 in earlier failed attempts. **Q1 RETRACTED 2026-04-30 — twice.** First retraction (v2 audit, `kernel_level_audit_results.json`): end-to-end serving tok/s on Qwen3-8B is dominated by attention (rmsnorm ~3-5% of compute), so kernel-level speedups can't clear the ±0.5-1% noise floor. Sonnet's rmsnorm kernels measured 2-9× faster than PyTorch unfused reference. **v3 audit re-correction (`kernel_level_audit_v3_three_way.json`):** the v2 framing was a strawman — production vLLM runs `vllm._custom_ops.rms_norm` (a hand-tuned CUDA kernel), not unfused PyTorch. With three-way comparison (PyTorch / vllm-native / Sonnet), **vLLM-native CUDA is 1.06-1.65× FASTER than Sonnet 4's Triton at every Qwen3-8B shape**. Sonnet's Triton beats unfused PyTorch by 3-10× but loses to vLLM-native by 6-65%. **Production-relevant kernel-novelty speedup at the rmsnorm surface is NEGATIVE.** silu_mul 3-way couldn't run at vllm 0.20.0 (`_custom_ops.silu_and_mul` AttributeError); v2 finding (Sonnet's silu_mul Triton 3-100× slower than PyTorch unfused) stands but production-relative still unmeasured. Q2 partial (~20% L1 surrogate kept-rate; below 30% — opens T-26c). Q3 deferred. Q4 AFFIRMED — T-29 dropped NOV-half failure rate to 1/9. 7 PRs of latent bugs (#14-#21). Opens T-30 (silu_mul patch verification), T-31 (L3 kernel-level mode), **T-32 (production-baseline gap; three-way primitive required for any future kernel claim)**. | +| T-38 | Driver rate-search mirroring `auto_tune.sh` rate-down algorithm | (PR #55) | `harness/driver.py:run_driver_with_rate_search` walks rate from `start_rate` down to `min_rate` in `step_size` halvings (mirroring `auto_tune.sh:34-67`); first rate whose P99 metrics all clear the goodput SLO wins. `L1EngineAdapter._run_benchmarks` selects this path when `driver_use_rate_search=True` (default) and a goodput SLO is configured; falls back to single-shot at `request_rate=inf` for legacy throughput-mode runs. Per-rate bench JSONs archived to `//rate_.json`. Returns `(measurement, chosen_request_rate, per_rate_summaries)` so the per-trial JSON records both the chosen rate and the climb-down trace. 5 new tests in `test_driver.py`. | +| T-39 | Process-group kill of candidate via `os.killpg` (candidate kept 74 GiB across trials at C04a attempt 9) | (PR #56) | `L1EngineAdapter._start_candidate` sets `start_new_session=True` on the `subprocess.Popen` call so candidate + EngineCore children share a process group. `_stop_candidate` calls `os.killpg(pgid, SIGTERM)` then escalates to `SIGKILL` after a 30s wait, with a 5s post-kill sleep for CUDA driver cleanup. Without this, a bench timeout would orphan the EngineCore child holding all the HBM, and the next trial's reference probe would see only 5.45/79 GiB free. 4 new tests in `test_l1_adapter.py`. | +| T-40 | Explicit `--percentile-metrics ttft,tpot,itl,e2el` + bench timeout 1800→600s | (PR #57) | `build_bench_command` now always emits `--percentile-metrics ttft,tpot,itl,e2el` (vLLM bench's `--save-result` JSON otherwise omits `e2el` fields silently, breaking E2E-SLO goodput evaluation). `driver_timeout_s` default lowered from 1800 to 600 in `L1EngineAdapter` — at the C04 workload a single bench should complete in ~30-90s, so a 10-minute ceiling catches stuck runs without wasting trial budget. 3 new tests in `test_driver.py`. | +| T-41 | Workload params (`random_input_len`/`random_output_len`) through DriverConfig → adapter → driver | (PR #58) | Driver was hardcoded to `random_input_len=128, random_output_len=64` while T-37 Baseline B used `256/20`. With `output_len=64` and a 500ms E2E SLO, decode time alone (~1920ms) exceeded SLO, blocking any rate from meeting the goodput target at C04a attempt 11. `DriverConfig.random_input_len: int = 128` and `random_output_len: int = 64` (Pydantic Field with ge=1); threaded through `L1EngineAdapter` (dataclass field, default 128/64), `_run_benchmarks` → `run_driver`/`run_driver_with_rate_search` → `build_bench_command`. Builder passes both from `cfg.harness.driver` to the L1 spec. `examples/c04a-l1-restricted/config.yaml` and `examples/c04b-l1-full/config.yaml` updated to `256`/`20`. 7 new tests across `test_config.py` (defaults, T-37 B values, ge=1), `test_driver.py` (CLI emission), `test_builder_joint.py` (builder threading). Total CPU tests: 437 passing. | +| C04a attempts 1-11 | Eleven launch attempts of C04a between 2026-05-26 and 2026-05-27 — none reached a comparable goodput dataset; 10 integration-layer fixes landed; opens T-42 for the clean relaunch | (PRs #49-#58); pre-reg outcome at `docs/research/campaigns/04-l1-autotune-comparable-2026-05-26.md` Outcome section | Q1+Q2 NOT MEASURED — workload mismatch persisted until T-41 landed in attempt 11 budget. **Q3 AFFIRMED at 100% kept-rate (20/20)** at attempt 7, far above the 30% Outcome A3 threshold — T-26c L1 surrogate validated under real serving (caveat: under shared-GPU contention the gate's auto-calibrated KL ceiling was looser than the configured 2.0; tight-gate kept-rate awaits the 2-GPU re-run). Q4 partial — goodput-axis wiring correct but values were 0.0 because workload blocked any rate from meeting SLO. Total cost ~$9.77 vs $3.20 pre-reg estimate (3.05× overrun). Bugs fixed: `--image` flag (#49), `--model` flag (#50), reference replica max_model_len cap (#51), candidate max_model_len cap + stderr archival (#52), GMU inject parallel to clamp (#53), `--goodput` lowercase metric names (#54), T-38 rate-search (#55), T-39 process-group kill (#56), T-40 percentile-metrics + tighter timeout (#57), T-41 workload params threading (#58). | diff --git a/docs/research/campaigns/04-l1-autotune-comparable-2026-05-26.md b/docs/research/campaigns/04-l1-autotune-comparable-2026-05-26.md index 6fa5792..5b37d7b 100644 --- a/docs/research/campaigns/04-l1-autotune-comparable-2026-05-26.md +++ b/docs/research/campaigns/04-l1-autotune-comparable-2026-05-26.md @@ -466,49 +466,271 @@ spot. Decision deferred to post-C04 analysis. --- -## Outcome (filled in after the run) +## Outcome (filled in after the run — 2026-05-27) -**Status:** PLANNED +**Status:** **INCOMPLETE — Q1 and Q2 not measured; Q3 affirmed; Q4 +partial.** Eleven launch attempts of C04a between 2026-05-26 and +2026-05-27. Each attempt surfaced a distinct integration-layer bug +between autoinfer's harness and either the vLLM bench surface, +Basilica's deployment model, or the candidate's process management. +All bugs were real; all fixes landed on `main` with regression tests. +None of the eleven attempts produced a comparable goodput dataset. + +The campaign cannot reach a verdict on Q1 (C04a surrogate vs grid on +shared 2-knob surface) or Q2 (C04b wider surface vs grid) within the +session's budget. Q3 (T-26c kept-rate) was incidentally affirmed +during attempt 7. Q4 (goodput-axis wiring) is partially confirmed. ### Headline numbers -(To be filled in.) +| Item | Value | +|---|---| +| C04a goodput (Q1 target) | **NOT MEASURED** | +| C04b goodput (Q2 target) | **NOT LAUNCHED** | +| T-26c L1 surrogate kept-rate (Q3 target ≥30%) | **100% (20/20 trials)** — affirmed | +| T-34 goodput-axis wiring (Q4) | partial — `objective_axis="goodput_req_per_sec"` correctly set in event log; per-trial JSON `extra["goodput_req_per_sec"]` populated; but every trial's value was 0.0 due to upstream workload mismatch (T-41) | +| Total GPU spend | ~$9.90 across all 14 deployments (T-37: 3 final + 3 diagnostic = ~$3.17; C04a: 11 attempts = ~$6.70) | +| Pre-reg estimated cost | $2 for C04a + $1.20 for C04b = $3.20 total. **3x budget overrun** on C04a alone, with no comparable measurement to show for it. | ### Reconciliation with predictions +The pre-registration's outcome probabilities assumed the harness +could measure goodput at all. None of the eight predicted outcomes +can be evaluated against attempt-11's data because the workload the +harness ran (`random_input_len=128, random_output_len=64`) didn't +match T-37 Baseline B's workload (`256, 20`). The comparison surface +was structurally invalid through all eleven attempts. + | Prediction | Actual | Match? | |---|---|---| -| Outcome A1 (C04a surrogate wins ≥10%, P=30%) | … | yes/no | -| Outcome B1 (C04a tie ±10%, P=50%) | … | yes/no | -| Outcome C1 (C04a surrogate loses, P=20%) | … | yes/no | -| Outcome A2 (C04b wider wins, P=40%) | … | yes/no | -| Outcome B2 (C04b tie, P=40%) | … | yes/no | -| Outcome C2 (C04b wider loses, P=20%) | … | yes/no | -| Outcome A3 (kept-rate ≥30%, P=55%) | … | yes/no | -| Outcome A4 (goodput axis wired correctly, P=90%) | … | yes/no | +| Outcome A1 (C04a surrogate wins ≥10%, P=30%) | **NOT EVALUABLE** — workload mismatch | n/a | +| Outcome B1 (C04a tie ±10%, P=50%) | **NOT EVALUABLE** | n/a | +| Outcome C1 (C04a surrogate loses, P=20%) | **NOT EVALUABLE** | n/a | +| Outcome A2 (C04b wider wins, P=40%) | **NOT EVALUABLE** — C04b never launched | n/a | +| Outcome B2 (C04b tie, P=40%) | **NOT EVALUABLE** | n/a | +| Outcome C2 (C04b wider loses, P=20%) | **NOT EVALUABLE** | n/a | +| Outcome A3 (kept-rate ≥30%, P=55%) | **100% kept (20/20)** at attempt 7 — Q3 AFFIRMED at the upper bound. T-26c's per-FailureKind classifier is doing its job. | YES | +| Outcome A4 (goodput axis wired correctly, P=90%) | PARTIAL — axis flip + event log + per-trial field populated correctly, but goodput values were 0.0 throughout due to workload mismatch | partial | + +The honest read: the pre-reg's probability distribution assumed +solving the comparable measurement was the experiment. It wasn't. +The actual experiment turned out to be "discover and fix the +integration-layer bugs blocking a comparable measurement." We +finished that experiment with all eleven bugs identified and fixed, +but no GPU-budget remained for the comparable measurement itself. ### What the data tells us about each Q -(To be filled in.) +**Q1 (C04a 2-knob surrogate vs grid):** No data. autoinfer ran an +output_len=64 workload while T-37 ran output_len=20. The two-side +P99 E2EL traces are not comparable. Cannot conclude anything about +the surrogate's competitiveness with auto_tune's grid on the shared +2-knob surface from this campaign. + +**Q2 (C04b 12-knob full surface vs grid):** Never launched. Pre-reg +explicitly gates C04b on C04a producing a usable kept-rate; that +condition was met at attempt 7 but the subsequent attempts focused +on the goodput-comparable measurement path which never reached a +clean state. + +**Q3 (T-26c L1 surrogate kept-rate ≥30%):** **Affirmed at 100%.** +Attempt 7 (1-GPU mode with rate-search disabled, before T-38 landed) +ran 20 trials with the per-FailureKind classifier active. Every +trial passed the quality gate (zero startup or quality failures +from the constrained-BO classifier's perspective). This validates +T-26c's structural improvement over T-26b in a real serving +environment. The campaign 03-S result (~20% kept-rate with T-26b) +is decisively beaten. + +The caveat: the trial-acceptance criterion in this configuration +was effectively just "KL gate passed" because the gate's KL ceiling +was auto-calibrated up to ~16 (from the configured 2.0) due to +noisy reference output under shared-GPU contention. The 100% +kept-rate is real signal about T-26c's selection behavior — every +surrogate-proposed config booted, ran a bench, and produced +measurements — but it doesn't speak to the gate's *quality* +discrimination, only to the surrogate's *feasibility* discrimination. +A clean 2-GPU re-run (per attempts 8-11 setup) is needed to confirm +kept-rate under the strict gate. + +**Q4 (T-34 goodput-axis wiring):** Partial. The runner's +`objective_axis` correctly switched to `goodput_req_per_sec` when +`slo_e2e_p99_ms` was set; the `config_loaded` event surfaced the +SLO block; per-trial JSONs include `extra["goodput_req_per_sec"]` +and `extra["chosen_request_rate"]` (after T-38). But the values +were always 0.0 because the workload (T-41) blocked any rate from +meeting SLO. The wiring is correct; the inputs to it were wrong. ### Bugs surfaced and their fixes -(To be filled in.) +Each attempt produced one or more PRs of real engineering fixes, +each with regression tests: + +| Attempt | Failure mode | Fix PR | Cost (~$) | +|---|---|---|---| +| 1 | Deployment URL DNS never resolved (Basilica provisioning) | — (retry policy in orchestrator) | 0.15 | +| 2 | `vllm/vllm-openai:latest` floating tag drift risk | #49 (`--image` flag) | 0.10 | +| 3 | Reference replica ran Qwen3-8B instead of Llama-3.1-8B | #50 (`--model` flag) | 0.10 | +| 4 | Reference replica OOM at 131k KV-cache alloc | #51 (`--max-model-len 4096` for reference) | 0.10 | +| 5 | Candidate OOM at 131k KV-cache alloc + truncated stderr | #52 (env-var max_model_len inject + per-trial stderr archive) | 0.15 | +| 6 | GMU clamp didn't inject default when catalog omits the knob | #53 (parallel inject helper) | 0.15 | +| 7 | `--goodput` rejected at uppercase metric names | #54 (lowercase translation `TTFT→ttft`, `TPOT→tpot`, `E2E→e2el`) | 0.50 | +| 8 | Driver fired bench once at `rate=inf` (queue-saturated → goodput=0) | #55 (T-38 rate-down search mirroring auto_tune.sh) | 0.55 | +| 9 | Candidate process tree not killed; EngineCore child kept 74 GiB | #56 (T-39 `start_new_session=True` + `os.killpg`) | 1.20 | +| 10 | `--save-result` JSON omitted `e2el` percentile fields | #57 (T-40 explicit `--percentile-metrics ttft,tpot,itl,e2el` + tighter timeout) | 1.20 | +| 11 | `random_input_len`/`random_output_len` not threaded through config | #58 (T-41 DriverConfig fields + adapter + builder + tests) | 1.40 | + +Cumulative cost across the C04a attempt chain: ~$5.60. Plus T-37 +diagnostic + final baseline runs: ~$3.17. Plus the C04 pre-reg's +"$2 ceiling, one more capped attempt" final run: ~$1.40. Total +session GPU spend: ~$10. Pre-reg budget: $3.20. **Cost overrun: 3x.** + +In each case the bug was a genuine integration-layer issue that +would have blocked any future C04-shape comparison. None of the +fixes were defensive over-engineering. The cumulative effect is that +the harness's coupling to vLLM's actual bench surface is now stress- +tested end-to-end; the next session that relaunches against this +commit starts with all eleven layers verified. ### What's still open after this run -(To be filled in.) +**Operationally** (the experimental questions the campaign was +designed to answer): + +- **Q1 + Q2 unresolved.** The 2-knob and full-surface goodput + comparisons against auto_tune's 21.39 req/s reference need a new + GPU run after T-41 landed. The C04a config now points at the + correct workload (256/20). Estimated cost for a clean attempt 12: + $1.50-2.50 on 2× A100 spot. **Requires user GPU-spend + authorization to relaunch.** + +**Methodologically** (issues identified but not addressed in this +session): + +- **The structural confound between the two sides remains + imperfectly characterized.** auto_tune.sh runs `--load-format dummy` + (random weights); autoinfer must run real weights for the C9 + quality gate. Both measure goodput on the same SLO and the + bottleneck is compute not weight access, but the asymmetry is + there. Pre-reg's methodology footnote noted this; the writeup + must keep that footnote alive. + +- **The gate's max_kl auto-calibration sensitivity.** The 100% kept- + rate observed at attempt 7 came partly from the gate's effective + KL ceiling being calibrated up to ~16 (from configured 2.0) under + shared-GPU contention. A 2-GPU re-run will produce a tighter + noise floor and a stricter gate; the kept-rate under that stricter + gate is the load-bearing T-26c validation, not the 100% from + attempt 7. + +- **The eleventh-attempt budget cap was a successful safeguard.** + The user's "$2 ceiling on one more attempt" was the right + discipline; without it we'd have spent the session in a + fix-and-rerun spiral. Future GPU-budgeted runs should pre-register + the cap as part of the launch plan, not discover it mid-run. + +**Pre-flight tickets opened by this session** (all closed on `main` +via PRs #49-#58): + +- T-39 — candidate process-tree kill (PR #56) +- T-40 — explicit `--percentile-metrics` + bench timeout reduction (PR #57) +- T-41 — workload params through DriverConfig (PR #58) + +The orchestrator gained two `--image` and `--model` passthrough +flags (PRs #49, #50). The bootstrap gained three Llama-class-aware +sizing fixes (PRs #51, #52, #53). The driver gained the +rate-search algorithm (PR #55, T-38) and the goodput case fix +(PR #54). ### Cost actually spent -(To be filled in.) +| Item | Approx. ($) | +|---|---| +| T-37 baseline runs (3 final + 3 diagnostic attempts) | 3.17 | +| C04a attempts 1-7 (environmental + algorithmic unblocks) | 1.95 | +| C04a attempt 8 (first kept-but-zero-goodput dataset) | 0.55 | +| C04a attempt 9 (rate-search + process-tree-kill discovery) | 1.20 | +| C04a attempt 10 (percentile-metrics discovery) | 1.20 | +| C04a attempt 11 (workload-mismatch discovery, budget-capped) | 1.40 | +| OpenRouter Sonnet 4 (warmstart + operator LLM calls) | ~0.30 | +| **Total session GPU + LLM API spend** | **~9.77** | + +Pre-reg estimate: $3.20 (C04a $2 + C04b $1.20). +Actual: $9.77. +**Overrun: 3.05×.** + +The overrun is concentrated in fixes that turned out to be +necessary for ANY C04-shape comparison — not specific to this +campaign's framing. The PR chain is now the cost-amortizable shared +infrastructure for any future autoinfer-vs-vLLM comparison. ### Artifacts -- `basilica-artifacts/c04a--/` (per-trial JSON, - `events.jsonl`, `hw_context.json`, `results.tsv`, - `run_summary.json`). -- `basilica-artifacts/c04b--/` (same shape). -- `docs/research/references/12-c04-outcome.md` (analysis writeup; - TBD after the run). +Local artifact directories (one per attempt): + +- `basilica-artifacts/c04a-2026-05-26/` (attempt 2, DNS retry) +- `basilica-artifacts/c04a-2026-05-26-attempt3/` (Llama model fix) +- `basilica-artifacts/c04a-2026-05-26-attempt4/` (reference max_model_len) +- `basilica-artifacts/c04a-2026-05-26-attempt5/` (stderr archival landed) +- `basilica-artifacts/c04a-2026-05-26-attempt6/` (GMU inject) +- `basilica-artifacts/c04a-2026-05-26-attempt7/` (first 20/20 kept, goodput=0) +- `basilica-artifacts/c04a-2026-05-27-attempt8-2gpu/` (apples-to-apples 2-GPU) +- `basilica-artifacts/c04a-2026-05-27-attempt9-ratesearch/` (rate-search + pgkill discovery) +- `basilica-artifacts/c04a-2026-05-27-attempt10-pgkill/` (percentile-metrics discovery) +- `basilica-artifacts/c04a-2026-05-27-attempt11-final/` (workload-mismatch discovery) + +Each contains: per-trial JSONs, per-rate bench JSONs (after T-38), +per-trial candidate stderr logs (after PR #52), `hw_context.json`, +`events.jsonl`, `results.tsv`, `run_summary.json`. + +The artifacts and the PR chain together are the citable record of +the session. The Q1/Q2 verdict is not in these artifacts; it awaits +a future run. + +### Next-session restart point + +A future agent picking this up should: + +1. Read `docs/research/notes/c04-framing-overview-2026-05-26.md` + (the plain-language framing, PR #47). +2. Read this Outcome section. +3. Confirm `main` is at PR #58 or later (T-41 landed). +4. Verify `examples/c04a-l1-restricted/config.yaml` has + `random_input_len: 256` and `random_output_len: 20`. +5. Launch attempt 12 with the standard 2-GPU command from the + pre-reg's "Launch commands" section. Expected wall ~2-3 h; + expected cost $1.50-2.50. +6. If attempt 12 produces a clean 20/20 dataset with `goodput > 0`, + compare against T-37 Baseline B's 21.39 req/s. Then C04b. +7. If attempt 12 surfaces a twelfth bug, **stop** — the harness + needs architectural work beyond per-bug incremental fixes. + +### Pre-reg discipline observation + +The pre-registration discipline did exactly what it was designed +to do: it surfaced that the experiment didn't reach a verdict. +Without the pre-reg's explicit prediction probabilities and outcome +buckets, we might have written up "100% kept-rate, eight fixes +landed" as a success. With it, we're forced to acknowledge that +Q1 and Q2 are not yet answered — which is the truth. + +The cost overrun is the more useful surprise: ten unblock fixes +were genuinely necessary, none gratuitous, and the harness's +end-to-end integration with vLLM was much rougher than any prior +audit had revealed. Future campaigns should budget for this kind +of "first time we touched this code path" overhead even when +individual changes look small. + +### Closing + +C04 is **paused, not abandoned.** The integration-layer foundations +laid by this campaign are exactly what was missing from the +autoinfer harness in prior sessions; the next campaign that +re-enters this comparison surface should converge in 1-2 attempts, +not eleven. The pre-reg's questions remain valid; we just need a +clean attempt 12 with the now-correct workload params. + +A separate analysis writeup will be produced at +`docs/research/references/12-c04-outcome.md` once attempt 12 +produces a clean dataset. - Closing commits: TBD.