epappas · epappas · May 27, 2026
diff --git a/TODO.md b/TODO.md
@@ -21,6 +21,7 @@ Bands by priority:
 
 | ID | Item | Why it blocks | Reference |
 |---|---|---|---|
+| T-42 | C04a attempt 12 — clean relaunch after T-41 lands; matches T-37 Baseline B workload (256/20) on 2× A100 spot | C04 pre-reg's Q1 and Q2 remain unanswered: 11 attempts produced integration-layer fixes but no comparable goodput dataset. After PR #58 (T-41) threaded `random_input_len`/`random_output_len` through the config, the C04a config now matches T-37 Baseline B exactly. A single capped relaunch with a $2-2.50 budget is the right next step: if it produces a clean 20/20 kept dataset, C04b follows; if not, the harness needs architectural work beyond per-bug incremental fixes. **Requires user GPU-spend authorization.** | `examples/c04a-l1-restricted/config.yaml` (random_input_len: 256, random_output_len: 20); pre-reg `docs/research/campaigns/04-l1-autotune-comparable-2026-05-26.md` |
 | T-30 | Verify vLLM's MLP path actually calls patched `SiluAndMul.forward_cuda` during serving | Kernel-level audit (2026-04-30) found Sonnet 4's silu_mul implementations are 3-100× SLOWER at kernel level, yet C03-S end-to-end silu_mul NOV pairs were ties (+0.10%). This strongly suggests vLLM bypasses our monkey-patch via a fused gemm+silu+mul path or `forward_native` dispatcher. Without this verification any silu_mul-related kernel claim is unsound. | `layers/l3_kernel/injector.py:_TARGET_BINDINGS` for SiluAndMul |
 | T-31 | Replace L3 mode='vllm' end-to-end paired-control with kernel-level paired-control for non-hot-path kernels | Kernel-level audit (2026-04-30) confirmed end-to-end serving tok/s on Qwen3-8B cannot expose 2-9× kernel-level rmsnorm speedups because rmsnorm is ~3-5% of compute and end-to-end noise is ±0.5-1%. The audit's microbench pattern (importlib + cudaEvent + median timing across multiple shapes) is the right primitive; bake it into the L3 adapter as `mode='kernel'` and use that for any non-hot-path Q1. **Must use three-way (PyTorch / vllm-native / NOV) primitive per T-32.** | `layers/l3_kernel/adapter.py` |
 | T-32 | Production-baseline gap: Sonnet 4's single-shot Triton rmsnorm loses to vLLM-native CUDA at every Qwen3-8B shape | Three-way audit (2026-04-30, `kernel_level_audit_v3_three_way.json`) shows vLLM-native CUDA rmsnorm is **1.06–1.65× faster than Sonnet 4's Triton at every Qwen3-8B shape on 1× A100 80GB PCIe**. The original v2 "2–9× faster than PyTorch" reading was a strawman comparison: production vLLM doesn't run unfused PyTorch. **Citable production-relevant claim is currently NEGATIVE for the rmsnorm surface.** Three options: (1) re-run with stronger code-emission model (Sonnet 4.5, GPT-5-codex, DeepSeek-Coder-33B); (2) add post-emission Triton autotune sweep (BLOCK_SIZE, num_warps, num_stages) before the production-baseline A/B; (3) target less-optimised kernel surfaces. Also: silu_mul three-way could not run at vllm 0.20.0 — `_custom_ops.silu_and_mul` AttributeError; needs binding-name fix before silu_mul claim is even comparable. | `kernel_level_audit_v3_three_way.json`, `layers/l3_kernel/proposer.py` |
@@ -74,3 +75,8 @@ Bands by priority:
 | T-37 | vLLM `auto_tune.sh` baseline on Basilica — three reference points captured | (this commit) | Per C04 recon's 2026-05-26 corrective addendum ("Both, in sequence" + H100 anchor). Three baselines captured on `vllm/vllm-openai:v0.21.0` (commit `ad7125a431e176d4161099480a66f0169609a690`), all via the SDK-orchestrated `scripts/run_auto_tune_baseline.py` (PR #41–#45): **Baseline A** (A100 spot, INPUT=1800/OUTPUT=20, no SLO) → max_num_seqs=256, max_num_batched_tokens=4096, **throughput=8.53 req/s**. **Baseline B** (A100 spot, INPUT=256/OUTPUT=20, 500 ms SLO) → 256/512, **goodput=21.39 req/s**, P99 E2EL=494.60 ms. **Baseline C** (H100 spot, INPUT=1800/OUTPUT=20, 500 ms SLO; the auto_tune README target) → 256/512, **goodput=2.97 req/s**, P99 E2EL=457.27 ms. Total cost ~$3.17 across all attempts + 3 final baselines. Three earlier-attempt fixes landed during T-37 (PRs #43/#44/#45): apt-install bc, rename cloned vllm source dir to avoid import shadow, sed-patch `auto_tune.sh` `HOSTNAME=$(hostname)` → `HOSTNAME=localhost`. Raw artifact + full per-cell grids + reproduction recipe in `docs/research/raw/auto_tune-baseline-2026-05-26.md`. |
 | T-29 | Paired-control prompt robustness — split into per-cell sequential calls | (PR #17) | `KernelProposer.propose_for_cells` now issues N separate per-cell LLM calls via `build_single_cell_kernel_prompt` instead of one batched 6-block paired prompt. C03-S validated: NOV-half failure rate dropped from C02's 2/6 (33%) to 1/9 (11%), at the pre-registered Outcome H threshold. |
 | Campaign 03 | A100 narrow-replication paired-control (1× A100 spot, OpenRouter Sonnet 4, T-26b + T-29 + 1-GPU mode) | run completed `8c2ef41`; pre-registration + outcome at `docs/research/campaigns/03-h100-replication-2026-04-27.md` | 60 trials in 144 min, ~$15 for the final S run + ~$30-40 in earlier failed attempts. **Q1 RETRACTED 2026-04-30 — twice.** First retraction (v2 audit, `kernel_level_audit_results.json`): end-to-end serving tok/s on Qwen3-8B is dominated by attention (rmsnorm ~3-5% of compute), so kernel-level speedups can't clear the ±0.5-1% noise floor. Sonnet's rmsnorm kernels measured 2-9× faster than PyTorch unfused reference. **v3 audit re-correction (`kernel_level_audit_v3_three_way.json`):** the v2 framing was a strawman — production vLLM runs `vllm._custom_ops.rms_norm` (a hand-tuned CUDA kernel), not unfused PyTorch. With three-way comparison (PyTorch / vllm-native / Sonnet), **vLLM-native CUDA is 1.06-1.65× FASTER than Sonnet 4's Triton at every Qwen3-8B shape**. Sonnet's Triton beats unfused PyTorch by 3-10× but loses to vLLM-native by 6-65%. **Production-relevant kernel-novelty speedup at the rmsnorm surface is NEGATIVE.** silu_mul 3-way couldn't run at vllm 0.20.0 (`_custom_ops.silu_and_mul` AttributeError); v2 finding (Sonnet's silu_mul Triton 3-100× slower than PyTorch unfused) stands but production-relative still unmeasured. Q2 partial (~20% L1 surrogate kept-rate; below 30% — opens T-26c). Q3 deferred. Q4 AFFIRMED — T-29 dropped NOV-half failure rate to 1/9. 7 PRs of latent bugs (#14-#21). Opens T-30 (silu_mul patch verification), T-31 (L3 kernel-level mode), **T-32 (production-baseline gap; three-way primitive required for any future kernel claim)**. |
+| T-38 | Driver rate-search mirroring `auto_tune.sh` rate-down algorithm | (PR #55) | `harness/driver.py:run_driver_with_rate_search` walks rate from `start_rate` down to `min_rate` in `step_size` halvings (mirroring `auto_tune.sh:34-67`); first rate whose P99 metrics all clear the goodput SLO wins. `L1EngineAdapter._run_benchmarks` selects this path when `driver_use_rate_search=True` (default) and a goodput SLO is configured; falls back to single-shot at `request_rate=inf` for legacy throughput-mode runs. Per-rate bench JSONs archived to `<artifact_dir>/<trial_id>/rate_<rate>.json`. Returns `(measurement, chosen_request_rate, per_rate_summaries)` so the per-trial JSON records both the chosen rate and the climb-down trace. 5 new tests in `test_driver.py`. |
+| T-39 | Process-group kill of candidate via `os.killpg` (candidate kept 74 GiB across trials at C04a attempt 9) | (PR #56) | `L1EngineAdapter._start_candidate` sets `start_new_session=True` on the `subprocess.Popen` call so candidate + EngineCore children share a process group. `_stop_candidate` calls `os.killpg(pgid, SIGTERM)` then escalates to `SIGKILL` after a 30s wait, with a 5s post-kill sleep for CUDA driver cleanup. Without this, a bench timeout would orphan the EngineCore child holding all the HBM, and the next trial's reference probe would see only 5.45/79 GiB free. 4 new tests in `test_l1_adapter.py`. |
+| T-40 | Explicit `--percentile-metrics ttft,tpot,itl,e2el` + bench timeout 1800→600s | (PR #57) | `build_bench_command` now always emits `--percentile-metrics ttft,tpot,itl,e2el` (vLLM bench's `--save-result` JSON otherwise omits `e2el` fields silently, breaking E2E-SLO goodput evaluation). `driver_timeout_s` default lowered from 1800 to 600 in `L1EngineAdapter` — at the C04 workload a single bench should complete in ~30-90s, so a 10-minute ceiling catches stuck runs without wasting trial budget. 3 new tests in `test_driver.py`. |
+| T-41 | Workload params (`random_input_len`/`random_output_len`) threaded through DriverConfig → adapter → driver | (PR #58) | Driver was hardcoded to `random_input_len=128, random_output_len=64` while T-37 Baseline B used `256/20`. With `output_len=64` and a 500ms E2E SLO, decode time alone (~1920ms) exceeded the SLO, so no rate could meet the goodput target at C04a attempt 11. Added `DriverConfig.random_input_len: int = 128` and `random_output_len: int = 64` (Pydantic Field with ge=1), threaded through `L1EngineAdapter` (dataclass field, default 128/64) and `_run_benchmarks` → `run_driver`/`run_driver_with_rate_search` → `build_bench_command`. Builder passes both from `cfg.harness.driver` to the L1 spec. `examples/c04a-l1-restricted/config.yaml` and `examples/c04b-l1-full/config.yaml` updated to `256`/`20`. 7 new tests across `test_config.py` (defaults, T-37 B values, ge=1), `test_driver.py` (CLI emission), `test_builder_joint.py` (builder threading). Total CPU tests: 437 passing. |
+| C04a attempts 1-11 | Eleven launch attempts of C04a between 2026-05-26 and 2026-05-27 — none reached a comparable goodput dataset; 10 integration-layer fixes landed; opens T-42 for the clean relaunch | (PRs #49-#58); pre-reg outcome at `docs/research/campaigns/04-l1-autotune-comparable-2026-05-26.md` Outcome section | Q1+Q2 NOT MEASURED — the workload mismatch persisted until T-41 landed within the attempt 11 budget. **Q3 AFFIRMED at 100% kept-rate (20/20)** at attempt 7, far above the 30% Outcome A3 threshold — T-26c L1 surrogate validated under real serving (caveat: under shared-GPU contention the gate's auto-calibrated KL ceiling was looser than the configured 2.0; tight-gate kept-rate awaits the 2-GPU re-run). Q4 partial — goodput-axis wiring is correct, but values were 0.0 because the workload prevented any rate from meeting the SLO. Total cost ~$9.77 vs $3.20 pre-reg estimate (3.05× overrun). Bugs fixed: `--image` flag (#49), `--model` flag (#50), reference replica max_model_len cap (#51), candidate max_model_len cap + stderr archival (#52), GMU inject parallel to clamp (#53), `--goodput` lowercase metric names (#54), T-38 rate-search (#55), T-39 process-group kill (#56), T-40 percentile-metrics + tighter timeout (#57), T-41 workload params threading (#58). |