6 changes: 6 additions & 0 deletions TODO.md
@@ -21,6 +21,7 @@ Bands by priority:

| ID | Item | Why it blocks | Reference |
|---|---|---|---|
| T-42 | C04a attempt 12 — clean relaunch after T-41 lands; matches T-37 Baseline B workload (256/20) on 2× A100 spot | C04 pre-reg's Q1 and Q2 remain unanswered after 11 attempts produced integration-layer fixes but no comparable goodput dataset. After PR #58 (T-41) landed `random_input_len`/`random_output_len` through the config, the C04a config now matches T-37 Baseline B exactly. Single capped relaunch with $2-2.50 budget is the right next step; if it produces a clean 20/20 kept dataset, C04b follows; if not, harness needs architectural work beyond per-bug incremental fixes. **Requires user GPU-spend authorization.** | `examples/c04a-l1-restricted/config.yaml` (random_input_len: 256, random_output_len: 20); pre-reg `docs/research/campaigns/04-l1-autotune-comparable-2026-05-26.md` |
| T-30 | Verify vLLM's MLP path actually calls patched `SiluAndMul.forward_cuda` during serving | Kernel-level audit (2026-04-30) found Sonnet 4's silu_mul implementations are 3-100× SLOWER at kernel level, yet C03-S end-to-end silu_mul NOV pairs were ties (+0.10%). This strongly suggests vLLM bypasses our monkey-patch via a fused gemm+silu+mul path or `forward_native` dispatcher. Without this verification any silu_mul-related kernel claim is unsound. | `layers/l3_kernel/injector.py:_TARGET_BINDINGS` for SiluAndMul |
| T-31 | Replace L3 mode='vllm' end-to-end paired-control with kernel-level paired-control for non-hot-path kernels | Kernel-level audit (2026-04-30) confirmed end-to-end serving tok/s on Qwen3-8B cannot expose 2-9× kernel-level rmsnorm speedups because rmsnorm is ~3-5% of compute and end-to-end noise is ±0.5-1%. The audit's microbench pattern (importlib + cudaEvent + median timing across multiple shapes) is the right primitive; bake it into the L3 adapter as `mode='kernel'` and use that for any non-hot-path Q1. **Must use three-way (PyTorch / vllm-native / NOV) primitive per T-32.** | `layers/l3_kernel/adapter.py` |
| T-32 | Production-baseline gap: Sonnet 4's single-shot Triton rmsnorm loses to vLLM-native CUDA at every Qwen3-8B shape | Three-way audit (2026-04-30, `kernel_level_audit_v3_three_way.json`) shows vLLM-native CUDA rmsnorm is **1.06–1.65× faster than Sonnet 4's Triton at every Qwen3-8B shape on 1× A100 80GB PCIe**. The original v2 "2–9× faster than PyTorch" reading was a strawman comparison: production vLLM doesn't run unfused PyTorch. **Citable production-relevant claim is currently NEGATIVE for the rmsnorm surface.** Three options: (1) re-run with stronger code-emission model (Sonnet 4.5, GPT-5-codex, DeepSeek-Coder-33B); (2) add post-emission Triton autotune sweep (BLOCK_SIZE, num_warps, num_stages) before the production-baseline A/B; (3) target less-optimised kernel surfaces. Also: silu_mul three-way could not run at vllm 0.20.0 — `_custom_ops.silu_and_mul` AttributeError; needs binding-name fix before silu_mul claim is even comparable. | `kernel_level_audit_v3_three_way.json`, `layers/l3_kernel/proposer.py` |
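
The T-30 check (does serving actually reach the patched `forward_cuda`?) can be approached with a call counter wrapped around the patched method. The following is only a minimal sketch under stated assumptions: `FakeSiluAndMul` is a hypothetical stand-in rather than vLLM's real class, `count_calls` is an illustrative helper that does not exist in this repository, and in a real run the counter would have to live in whichever process executes the model.

```python
import functools


def count_calls(cls, method_name):
    """Wrap cls.<method_name> so every invocation increments a counter.

    Returns a dict whose "calls" entry holds the running count.
    (Illustrative helper, not part of the harness.)
    """
    original = getattr(cls, method_name)
    counter = {"calls": 0}

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        counter["calls"] += 1
        return original(self, *args, **kwargs)

    setattr(cls, method_name, wrapper)
    return counter


class FakeSiluAndMul:
    """Hypothetical stand-in for a layer with two dispatch targets."""

    def forward_native(self, x):
        return [v * 2 for v in x]

    def forward_cuda(self, x):
        return [v * 2 for v in x]

    def forward(self, x):
        # This stand-in always routes to forward_native, mimicking the
        # kind of bypass that T-30 suspects.
        return self.forward_native(x)


cuda_counter = count_calls(FakeSiluAndMul, "forward_cuda")
native_counter = count_calls(FakeSiluAndMul, "forward_native")

FakeSiluAndMul().forward([1.0, 2.0])
print("forward_cuda calls:", cuda_counter["calls"])
print("forward_native calls:", native_counter["calls"])
```

A zero count on the patched method after a serving run would support the bypass hypothesis; a non-zero count would point the investigation elsewhere, for example at a fused path upstream of the layer.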
@@ -74,3 +75,8 @@ Bands by priority:
| T-37 | vLLM `auto_tune.sh` baseline on Basilica — three reference points captured | (this commit) | Per C04 recon's 2026-05-26 corrective addendum ("Both, in sequence" + H100 anchor). Three baselines captured on `vllm/vllm-openai:v0.21.0` (commit `ad7125a431e176d4161099480a66f0169609a690`), all via the SDK-orchestrated `scripts/run_auto_tune_baseline.py` (PR #41–#45): **Baseline A** (A100 spot, INPUT=1800/OUTPUT=20, no SLO) → max_num_seqs=256, max_num_batched_tokens=4096, **throughput=8.53 req/s**. **Baseline B** (A100 spot, INPUT=256/OUTPUT=20, 500 ms SLO) → 256/512, **goodput=21.39 req/s**, P99 E2EL=494.60 ms. **Baseline C** (H100 spot, INPUT=1800/OUTPUT=20, 500 ms SLO; the auto_tune README target) → 256/512, **goodput=2.97 req/s**, P99 E2EL=457.27 ms. Total cost ~$3.17 across all attempts + 3 final baselines. Three earlier-attempt fixes landed during T-37 (PRs #43/#44/#45): apt-install bc, rename cloned vllm source dir to avoid import shadow, sed-patch `auto_tune.sh` `HOSTNAME=$(hostname)` → `HOSTNAME=localhost`. Raw artifact + full per-cell grids + reproduction recipe in `docs/research/raw/auto_tune-baseline-2026-05-26.md`. |
| T-29 | Paired-control prompt robustness — split into per-cell sequential calls | (PR #17) | `KernelProposer.propose_for_cells` now issues N separate per-cell LLM calls via `build_single_cell_kernel_prompt` instead of one batched 6-block paired prompt. C03-S validated: NOV-half failure rate dropped from C02's 2/6 (33%) to 1/9 (11%), at the pre-registered Outcome H threshold. |
| Campaign 03 | A100 narrow-replication paired-control (1× A100 spot, OpenRouter Sonnet 4, T-26b + T-29 + 1-GPU mode) | run completed `8c2ef41`; pre-registration + outcome at `docs/research/campaigns/03-h100-replication-2026-04-27.md` | 60 trials in 144 min, ~$15 for the final S run + ~$30-40 in earlier failed attempts. **Q1 RETRACTED 2026-04-30 — twice.** First retraction (v2 audit, `kernel_level_audit_results.json`): end-to-end serving tok/s on Qwen3-8B is dominated by attention (rmsnorm ~3-5% of compute), so kernel-level speedups can't clear the ±0.5-1% noise floor. Sonnet's rmsnorm kernels measured 2-9× faster than PyTorch unfused reference. **v3 audit re-correction (`kernel_level_audit_v3_three_way.json`):** the v2 framing was a strawman — production vLLM runs `vllm._custom_ops.rms_norm` (a hand-tuned CUDA kernel), not unfused PyTorch. With three-way comparison (PyTorch / vllm-native / Sonnet), **vLLM-native CUDA is 1.06-1.65× FASTER than Sonnet 4's Triton at every Qwen3-8B shape**. Sonnet's Triton beats unfused PyTorch by 3-10× but loses to vLLM-native by 6-65%. **Production-relevant kernel-novelty speedup at the rmsnorm surface is NEGATIVE.** silu_mul 3-way couldn't run at vllm 0.20.0 (`_custom_ops.silu_and_mul` AttributeError); v2 finding (Sonnet's silu_mul Triton 3-100× slower than PyTorch unfused) stands but production-relative still unmeasured. Q2 partial (~20% L1 surrogate kept-rate; below 30% — opens T-26c). Q3 deferred. Q4 AFFIRMED — T-29 dropped NOV-half failure rate to 1/9. 7 PRs of latent bugs (#14-#21). Opens T-30 (silu_mul patch verification), T-31 (L3 kernel-level mode), **T-32 (production-baseline gap; three-way primitive required for any future kernel claim)**. |
| T-38 | Driver rate-search mirroring `auto_tune.sh` rate-down algorithm | (PR #55) | `harness/driver.py:run_driver_with_rate_search` walks rate from `start_rate` down to `min_rate` in `step_size` halvings (mirroring `auto_tune.sh:34-67`); first rate whose P99 metrics all clear the goodput SLO wins. `L1EngineAdapter._run_benchmarks` selects this path when `driver_use_rate_search=True` (default) and a goodput SLO is configured; falls back to single-shot at `request_rate=inf` for legacy throughput-mode runs. Per-rate bench JSONs archived to `<artifact_dir>/<trial_id>/rate_<rate>.json`. Returns `(measurement, chosen_request_rate, per_rate_summaries)` so the per-trial JSON records both the chosen rate and the climb-down trace. 5 new tests in `test_driver.py`. |
| T-39 | Process-group kill of candidate via `os.killpg` (candidate kept 74 GiB across trials at C04a attempt 9) | (PR #56) | `L1EngineAdapter._start_candidate` sets `start_new_session=True` on the `subprocess.Popen` call so candidate + EngineCore children share a process group. `_stop_candidate` calls `os.killpg(pgid, SIGTERM)` then escalates to `SIGKILL` after a 30s wait, with a 5s post-kill sleep for CUDA driver cleanup. Without this, a bench timeout would orphan the EngineCore child holding all the HBM, and the next trial's reference probe would see only 5.45/79 GiB free. 4 new tests in `test_l1_adapter.py`. |
| T-40 | Explicit `--percentile-metrics ttft,tpot,itl,e2el` + bench timeout 1800→600s | (PR #57) | `build_bench_command` now always emits `--percentile-metrics ttft,tpot,itl,e2el` (vLLM bench's `--save-result` JSON otherwise omits `e2el` fields silently, breaking E2E-SLO goodput evaluation). `driver_timeout_s` default lowered from 1800 to 600 in `L1EngineAdapter` — at the C04 workload a single bench should complete in ~30-90s, so a 10-minute ceiling catches stuck runs without wasting trial budget. 3 new tests in `test_driver.py`. |
| T-41 | Workload params (`random_input_len`/`random_output_len`) through DriverConfig → adapter → driver | (PR #58) | Driver was hardcoded to `random_input_len=128, random_output_len=64` while T-37 Baseline B used `256/20`. With `output_len=64` and a 500ms E2E SLO, decode time alone (~1920ms) exceeded the SLO, so no rate could meet the goodput target at C04a attempt 11. Added `DriverConfig.random_input_len: int = 128` and `random_output_len: int = 64` (Pydantic Field with ge=1), threaded through `L1EngineAdapter` (dataclass field, default 128/64) and `_run_benchmarks` → `run_driver`/`run_driver_with_rate_search` → `build_bench_command`. Builder passes both from `cfg.harness.driver` to the L1 spec. `examples/c04a-l1-restricted/config.yaml` and `examples/c04b-l1-full/config.yaml` updated to `256`/`20`. 7 new tests across `test_config.py` (defaults, T-37 B values, ge=1), `test_driver.py` (CLI emission), `test_builder_joint.py` (builder threading). Total CPU tests: 437 passing. |
| C04a attempts 1-11 | Eleven launch attempts of C04a between 2026-05-26 and 2026-05-27 — none reached a comparable goodput dataset; 10 integration-layer fixes landed; opens T-42 for the clean relaunch | (PRs #49-#58); pre-reg outcome at `docs/research/campaigns/04-l1-autotune-comparable-2026-05-26.md` Outcome section | Q1+Q2 NOT MEASURED — workload mismatch persisted until T-41 landed in attempt 11 budget. **Q3 AFFIRMED at 100% kept-rate (20/20)** at attempt 7, far above the 30% Outcome A3 threshold — T-26c L1 surrogate validated under real serving (caveat: under shared-GPU contention the gate's auto-calibrated KL ceiling was looser than the configured 2.0; tight-gate kept-rate awaits the 2-GPU re-run). Q4 partial — goodput-axis wiring correct but values were 0.0 because workload blocked any rate from meeting SLO. Total cost ~$9.77 vs $3.20 pre-reg estimate (3.05× overrun). Bugs fixed: `--image` flag (#49), `--model` flag (#50), reference replica max_model_len cap (#51), candidate max_model_len cap + stderr archival (#52), GMU inject parallel to clamp (#53), `--goodput` lowercase metric names (#54), T-38 rate-search (#55), T-39 process-group kill (#56), T-40 percentile-metrics + tighter timeout (#57), T-41 workload params threading (#58). |
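
The process-group teardown described in the T-39 row can be sketched with the standard library alone. This is a hedged illustration of the technique, not the adapter's actual code: `start_in_new_group` and `stop_group` are hypothetical names, the sleeping child stands in for a candidate server, and the sketch is POSIX-only. The 30s and 5s defaults mirror the values quoted in the row.

```python
import os
import signal
import subprocess
import sys
import time


def start_in_new_group(cmd):
    """Launch cmd as the leader of a fresh session and process group."""
    # With start_new_session=True the child's pgid equals its pid, so any
    # grandchildren it spawns can be signalled together via os.killpg.
    return subprocess.Popen(cmd, start_new_session=True)


def stop_group(proc, term_wait_s=30.0, post_kill_sleep_s=5.0):
    """SIGTERM the whole group, then escalate to SIGKILL on timeout."""
    pgid = proc.pid  # valid because the child leads its own group
    try:
        os.killpg(pgid, signal.SIGTERM)
    except ProcessLookupError:
        pass  # the group is already gone
    try:
        proc.wait(timeout=term_wait_s)
    except subprocess.TimeoutExpired:
        try:
            os.killpg(pgid, signal.SIGKILL)
        except ProcessLookupError:
            pass
        proc.wait()
    # Pause so lower-level cleanup (e.g. GPU driver teardown) can finish.
    time.sleep(post_kill_sleep_s)
    return proc.returncode


# Demo with a harmless sleeping child in place of a real candidate server.
proc = start_in_new_group([sys.executable, "-c", "import time; time.sleep(60)"])
rc = stop_group(proc, term_wait_s=5.0, post_kill_sleep_s=0.0)
print("return code:", rc)
```

One limitation of this sketch: it waits only on the group leader, so a child that ignores SIGTERM after the leader has exited would need a separate liveness check before the next trial starts.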