feat(harness): typed adapters + format-aware session-inject proxy + multi-turn bandit driver #266
dusterbloom wants to merge 39 commits into
- Single PR target: ~220 LOC, no kernel touches, no new compression mechanism
- Foundations cited by commit hash from the evidence branch (NIAH envelope, DFlash composition, Codex design doc)
- Known limits explicitly documented (MTP crash; the 64K NIAH cliff is a synthetic-class problem, not an agentic one)
- Day-by-day breakdown with per-day exit gates and bail conditions
- Drift discipline: this PR rejects scope creep; everything else is follow-up
…bandit MVP)
- Add float accept_rate = 0.0f to GenerateResult struct (model_backend.h)
- Thread out_accept_rate through do_spec_decode signature; populate from n_accept_sum/total_draft_pos after spec-decode loop
- AR fallback and no-draft paths leave accept_rate = 0.0 (correct sentinel)
- Expose accept_rate in usage block of all three response formats (OPENAI_CHAT, ANTHROPIC, RESPONSES)
- 6 new unit tests in test_server_unit.cpp; 154 assertions, 0 failures; ctest 1/1 PASSED
- MTP path (line 1225 per original plan) does not exist at current HEAD, so no stub is needed; DFlash chain is the only spec-decode path in qwen35_backend.cpp
- AdaptiveKeepRatioState struct + step_adaptive_keep_ratio() pure fn
- EMA-smoothed accept_rate signal, step 0.005/0.01, clamped [0.025, 0.20]
- HttpServerSessions thread-safe per-session container
- 11 unit tests (19 assertions) all GREEN, CPU-only
- CMakeLists: adaptive_keep ctest target registered
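To make the update rule above concrete, here is a minimal Python sketch of an EMA-smoothed, clamped step function. Only the clamp range [0.025, 0.20] and the two step sizes come from the commit message. The smoothing factor, the accept-rate threshold, the choice of which step size goes up and which goes down, and the names `KeepState` and `step_keep_ratio` are all assumptions made for illustration; the real logic lives in the C++ `step_adaptive_keep_ratio()` and may differ.

```python
from dataclasses import dataclass
from typing import Optional

KEEP_MIN, KEEP_MAX = 0.025, 0.20  # clamp range quoted in the commit message
STEP_UP, STEP_DOWN = 0.01, 0.005  # ASSUMED mapping of the two quoted step sizes
EMA_ALPHA = 0.3                   # HYPOTHETICAL smoothing factor
TARGET_ACCEPT = 0.5               # HYPOTHETICAL threshold on the smoothed accept rate


@dataclass(frozen=True)
class KeepState:
    """Per-session state; immutable so the step function stays pure."""
    keep_ratio: float = 0.10
    ema: Optional[float] = None
    turns: int = 0


def step_keep_ratio(state: KeepState, accept_rate: float) -> KeepState:
    """Return the next state given one observed accept rate."""
    # Seed the EMA with the first observation, then smooth afterwards.
    if state.ema is None:
        ema = accept_rate
    else:
        ema = EMA_ALPHA * accept_rate + (1.0 - EMA_ALPHA) * state.ema
    # Low smoothed acceptance -> keep more context; high -> compress harder.
    delta = STEP_UP if ema < TARGET_ACCEPT else -STEP_DOWN
    keep = min(KEEP_MAX, max(KEEP_MIN, state.keep_ratio + delta))
    return KeepState(keep_ratio=keep, ema=ema, turns=state.turns + 1)
```

Keeping the function pure (state in, state out) mirrors the "pure fn" wording in the commit and makes the rule testable on CPU without a server.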
- ParsedRequest gains session_id field (parsed from extra_body or top-level)
- HttpServer gains HttpServerSessions sessions_ member
- pre-compress: keep_ratio from sessions_.get_keep_ratio() when session_id set
- post-generate: sessions_.update() + [pflash-bandit] log line per turn
- test_bandit_integration: 6 tests, 16 assertions, all GREEN (189 total)
- 3 conditions: A fixed keep=0.05, B fixed keep=0.20, C bandit initial=0.10
- All 3: OK_DONE=YES (no regression); PFlash BF16 drafter confirmed working
- Bandit fired: session=claude_code_s1 turn=1 keep=0.1000->0.1100 (accept=0.347)
- A: 5.7% effective keep, 26.4% accept; B: 19.6% keep, 17.9% accept; C: 34.7% accept
- Note: C used short 62-tok prompt via curl; like-vs-like follow-up queued for Day 5
- session_inject_proxy.py: thin HTTP proxy (~110 LOC) that intercepts POST /v1/messages and injects extra_body.session_id before forwarding to dflash server; handles JSON and SSE streaming
- run_claude_code.sh: start proxy on PFLASH_PROXY_PORT (default 18082) when PFLASH_SESSION_ID is set; point claude CLI at proxy; kill on exit
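The injection step the proxy performs can be sketched as a pure function over the request body. This is an illustrative sketch, not the code in session_inject_proxy.py: the function name `inject_session_id` is hypothetical, the route tuple follows the three routes listed under INJECT_ROUTES in the PR description, and the "leave an existing session_id alone" behavior follows the later test description ("preserves existing, passes through GET").

```python
import json

# Routes taken from the PR description; the first commit only covered /v1/messages.
INJECT_ROUTES = ("/v1/messages", "/v1/chat/completions", "/v1/responses")


def inject_session_id(method: str, path: str, raw_body: bytes, session_id: str) -> bytes:
    """Return the body to forward upstream, with extra_body.session_id set if absent."""
    # Only POSTs to known routes are rewritten; everything else passes through.
    if method != "POST" or path.split("?", 1)[0] not in INJECT_ROUTES:
        return raw_body
    try:
        payload = json.loads(raw_body)
    except ValueError:  # also covers undecodable bytes (UnicodeDecodeError)
        return raw_body
    if not isinstance(payload, dict):
        return raw_body
    # The server reads session_id from extra_body or top level; never overwrite either.
    if "session_id" in payload:
        return raw_body
    extra = payload.setdefault("extra_body", {})
    if not isinstance(extra, dict):
        return raw_body
    extra.setdefault("session_id", session_id)
    return json.dumps(payload).encode("utf-8")
```

A real proxy would also have to recompute Content-Length after rewriting the body and stream SSE responses back unchanged; both are omitted here.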
All 3 conditions: same claude CLI harness, same 11K decode_check prompt
- A fixed keep=0.05: wall=17s accept=31.7% OK_DONE=YES
- B fixed keep=0.20: wall=19s accept=25.4% OK_DONE=YES
- C bandit keep=0.10: wall=16s accept=31.9% OK_DONE=YES
Bandit fired: session=claude_code_day5_s1 turn=1 keep=0.1000->0.1100
Bandit Pareto-dominates B on wall (-3s) and accept_rate (+6.5pp); ties A
…w project convention)
- namespace dflash { → namespace dflash::common { in adaptive_keep_ratio.h
- adds get_ema() accessor (used by Blocker 2 log fix)
- drops dflash:: qualifier from kBanditEmaAlpha refs in http_server.cpp (now in-namespace)
- http_server.h: drops now-redundant dflash:: qualifier on HttpServerSessions field
- tests: using namespace dflash → using namespace dflash::common
- removes the algebraically-trivial alpha*x+(1-alpha)*x stub and (void)ema_val
- calls sessions_.get_ema() after update() to log the actual per-session EMA
- log line now matches PLAN.md:60 shape: keep=<old>-><new> ema=<ema> accept=<observed>
- adds get_ema_reflects_post_update_value test to test_adaptive_keep_ratio.cpp
…accept must signal too)
- adds spec_decode_ran bool to GenerateResult (model_backend.h)
- do_spec_decode sets out_spec_ran=true on spec path, false on AR fallback
- both generate() and restore_and_generate() propagate result.spec_decode_ran
- http_server.cpp guard: accept_rate>0 → spec_decode_ran
- test_bandit_integration: zero_accept_rate_guard → zero_accept_drives_keep_up
…roject convention
- drop claude_home/ and *.claude.json.backup.* trees from dflash/bench/results/2026-05-2*/; keep only metrics.txt + client.out per condition
- ignore dflash/bench/results/ going forward (new runs won't drag claude_home into git)
- move PLAN.md → thoughts/2026-05-21_pflash_mvp_plan.md (project convention; prior plans live there)
- delete dflash/bench/run_day4_ab.sh, superseded by run_day5_abc.sh per its own header
Replace `submodules: recursive` in actions/checkout with an explicit git-submodule-update step that injects a PAT via insteadOf when secrets.SUBMODULE_PAT is set. The GITHUB_TOKEN issued for fork PRs does not have cross-repo access to private org repos (Luce-Org/llama.cpp-dflash-ggml), causing intermittent "could not read Username" failures. With a PAT the auth is explicit and stable. Requires: add secret SUBMODULE_PAT (classic PAT, repo scope on Luce-Org) to Luce-Org/lucebox-hub repo settings -> Secrets -> Actions.
- NIAH 16K: 5/5 baseline (keep=0.20) and 5/5 bandit (keep=0.10); no retrieval degradation
- NIAH 32K: 5/5 baseline and 5/5 bandit; compression 5x->10x halves target prefill time
- 3-seed Day-5 A/B/C: decode_check / logic_check / math_check prompts, all ok_done=YES
- Pareto: C (bandit) wall=16.3±3.4s vs B wall=24.7±3.1s (1.52x); ar=34.6% vs 32.8%
- Bandit fired in all 3 sessions; per-session state isolation confirmed
… (seeds #1, #2)
- StubServer: ThreadingHTTPServer recorder, zero new deps (mirrors llamacpp_compat_proxy.py pattern)
- Seed #2 green: proxy injects session_id on /v1/messages, preserves existing, passes through GET
- Seed #1 documented: chat/completions round-trip passes; injection assertion commented out pending commit 3
…-Org#3)
- Add preflight_require_bin(): exit 78 + actionable asdf hint when binary missing
- Flip LUCEBOX_SERVER_BACKEND default: python → cpp (plan requirement)
- 4 tests green: missing binary exits 78 with asdf hint; present binary exits 0
- run_codex, run_pi, run_opencode, run_hermes: call preflight before start_lucebox_server
- run_claude_code: add preflight for claude binary + export LUCEBOX_SERVER_BACKEND=cpp
- bash -n clean on all 5 scripts
… (seed Luce-Org#5)
- BanditRunMetrics dataclass: accept_rate/wall_s/tokens all Optional
- parse_bandit_log_line(): None for absent fields, not "N/A" strings
- 6 tests green; Day-4-v2 missing accept_rate fixture passes without N/A leak
…mand (seeds Luce-Org#4, Luce-Org#6)
- _BaseAdapter: preflight_check (shutil.which) + dry_run returning AdapterResult
- ClaudeCodeAdapter, HermesAdapter, CodexAdapter, PiAdapter, OpenCodeAdapter
- run_bandit(): preflight → dry/live run → CSV writer (6 columns per exit-gate spec)
- bandit subcommand + top-level --condition/--clients shorthand preserved
- Seed Luce-Org#4 green: dry_run returns AdapterResult with session_id
- Seed Luce-Org#6 green: 5-adapter dry-run emits 5-row CSV with required columns
- --adapter <name> as single-client alias for --clients (exit-gate for commit 7)
- --clients/--condition can be top-level flags (no subcommand required)
- cmd_bandit handles both --adapter and --clients, default condition C_bandit
- 2 CLI subprocess tests added
…ADME
- _BaseAdapter.live_run(): subprocess into run_<client>.sh with PFLASH_SESSION_ID
- Each concrete adapter overrides live_run() with the right script path
- run_bandit() live mode calls adapter.live_run() instead of dry_run stub
- Delete run_codex.sh, run_hermes.sh, run_opencode.sh, run_pi.sh (ported to Python)
- README: headless bandit invocation + single-client bash section
- cpp backend (default): resolves dflash_server binary via DFLASH_SERVER_BIN or dflash/build/dflash_server
- python backend (opt-in): uses dflash/scripts/server.py as before
- RuntimeError with actionable message when cpp binary missing
…oken asdf shims
- base preflight_check probes with --version, checks exit code + stderr for asdf shim markers
- CodexAdapter/PiAdapter override with --help (codex/pi don't support --version)
- fail closed on timeout; emit actionable message naming the reshim command
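The preflight probe described above can be sketched roughly as follows. This is a hedged illustration rather than the adapter code: the function name and return shape are hypothetical, and the asdf marker strings are assumptions about what a broken shim prints to stderr, so they should be checked against real asdf output before being relied on.

```python
import shutil
import subprocess
import sys
from typing import Sequence, Tuple

# ASSUMED stderr fragments emitted by a broken asdf shim; verify locally.
ASDF_SHIM_MARKERS = ("No version is set for command", "No preset version installed")


def preflight_check(binary: str,
                    probe_args: Sequence[str] = ("--version",),
                    timeout_s: float = 10.0) -> Tuple[bool, str]:
    """Return (ok, detail); fails closed on a missing binary, timeout, or shim error."""
    path = shutil.which(binary)
    if path is None:
        return False, f"PREFLIGHT FAIL: {binary} not found on PATH"
    try:
        proc = subprocess.run([path, *probe_args], capture_output=True,
                              text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False, f"PREFLIGHT FAIL: {binary} probe timed out after {timeout_s}s"
    except OSError as exc:
        return False, f"PREFLIGHT FAIL: {binary} could not be executed: {exc}"
    # A shim can exist on PATH yet fail to resolve the tool behind it.
    if any(marker in proc.stderr for marker in ASDF_SHIM_MARKERS):
        return False, f"PREFLIGHT FAIL: {binary} is a broken asdf shim; try 'asdf reshim'"
    if proc.returncode != 0:
        return False, f"PREFLIGHT FAIL: {binary} probe exited {proc.returncode}"
    return True, path
```

Clients without a `--version` flag would pass `probe_args=("--help",)`, matching the override the commit describes for codex and pi.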
…to deleted bash scripts)
- CodexAdapter: writes temp config.toml, invokes codex exec directly
- PiAdapter: writes temp models.json + settings.json, invokes pi directly
- HermesAdapter: drives hermes chat --provider lucebox via /v1/chat/completions
- OpenCodeAdapter: writes temp opencode.json, invokes opencode run in project dir
…attern
- README: run_codex/hermes/opencode/pi.sh refs → python3 -m harness.client_test_runner bandit --clients <name>
- run_backend_pair.sh: codex/pi/hermes/opencode case arms invoke Python runner; bash path kept for claude_code/openclaw/openwebui*
- CLIENT_SCRIPT="" sentinel routes python-adapter clients through new branch in run_backend()
…it was never firing)
…own flags
dflash_server (C++) requires the target model as argv[1] and rejects unknown options with exit 2. Two compounding bugs killed the harness-managed server, leaving server.log empty and accept_rate blank in the bandit CSV:
1. start_server (cpp branch) passed the target via --target, but no such flag exists in dflash/src/server/server_main.cpp; argv[1] starting with '-' triggers the usage banner at server_main.cpp:158-160.
2. BANDIT_SERVER_PROFILE carried four Python-server-only flags (--budget, --verify-mode, --prefix-cache-slots, --prefill-cache-slots) that the C++ parser rejects via server_main.cpp:295-298.
With those gone the server stays up and writes [pflash]/[spec-decode] lines that run_bandit + metrics_parser already wire into AdapterResult.
Regression tests:
- TestRunBanditWiresAcceptRate exercises run_bandit directly (previous tests only re-implemented the wiring inline).
- TestBanditServerProfileHasPflash::test_bandit_server_profile_only_cpp_recognised_flags guards against future stale-flag drift.
…rmes/opencode preflights
- ClaudeCodeAdapter.live_run now calls `claude --print` directly via subprocess with ANTHROPIC_BASE_URL/ANTHROPIC_API_KEY/CLAUDE_CODE_API_BASE_URL env vars; no second server spawn, generation-heavy 700-word prompt ensures bandit cycles
- HermesAdapter/OpenCodeAdapter preflight_check return False with honest reasons (HERMES_CONFIG_BUG / PROVIDER_CONFIG_BUG) instead of binary-check false-positives
- BANDIT_SERVER_PROFILE and PFLASH/BANDIT server profiles: remove unsupported --lazy-draft flag
- metrics_parser.extract_accept_rate_from_log: parses plain-text [pflash-bandit] accept=... lines
- 47 tests green (+9 new regression tests for all of the above)
- Add bandit-session subcommand: starts server once, runs N turns of claude_code with same session_id, captures per-turn keep_ratio trajectory
- Add BanditTurnRecord dataclass + parse_bandit_session_from_log to metrics_parser: parses [pflash-bandit] keep=A->B lines per turn
- Add 4 prompt files (logic_check, math_check, code_gen, explain_algo) for generation-heavy multi-turn runs
- Write results to /tmp/harness_adaptive_evidence.csv and dflash/bench/results/YYYY-MM-DD_adaptive_evidence/
- Sanity check: warns if keep_after is stuck across all turns
- +9 tests (56 total, all green)
- Live run: 5 turns, keep_after 0.1100→0.1200→0.1300→0.1400→0.1500
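A per-turn parser of this kind might look like the sketch below. It assumes the log shape quoted elsewhere in this PR (`[pflash-bandit] session=<id> turn=<n> keep=<old>-><new> ema=<ema> accept=<observed>`); the names `TurnRecord`, `parse_bandit_session`, and `keep_is_stuck` are hypothetical stand-ins, not the actual metrics_parser API. Absent fields come back as None rather than a placeholder string, in line with the earlier BanditRunMetrics commit.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# ema= and accept= are optional so older log lines still parse.
BANDIT_LINE = re.compile(
    r"\[pflash-bandit\]\s+session=(?P<session>\S+)\s+turn=(?P<turn>\d+)"
    r"\s+keep=(?P<before>\d*\.?\d+)->(?P<after>\d*\.?\d+)"
    r"(?:\s+ema=(?P<ema>\d*\.?\d+))?"
    r"(?:\s+accept=(?P<accept>\d*\.?\d+))?"
)


@dataclass
class TurnRecord:
    session_id: str
    turn: int
    keep_before: float
    keep_after: float
    ema: Optional[float]
    accept_rate: Optional[float]


def _opt_float(text: Optional[str]) -> Optional[float]:
    return float(text) if text is not None else None


def parse_bandit_session(log_text: str, session_id: str) -> List[TurnRecord]:
    """Collect one record per [pflash-bandit] line belonging to session_id."""
    records = []
    for m in BANDIT_LINE.finditer(log_text):
        if m.group("session") != session_id:
            continue
        records.append(TurnRecord(
            session_id=session_id,
            turn=int(m.group("turn")),
            keep_before=float(m.group("before")),
            keep_after=float(m.group("after")),
            ema=_opt_float(m.group("ema")),
            accept_rate=_opt_float(m.group("accept")),
        ))
    return records


def keep_is_stuck(records: List[TurnRecord]) -> bool:
    """True when several turns ran but keep_after never changed."""
    return len(records) > 1 and len({r.keep_after for r in records}) == 1
```

Filtering by session id keeps trajectories separate when several sessions share one server.log.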
- HermesAdapter.preflight_check: real binary check replaces hard-coded HERMES_CONFIG_BUG skip; passes when hermes binary is present and --version exits 0
- HermesAdapter.live_run: write temp HERMES_HOME/config.yaml with correct base_url + context_length overrides (model and auxiliary.compression) so hermes 0.14 doesn't reject the 32K server context
- Start session-inject proxy before hermes so [pflash-bandit] lines fire in server.log (same pattern as ClaudeCodeAdapter)
- _start_session_inject_proxy: default to free_port() instead of hardcoded 18082 to avoid collisions when server runs on that port
- Verified: [pflash-bandit] session=hermes-bandit-test-002 turn=1 keep=0.1000->0.1100 ema=0.123 accept=0.123
- .notes/harness-followups.md -> thoughts/2026-05-23_harness_followups.md
- removes .notes/ dir (now empty)
- aligns with project convention: thoughts/ for dated notes
1 issue found across 73 files
…ted HOME
- Add _resolve_nvm_bin() helper: tries nvm node versions in preference order (v24.13.0, v22.17.0, v20.18.0) bypassing asdf shims (same heuristic as commit 2600108)
- codex/pi preflight_env: use real HOME (asdf shims need it to resolve node)
- opencode: new preflight_env + preflight_check (was permanent FAIL stub)
- codex/pi/opencode live_run: use _resolve_nvm_bin fallback + prepend nvm node bin dir to PATH so node resolves when HOME is overridden to temp dir
- opencode live_run: write config to XDG_CONFIG_HOME/opencode/opencode.json (global config location, not project-level opencode.json in project dir)
…max_ctx 65K->49K
- bandit-session now starts one session-inject proxy for the whole session so all turns share the same session_id; enables prefix-cache warmup across turns (turn 2+ should show delta-token prefill instead of full-context prefill)
- BANDIT_SERVER_PROFILE: max_ctx 32768->49152, keep-ratio 0.10->0.05, add --prefill-skip-park (eliminates park/unpark overhead on 24 GB GPUs)
When BANDIT_SERVER_PROFILE (needs_prefill_drafter=True) is used, the cpp backend was also passing --draft to dflash_server, triggering an arch check that rejects plain qwen3 models. Only pflash-aware dflash-draft arch models pass this check, but the bandit profile only needs --prefill-drafter. Fix: skip --draft in the cpp start_server args when the profile already handles the drafter via needs_prefill_drafter.
sessions_ map grew unbounded when clients sent unique session_ids. Replace flat unordered_map with an LRU structure capped at DFLASH_BANDIT_MAX_SESSIONS (default 1024). On overflow the least-recently-used session is evicted; get_* calls count as touches. Two new unit tests: lru_cap_evicts_oldest, lru_touch_updates_eviction_order. All 29 tests pass.
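The eviction policy described here (capped store, least-recently-used entry dropped on overflow, reads counting as touches) can be illustrated with a small Python sketch. The real container is the C++ HttpServerSessions; the class below, its name, and its method set are a hypothetical analogue built on `collections.OrderedDict`, and it omits the locking the thread-safe original needs.

```python
from collections import OrderedDict


class LruSessionStore:
    """Capped per-session keep_ratio store; oldest-touched session is evicted first."""

    def __init__(self, max_sessions: int = 1024) -> None:
        if max_sessions <= 0:
            raise ValueError("max_sessions must be positive")
        self._max = max_sessions
        self._data = OrderedDict()  # session_id -> keep_ratio, oldest touch first

    def get_keep_ratio(self, session_id: str, default: float = 0.10) -> float:
        # Reads count as touches, so an actively polled session is not evicted.
        if session_id in self._data:
            self._data.move_to_end(session_id)
            return self._data[session_id]
        return default

    def update(self, session_id: str, keep_ratio: float) -> None:
        self._data[session_id] = keep_ratio
        self._data.move_to_end(session_id)
        # Evict from the least-recently-used end until back under the cap.
        while len(self._data) > self._max:
            self._data.popitem(last=False)

    def __contains__(self, session_id: str) -> bool:
        return session_id in self._data  # membership checks do not touch

    def __len__(self) -> int:
        return len(self._data)
```

With a cap of 2, writing sessions a and b, reading a, then writing c evicts b, which is the ordering the lru_touch_updates_eviction_order test name suggests.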
dusterbloom left a comment:
Addressed in 17525ea: replaced the flat unordered_map with an LRU structure capped at DFLASH_BANDIT_MAX_SESSIONS (default 1024). On overflow the least-recently-used session is evicted; reads (get_keep_ratio, get_ema, turn_count) count as touches. Two new unit tests added: lru_cap_evicts_oldest and lru_touch_updates_eviction_order — all 29 pass.
1 issue found across 2 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="dflash/src/server/adaptive_keep_ratio.h">
<violation number="1" location="dflash/src/server/adaptive_keep_ratio.h:61">
P1: DFLASH_BANDIT_MAX_SESSIONS parsed with std::atol without negative-value validation; negative env values wrap to SIZE_MAX, silently disabling LRU eviction</violation>
</file>
<file context>
@@ -44,40 +46,93 @@ inline AdaptiveKeepRatioState step_adaptive_keep_ratio(
+ max_sessions_ = max_sessions;
+ } else {
+ const char* env = std::getenv("DFLASH_BANDIT_MAX_SESSIONS");
+ max_sessions_ = (env && *env) ? static_cast<size_t>(std::atol(env)) : 1024;
+ }
+ if (max_sessions_ == 0) max_sessions_ = 1024; // guard against env=0
</file context>
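The validation the reviewer asks for is language-neutral: reject missing, non-numeric, zero, and negative values before the number is used as a cap. The sketch below shows that check in Python, the harness language; it is not the C++ fix. The function name `parse_max_sessions` is hypothetical. In the C++ header the equivalent check would have to run on the signed value before the cast to size_t, since the wraparound happens at the cast.

```python
import os
from typing import Mapping, Optional

DEFAULT_MAX_SESSIONS = 1024  # default quoted in the commit message


def parse_max_sessions(env: Optional[Mapping[str, str]] = None,
                       default: int = DEFAULT_MAX_SESSIONS) -> int:
    """Read DFLASH_BANDIT_MAX_SESSIONS, falling back to default on any bad value."""
    env = os.environ if env is None else env
    raw = env.get("DFLASH_BANDIT_MAX_SESSIONS", "").strip()
    if not raw:
        return default
    try:
        value = int(raw)
    except ValueError:
        return default
    # Zero and negative caps would disable or break eviction, so treat them as unset.
    return value if value > 0 else default
```

Passing the environment in as a mapping keeps the function testable without mutating `os.environ`.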
Merge latest origin/main into the integration worktree, re-enumerate open contributor PRs, and record merge-tree/worktree conflict results. PR Luce-Org#266 received an actual merge attempt; other pending contributor PRs were classified with merge-tree evidence.
Update the integration manifest after merging the latest PR Luce-Org#274 head (adaptive anchor radius and PFLASH_COMPRESS env rename). Record a fresh PR Luce-Org#266 worktree conflict attempt and current blocked classifications.
Merge latest origin/main (d947c70) into the integration stack and record the current PR classification. PR Luce-Org#266 was attempted again in an isolated worktree and remains blocked pending selective harness/server porting; Codex/Claude delegated resolution is unavailable due to auth.
Integrates PR Luce-Org#266 into the auto-integration stack over easel/auto-integration. Resolves server/ layout conflicts by keeping the current server tree, retaining existing harness adapters, and packaging the PR's metrics parser/session proxy/tests under harness/src/harness for uv workspace imports.

Verification:
- python3 -m py_compile harness/client_test_runner.py harness/clients/session_inject_proxy.py harness/src/harness/metrics_parser.py harness/src/harness/tests/*.py
- uv run --extra dev --package harness pytest harness/src/harness/tests -q (56 passed)
- git diff --check
- conflict marker scan (no conflict markers)
- C++ target build skipped: server/build missing in worktree
What this changes

Refactor harness/clients/ from bash launchers into typed Python adapters with proper preflight, plus the multi-turn bandit-session driver that produced the adaptive trajectory in PR #264.

Stacked on feat/pflash-mvp-adaptive-keep (#264); auto-rebases onto main when #264 merges.

The diff in one screen

- harness/client_test_runner.py: ClientAdapter protocol + 5 adapters (claude_code, hermes, opencode, codex, pi) + bandit-session subcommand
- harness/metrics_parser.py: BanditRunMetrics, parses [pflash-bandit] JSON and [spec-decode] text
- harness/clients/session_inject_proxy.py: INJECT_ROUTES (/v1/messages, /v1/chat/completions, /v1/responses)
- harness/clients/common.sh: preflight_require_bin helper, LUCEBOX_SERVER_BACKEND=cpp default
- harness/tests/
- run_*.sh: deleted

Why now (bug classes eliminated)

- codex/pi asdf shim failures hidden by no readiness probe -> PREFLIGHT FAIL: ... try 'asdf reshim node'
- hermes bandit bypass (proxy only injected /v1/messages) -> INJECT_ROUTES covers all C++ server POST routes; [pflash-bandit] confirmed firing on hermes
- opencode UnknownError -> PROVIDER_CONFIG_BUG reason (registration is user-side)
- accept_rate=N/A regex scrape -> metrics_parser reads structured log lines

Multi-turn evidence driver

bandit-session runs N prompts through the same session_id and emits one CSV row per turn (keep_before, accept_rate, keep_after, ema, wall_s). The 5-turn run in PR #264's headline claim was produced by this driver.

Testing

56 unit/integration tests (was 22). Live-tested on claude_code + hermes against C++ dflash_server; bandit log lines confirmed firing on both.

CI

Same fork-PR submodule auth gap as #264 (fix in parent's 8d5cc04). Needs SUBMODULE_PAT repo secret.

Out of scope

Python server (deprecated), GUI launchers, llamacpp_compat_proxy.py, opencode/codex/pi live runs (preflight surfaces actionable errors; live execution is user-side env work).

Open questions for reviewers

- harness/tests/ location: keep here, or move under harness/clients/tests/?
- bandit-session --turns N uses a fixed 5-prompt list: parameterize as --prompts a.txt,b.txt,...?
- OpenCodeAdapter honest-skip: keep as documented gap, or block merge until opencode provider registration is sorted?