feat(harness): typed adapters + format-aware session-inject proxy + multi-turn bandit driver #266

Open
dusterbloom wants to merge 39 commits into Luce-Org:main from dusterbloom:feat/harness-typed-adapters

dusterbloom (Collaborator) commented May 23, 2026

What this changes

Refactor harness/clients/ from bash launchers into typed Python adapters with proper preflight, plus the multi-turn bandit-session driver that produced the adaptive trajectory in PR #264.

Stacked on feat/pflash-mvp-adaptive-keep (#264) — auto-rebases onto main when #264 merges.

The diff in one screen

| File | What |
| --- | --- |
| harness/client_test_runner.py | ClientAdapter protocol + 5 adapters (claude_code, hermes, opencode, codex, pi) + bandit-session subcommand |
| harness/metrics_parser.py | Typed BanditRunMetrics, parses [pflash-bandit] JSON and [spec-decode] text |
| harness/clients/session_inject_proxy.py | Format-aware INJECT_ROUTES (/v1/messages, /v1/chat/completions, /v1/responses) |
| harness/clients/common.sh | preflight_require_bin helper, LUCEBOX_SERVER_BACKEND=cpp default |
| harness/tests/ | Stub-server fixture + 56 unit/integration tests (was 22 pre-refactor) |
| 4 run_*.sh deleted | Replaced by typed adapters |

Why now (bug classes eliminated)

| Bug surfaced in #264 verification | Fix in this PR |
| --- | --- |
| codex / pi asdf shim failures hidden by no readiness probe | Env-isolated probe → PREFLIGHT FAIL: ... try 'asdf reshim node' |
| hermes bandit bypass: proxy only injected /v1/messages | INJECT_ROUTES covers all C++ server POST routes; [pflash-bandit] confirmed firing on hermes |
| opencode UnknownError | Honest skip with PROVIDER_CONFIG_BUG reason (registration is user-side) |
| Day-4 accept_rate=N/A regex scrape | metrics_parser reads structured log lines |

Multi-turn evidence driver

bandit-session runs N prompts through the same session_id and emits one CSV row per turn (keep_before, accept_rate, keep_after, ema, wall_s). The 5-turn run in PR #264's headline claim was produced by this driver.
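
A minimal sketch of what "one CSV row per turn" could look like. The five column names are taken from the description above; the `TurnRow` dataclass, the `write_turns` helper, and the sample values are illustrative assumptions, not the driver's actual code or data.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class TurnRow:
    # Column names come from the PR description; the class itself is hypothetical.
    keep_before: float
    accept_rate: float
    keep_after: float
    ema: float
    wall_s: float


def write_turns(rows, out):
    """Write a header plus one CSV row per turn to a file-like object."""
    writer = csv.DictWriter(out, fieldnames=[f.name for f in fields(TurnRow)])
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))


# Made-up values, for illustration only.
buf = io.StringIO()
write_turns(
    [TurnRow(0.10, 0.30, 0.11, 0.30, 12.5), TurnRow(0.11, 0.28, 0.12, 0.29, 11.0)],
    buf,
)
csv_text = buf.getvalue()
```

Deriving the header from the dataclass fields keeps the column order and the row contents from drifting apart when a column is added.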

Testing

56 unit/integration tests (was 22). Live-tested on claude_code + hermes against C++ dflash_server; bandit log lines confirmed firing on both.

CI

Same fork-PR submodule auth gap as #264 (fix in parent's 8d5cc04). Needs SUBMODULE_PAT repo secret.

Out of scope

Python server (deprecated), GUI launchers, llamacpp_compat_proxy.py, opencode/codex/pi live runs (preflight surfaces actionable errors; live execution is user-side env work).

Open questions for reviewers

  1. harness/tests/ location — keep here, or move under harness/clients/tests/?
  2. bandit-session --turns N uses a fixed 5-prompt list — parameterize as --prompts a.txt,b.txt,...?
  3. OpenCodeAdapter honest-skip — keep as documented gap, or block merge until opencode provider registration is sorted?

- Single PR target: ~220 LOC, no kernel touches, no new compression mechanism
- Foundations cited by commit hash from the evidence branch (NIAH envelope, DFlash composition, Codex design doc)
- Known limits explicitly documented (MTP crash, 64K NIAH cliff is a synthetic-class problem not agentic)
- Day-by-day breakdown with per-day exit gates and bail conditions
- Drift discipline: this PR rejects scope creep; everything else is follow-up
…bandit MVP)

- Add float accept_rate = 0.0f to GenerateResult struct (model_backend.h)
- Thread out_accept_rate through do_spec_decode signature; populate from n_accept_sum/total_draft_pos after spec-decode loop
- AR fallback and no-draft paths leave accept_rate = 0.0 (correct sentinel)
- Expose accept_rate in usage block of all three response formats (OPENAI_CHAT, ANTHROPIC, RESPONSES)
- 6 new unit tests in test_server_unit.cpp; 154 assertions, 0 failures; ctest 1/1 PASSED
- MTP path (line 1225 per original plan) does not exist at current HEAD — no stub needed; DFlash chain is the only spec-decode path in qwen35_backend.cpp
- AdaptiveKeepRatioState struct + step_adaptive_keep_ratio() pure fn
- EMA-smoothed accept_rate signal, step 0.005/0.01, clamped [0.025, 0.20]
- HttpServerSessions thread-safe per-session container
- 11 unit tests (19 assertions) all GREEN, CPU-only
- CMakeLists: adaptive_keep ctest target registered
- ParsedRequest gains session_id field (parsed from extra_body or top-level)
- HttpServer gains HttpServerSessions sessions_ member
- pre-compress: keep_ratio from sessions_.get_keep_ratio() when session_id set
- post-generate: sessions_.update() + [pflash-bandit] log line per turn
- test_bandit_integration: 6 tests, 16 assertions, all GREEN (189 total)
- 3 conditions: A fixed keep=0.05, B fixed keep=0.20, C bandit initial=0.10
- All 3: OK_DONE=YES (no regression); PFlash BF16 drafter confirmed working
- Bandit fired: session=claude_code_s1 turn=1 keep=0.1000->0.1100 (accept=0.347)
- A: 5.7% effective keep, 26.4% accept; B: 19.6% keep, 17.9% accept; C: 34.7% accept
- Note: C used short 62-tok prompt via curl; like-vs-like follow-up queued for Day 5
- session_inject_proxy.py: thin HTTP proxy (~110 LOC) that intercepts
  POST /v1/messages and injects extra_body.session_id before forwarding
  to dflash server; handles JSON and SSE streaming
- run_claude_code.sh: start proxy on PFLASH_PROXY_PORT (default 18082)
  when PFLASH_SESSION_ID is set; point claude CLI at proxy; kill on exit
All 3 conditions: same claude CLI harness, same 11K decode_check prompt
- A fixed keep=0.05: wall=17s accept=31.7% OK_DONE=YES
- B fixed keep=0.20: wall=19s accept=25.4% OK_DONE=YES
- C bandit keep=0.10: wall=16s accept=31.9% OK_DONE=YES
Bandit fired: session=claude_code_day5_s1 turn=1 keep=0.1000->0.1100
Bandit Pareto-dominates B on wall (-3s) and accept_rate (+6.5pp); ties A
…w project convention)

- namespace dflash { → namespace dflash::common { in adaptive_keep_ratio.h
- adds get_ema() accessor (used by Blocker 2 log fix)
- drops dflash:: qualifier from kBanditEmaAlpha refs in http_server.cpp (now in-namespace)
- http_server.h: drops now-redundant dflash:: qualifier on HttpServerSessions field
- tests: using namespace dflash → using namespace dflash::common
- removes the algebraically-trivial alpha*x+(1-alpha)*x stub and (void)ema_val
- calls sessions_.get_ema() after update() to log the actual per-session EMA
- log line now matches PLAN.md:60 shape: keep=<old>-><new> ema=<ema> accept=<observed>
- adds get_ema_reflects_post_update_value test to test_adaptive_keep_ratio.cpp
…accept must signal too)

- adds spec_decode_ran bool to GenerateResult (model_backend.h)
- do_spec_decode sets out_spec_ran=true on spec path, false on AR fallback
- both generate() and restore_and_generate() propagate result.spec_decode_ran
- http_server.cpp guard: accept_rate>0 → spec_decode_ran
- test_bandit_integration: zero_accept_rate_guard → zero_accept_drives_keep_up
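
The commit notes above describe an EMA-smoothed accept_rate signal, steps of 0.005/0.01, a clamp to [0.025, 0.20], and an update that is gated on spec_decode_ran rather than accept_rate>0. The Python sketch below only illustrates that shape; the real code is C++ and is not shown here. The EMA alpha, the target threshold, and the mapping of the two step sizes to directions are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional

KEEP_MIN, KEEP_MAX = 0.025, 0.20  # clamp range from the commit notes
STEP_UP, STEP_DOWN = 0.01, 0.005  # sizes from the notes; direction mapping is assumed
EMA_ALPHA = 0.3                   # hypothetical; the real kBanditEmaAlpha is not shown
TARGET_ACCEPT = 0.5               # hypothetical threshold


@dataclass(frozen=True)
class KeepState:
    keep_ratio: float = 0.10
    ema: Optional[float] = None   # None until the first spec-decode turn
    turns: int = 0


def step_keep_ratio(state, accept_rate, spec_decode_ran):
    """Pure step: return the next state without mutating the input."""
    if not spec_decode_ran:
        # AR fallback: accept_rate is only a sentinel, so leave the state alone.
        return state
    if state.ema is None:
        ema = accept_rate
    else:
        ema = EMA_ALPHA * accept_rate + (1.0 - EMA_ALPHA) * state.ema
    # Assumed rule: low acceptance keeps more context, high acceptance keeps less.
    delta = STEP_UP if ema < TARGET_ACCEPT else -STEP_DOWN
    keep = min(KEEP_MAX, max(KEEP_MIN, state.keep_ratio + delta))
    return KeepState(keep_ratio=keep, ema=ema, turns=state.turns + 1)
```

Under these assumptions a turn where spec-decode ran with zero acceptance still moves keep_ratio up, while an AR-fallback turn reporting the same 0.0 leaves the state untouched, which is the distinction the spec_decode_ran flag exists to make.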
…roject convention

- drop claude_home/ and *.claude.json.backup.* trees from dflash/bench/results/2026-05-2*/ — kept only metrics.txt + client.out per condition
- ignore dflash/bench/results/ going forward (new runs won't drag claude_home into git)
- move PLAN.md → thoughts/2026-05-21_pflash_mvp_plan.md (project convention; prior plans live there)
- delete dflash/bench/run_day4_ab.sh — superseded by run_day5_abc.sh per its own header
Replace `submodules: recursive` in actions/checkout with an explicit
git-submodule-update step that injects a PAT via insteadOf when
secrets.SUBMODULE_PAT is set. The GITHUB_TOKEN issued for fork PRs
does not have cross-repo access to private org repos (Luce-Org/llama.cpp-dflash-ggml),
causing intermittent "could not read Username" failures. With a PAT the
auth is explicit and stable.

Requires: add secret SUBMODULE_PAT (classic PAT, repo scope on Luce-Org)
to Luce-Org/lucebox-hub repo settings -> Secrets -> Actions.
- NIAH 16K: 5/5 baseline (keep=0.20) and 5/5 bandit (keep=0.10); no retrieval degradation
- NIAH 32K: 5/5 baseline and 5/5 bandit; compression 5x->10x halves target prefill time
- 3-seed Day-5 A/B/C: decode_check / logic_check / math_check prompts, all ok_done=YES
- Pareto: C (bandit) wall=16.3±3.4s vs B wall=24.7±3.1s (1.52x); ar=34.6% vs 32.8%
- Bandit fired in all 3 sessions; per-session state isolation confirmed
… (seeds #1, #2)

- StubServer: ThreadingHTTPServer recorder, zero new deps (mirrors llamacpp_compat_proxy.py pattern)
- Seed #2 green: proxy injects session_id on /v1/messages, preserves existing, passes through GET
- Seed #1 documented: chat/completions round-trip passes; injection assertion commented out pending commit 3
…#1)

- Add INJECT_ROUTES frozenset: /v1/messages, /v1/chat/completions, /v1/responses
- do_POST checks route_base in INJECT_ROUTES (query-string-safe)
- Seed #1 green: chat/completions round-trip injects session_id
- Add /v1/responses injection test (codex route)
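
A minimal sketch of the route check and injection described in these notes. The three routes and the "preserve existing session_id" behaviour come from the commit messages; the function name `maybe_inject_session_id` and the choice to forward non-JSON bodies untouched are assumptions for illustration, not the proxy's actual code.

```python
import json
from urllib.parse import urlsplit

INJECT_ROUTES = frozenset(
    {"/v1/messages", "/v1/chat/completions", "/v1/responses"}
)


def maybe_inject_session_id(path, body, session_id):
    """Return the request body, adding extra_body.session_id when applicable."""
    route_base = urlsplit(path).path  # drop any ?query so the match is query-string-safe
    if route_base not in INJECT_ROUTES:
        return body
    try:
        payload = json.loads(body)
    except ValueError:
        return body  # not JSON: forward untouched (assumed behaviour)
    if not isinstance(payload, dict):
        return body
    extra = payload.setdefault("extra_body", {})
    if not isinstance(extra, dict):
        return body
    extra.setdefault("session_id", session_id)  # keep a caller-supplied id
    return json.dumps(payload).encode("utf-8")
```

Matching on the parsed path rather than the raw request line means `/v1/chat/completions?stream=true` is still treated as an injectable route.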
…-Org#3)

- Add preflight_require_bin(): exit 78 + actionable asdf hint when binary missing
- Flip LUCEBOX_SERVER_BACKEND default: python → cpp (plan requirement)
- 4 tests green: missing binary exits 78 with asdf hint; present binary exits 0
- run_codex, run_pi, run_opencode, run_hermes: call preflight before start_lucebox_server
- run_claude_code: add preflight for claude binary + export LUCEBOX_SERVER_BACKEND=cpp
- bash -n clean on all 5 scripts
… (seed #5)

- BanditRunMetrics dataclass: accept_rate/wall_s/tokens all Optional
- parse_bandit_log_line(): None for absent fields, not "N/A" strings
- 6 tests green; Day-4-v2 missing accept_rate fixture passes without N/A leak
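
A rough sketch of the "None for absent fields" idea, parsing the plain-text log shape quoted elsewhere in this PR (`keep=<old>-><new> ema=<ema> accept=<observed>`). `BanditLine` and `parse_bandit_line` are hypothetical stand-ins for illustration; they are not the PR's BanditRunMetrics or parse_bandit_log_line.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class BanditLine:
    # Hypothetical record; every field is Optional so gaps stay visible as None.
    session: Optional[str] = None
    turn: Optional[int] = None
    keep_before: Optional[float] = None
    keep_after: Optional[float] = None
    ema: Optional[float] = None
    accept_rate: Optional[float] = None


_NUM = r"([0-9]*\.?[0-9]+)"


def _find(pattern, line, cast):
    match = re.search(pattern, line)
    return cast(match.group(1)) if match else None


def parse_bandit_line(line):
    """Parse one [pflash-bandit] line; return None if the tag is missing."""
    if "[pflash-bandit]" not in line:
        return None
    keep = re.search(r"keep=" + _NUM + "->" + _NUM, line)
    return BanditLine(
        session=_find(r"session=(\S+)", line, str),
        turn=_find(r"turn=(\d+)", line, int),
        keep_before=float(keep.group(1)) if keep else None,
        keep_after=float(keep.group(2)) if keep else None,
        ema=_find(r"ema=" + _NUM, line, float),
        accept_rate=_find(r"accept=" + _NUM, line, float),
    )
```

Because a non-numeric value such as `accept=N/A` simply fails the numeric pattern, it surfaces as `None` instead of leaking a string into downstream arithmetic.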
…mand (seeds #4, #6)

- _BaseAdapter: preflight_check (shutil.which) + dry_run returning AdapterResult
- ClaudeCodeAdapter, HermesAdapter, CodexAdapter, PiAdapter, OpenCodeAdapter
- run_bandit(): preflight → dry/live run → CSV writer (6 columns per exit-gate spec)
- bandit subcommand + top-level --condition/--clients shorthand preserved
- Seed #4 green: dry_run returns AdapterResult with session_id
- Seed #6 green: 5-adapter dry-run emits 5-row CSV with required columns
- --adapter <name> as single-client alias for --clients (exit-gate for commit 7)
- --clients/--condition can be top-level flags (no subcommand required)
- cmd_bandit handles both --adapter and --clients, default condition C_bandit
- 2 CLI subprocess tests added
…ADME

- _BaseAdapter.live_run(): subprocess into run_<client>.sh with PFLASH_SESSION_ID
- Each concrete adapter overrides live_run() with the right script path
- run_bandit() live mode calls adapter.live_run() instead of dry_run stub
- Delete run_codex.sh, run_hermes.sh, run_opencode.sh, run_pi.sh (ported to Python)
- README: headless bandit invocation + single-client bash section
- cpp backend (default): resolves dflash_server binary via DFLASH_SERVER_BIN or dflash/build/dflash_server
- python backend (opt-in): uses dflash/scripts/server.py as before
- RuntimeError with actionable message when cpp binary missing
…oken asdf shims

- base preflight_check probes with --version, checks exit code + stderr for asdf shim markers
- CodexAdapter/PiAdapter override with --help (codex/pi don't support --version)
- fail closed on timeout; emit actionable message naming the reshim command
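
A minimal sketch of a readiness probe along the lines these notes describe: run the binary with a cheap flag, check the exit code and stderr, and fail closed on timeout. The function name `preflight_probe`, the `SHIM_MARKERS` substrings, and the message wording are assumptions; the strings the PR actually matches are not shown here.

```python
import subprocess
import sys

# Assumed marker substrings for a broken asdf shim (illustrative only).
SHIM_MARKERS = ("asdf", "No version is set")


def preflight_probe(argv, timeout_s=10.0, env=None):
    """Run a cheap readiness probe; return (ok, reason)."""
    name = argv[0]
    try:
        proc = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout_s, env=env
        )
    except OSError as exc:
        return False, f"PREFLIGHT FAIL: {name} could not be executed ({exc})"
    except subprocess.TimeoutExpired:
        # Fail closed: a probe that hangs is treated as not ready.
        return False, f"PREFLIGHT FAIL: {name} timed out after {timeout_s}s"
    if any(marker in proc.stderr for marker in SHIM_MARKERS):
        return False, f"PREFLIGHT FAIL: {name} looks like a broken shim; try 'asdf reshim node'"
    if proc.returncode != 0:
        return False, f"PREFLIGHT FAIL: {name} exited with {proc.returncode}"
    return True, "ok"


ok, reason = preflight_probe([sys.executable, "--version"])
```

An adapter whose CLI lacks `--version` would pass a different argv (the notes mention `--help` for codex and pi) without changing the probe itself.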
…to deleted bash scripts)

- CodexAdapter: writes temp config.toml, invokes codex exec directly
- PiAdapter: writes temp models.json + settings.json, invokes pi directly
- HermesAdapter: drives hermes chat --provider lucebox via /v1/chat/completions
- OpenCodeAdapter: writes temp opencode.json, invokes opencode run in project dir
…attern

- README: run_codex/hermes/opencode/pi.sh refs → python3 -m harness.client_test_runner bandit --clients <name>
- run_backend_pair.sh: codex/pi/hermes/opencode case arms invoke Python runner; bash path kept for claude_code/openclaw/openwebui*
- CLIENT_SCRIPT="" sentinel routes python-adapter clients through new branch in run_backend()
…own flags

dflash_server (C++) requires the target model as argv[1] and rejects
unknown options with exit 2. Two compounding bugs killed the harness-managed
server, leaving server.log empty and accept_rate blank in the bandit CSV:

1. start_server (cpp branch) passed the target via --target — no such flag
   in dflash/src/server/server_main.cpp; argv[1] starting with '-' triggers
   the usage banner at server_main.cpp:158-160.
2. BANDIT_SERVER_PROFILE carried four Python-server-only flags (--budget,
   --verify-mode, --prefix-cache-slots, --prefill-cache-slots) the C++
   parser rejects via server_main.cpp:295-298.

With those gone the server stays up and writes [pflash]/[spec-decode]
lines that run_bandit + metrics_parser already wire into AdapterResult.

Regression tests:
- TestRunBanditWiresAcceptRate exercises run_bandit directly (previous
  tests only re-implemented the wiring inline).
- TestBanditServerProfileHasPflash::test_bandit_server_profile_only_cpp_recognised_flags
  guards against future stale-flag drift.
…rmes/opencode preflights

- ClaudeCodeAdapter.live_run now calls `claude --print` directly via subprocess
  with ANTHROPIC_BASE_URL/ANTHROPIC_API_KEY/CLAUDE_CODE_API_BASE_URL env vars;
  no second server spawn, generation-heavy 700-word prompt ensures bandit cycles
- HermesAdapter/OpenCodeAdapter preflight_check return False with honest reasons
  (HERMES_CONFIG_BUG / PROVIDER_CONFIG_BUG) instead of binary-check false-positives
- BANDIT_SERVER_PROFILE and PFLASH/BANDIT server profiles: remove unsupported --lazy-draft flag
- metrics_parser.extract_accept_rate_from_log: parses plain-text [pflash-bandit] accept=... lines
- 47 tests green (+9 new regression tests for all of the above)
- Add bandit-session subcommand: starts server once, runs N turns of
  claude_code with same session_id, captures per-turn keep_ratio trajectory
- Add BanditTurnRecord dataclass + parse_bandit_session_from_log to
  metrics_parser: parses [pflash-bandit] keep=A->B lines per turn
- Add 4 prompt files (logic_check, math_check, code_gen, explain_algo)
  for generation-heavy multi-turn runs
- Write results to /tmp/harness_adaptive_evidence.csv and
  dflash/bench/results/YYYY-MM-DD_adaptive_evidence/
- Sanity check: warns if keep_after is stuck across all turns
- +9 tests (56 total, all green)
- Live run: 5 turns, keep_after 0.1100→0.1200→0.1300→0.1400→0.1500
- HermesAdapter.preflight_check: real binary check replaces hard-coded
  HERMES_CONFIG_BUG skip; passes when hermes binary is present and
  --version exits 0
- HermesAdapter.live_run: write temp HERMES_HOME/config.yaml with
  correct base_url + context_length overrides (model and
  auxiliary.compression) so hermes 0.14 doesn't reject the 32K server context
- Start session-inject proxy before hermes so [pflash-bandit] lines
  fire in server.log (same pattern as ClaudeCodeAdapter)
- _start_session_inject_proxy: default to free_port() instead of
  hardcoded 18082 to avoid collisions when server runs on that port
- Verified: [pflash-bandit] session=hermes-bandit-test-002 turn=1
  keep=0.1000->0.1100 ema=0.123 accept=0.123
- .notes/harness-followups.md -> thoughts/2026-05-23_harness_followups.md
- removes .notes/ dir (now empty)
- aligns with project convention: thoughts/ for dated notes
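
The free_port() default mentioned in the proxy notes above is commonly implemented by binding to port 0 and reading back what the OS assigned. The sketch below shows that standard technique; it is an assumption about how such a helper might look, not the harness's own implementation.

```python
import socket


def free_port(host="127.0.0.1"):
    """Ask the OS for a currently unused TCP port and return its number."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 lets the kernel choose
        return sock.getsockname()[1]


port = free_port()
```

The port is released when the socket closes, so there is a small window in which another process could claim it before the proxy binds; for a local test harness that race is usually acceptable.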

cubic-dev-ai (bot, Contributor) left a comment

1 issue found across 73 files

Comment thread dflash/src/server/adaptive_keep_ratio.h
…ted HOME

- Add _resolve_nvm_bin() helper: tries nvm node versions in preference order
  (v24.13.0, v22.17.0, v20.18.0) bypassing asdf shims — same heuristic as commit 2600108
- codex/pi preflight_env: use real HOME (asdf shims need it to resolve node)
- opencode: new preflight_env + preflight_check (was permanent FAIL stub)
- codex/pi/opencode live_run: use _resolve_nvm_bin fallback + prepend nvm node
  bin dir to PATH so node resolves when HOME is overridden to temp dir
- opencode live_run: write config to XDG_CONFIG_HOME/opencode/opencode.json
  (global config location, not project-level opencode.json in project dir)
…max_ctx 65K->49K

- bandit-session now starts one session-inject proxy for the whole session so all
  turns share the same session_id; enables prefix-cache warmup across turns (turn 2+
  should show delta-token prefill instead of full-context prefill)
- BANDIT_SERVER_PROFILE: max_ctx 32768->49152, keep-ratio 0.10->0.05, add
  --prefill-skip-park (eliminates park/unpark overhead on 24 GB GPUs)
When BANDIT_SERVER_PROFILE (needs_prefill_drafter=True) is used, the cpp
backend was also passing --draft to dflash_server, triggering an arch check
that rejects plain qwen3 models. Only pflash-aware dflash-draft arch models
pass this check, but the bandit profile only needs --prefill-drafter.

Fix: skip --draft in the cpp start_server args when the profile already
handles the drafter via needs_prefill_drafter.
sessions_ map grew unbounded when clients sent unique session_ids.
Replace flat unordered_map with an LRU structure capped at
DFLASH_BANDIT_MAX_SESSIONS (default 1024). On overflow the
least-recently-used session is evicted; get_* calls count as touches.

Two new unit tests: lru_cap_evicts_oldest, lru_touch_updates_eviction_order.
All 29 tests pass.
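
To illustrate the eviction behaviour described in this commit, here is a small Python analogue of an LRU-capped session map in which reads count as touches. The real structure is C++ and is not shown here; the class name `LruSessions`, its method set, and the omission of locking are simplifications made for the example.

```python
from collections import OrderedDict


class LruSessions:
    """Per-session keep_ratio store capped at max_sessions entries."""

    def __init__(self, max_sessions=1024, initial_keep=0.10):
        if max_sessions <= 0:
            raise ValueError("max_sessions must be positive")
        self._max = max_sessions
        self._initial = initial_keep
        self._items = OrderedDict()

    def get_keep_ratio(self, session_id):
        if session_id not in self._items:
            return self._initial  # unknown sessions are not inserted on read
        self._items.move_to_end(session_id)  # a read counts as a touch
        return self._items[session_id]

    def update(self, session_id, keep_ratio):
        self._items[session_id] = keep_ratio
        self._items.move_to_end(session_id)
        while len(self._items) > self._max:
            self._items.popitem(last=False)  # evict the least recently used

    def __contains__(self, session_id):
        return session_id in self._items
```

With a cap of 2, updating "a" and "b", reading "a", then updating "c" evicts "b", because the read moved "a" to the most-recently-used end.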

dusterbloom (Collaborator, Author) left a comment

Addressed in 17525ea: replaced the flat unordered_map with an LRU structure capped at DFLASH_BANDIT_MAX_SESSIONS (default 1024). On overflow the least-recently-used session is evicted; reads (get_keep_ratio, get_ema, turn_count) count as touches. Two new unit tests added: lru_cap_evicts_oldest and lru_touch_updates_eviction_order — all 29 pass.


cubic-dev-ai (bot, Contributor) left a comment

1 issue found across 2 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="dflash/src/server/adaptive_keep_ratio.h">

<violation number="1" location="dflash/src/server/adaptive_keep_ratio.h:61">
P1: DFLASH_BANDIT_MAX_SESSIONS parsed with std::atol without negative-value validation; negative env values wrap to SIZE_MAX, silently disabling LRU eviction</violation>
</file>

max_sessions_ = max_sessions;
} else {
const char* env = std::getenv("DFLASH_BANDIT_MAX_SESSIONS");
max_sessions_ = (env && *env) ? static_cast<size_t>(std::atol(env)) : 1024;

P1: DFLASH_BANDIT_MAX_SESSIONS parsed with std::atol without negative-value validation; negative env values wrap to SIZE_MAX, silently disabling LRU eviction

<file context>
@@ -44,40 +46,93 @@ inline AdaptiveKeepRatioState step_adaptive_keep_ratio(
+            max_sessions_ = max_sessions;
+        } else {
+            const char* env = std::getenv("DFLASH_BANDIT_MAX_SESSIONS");
+            max_sessions_ = (env && *env) ? static_cast<size_t>(std::atol(env)) : 1024;
+        }
+        if (max_sessions_ == 0) max_sessions_ = 1024;  // guard against env=0
</file context>
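
The arithmetic behind this finding can be modelled in a few lines. The sketch below assumes a 64-bit size_t and uses Python to show why a negative value slips past the `== 0` guard, followed by one possible validated parse. `as_size_t` and `parse_max_sessions` are illustrative names, not code from the PR, and `int()` is stricter than `std::atol` about trailing characters.

```python
import os

DEFAULT_MAX_SESSIONS = 1024
SIZE_MAX_64 = 2**64 - 1  # assumes a 64-bit size_t


def as_size_t(value):
    """Model a C++ static_cast<size_t> of a signed integer (64-bit assumed)."""
    return value & SIZE_MAX_64


def parse_max_sessions(raw):
    """Return a positive cap, falling back to the default on bad input."""
    try:
        value = int(raw) if raw else 0
    except ValueError:
        value = 0
    return value if value > 0 else DEFAULT_MAX_SESSIONS


# -1 becomes SIZE_MAX after the cast, so a "== 0" guard never fires.
wrapped = as_size_t(-1)
cap = parse_max_sessions(os.environ.get("DFLASH_BANDIT_MAX_SESSIONS"))
```

Validating the signed value before converting it, rather than checking the converted result, is what closes the gap: zero, negative, and unparsable inputs all fall back to the default.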

easel pushed a commit to easel/lucebox-hub that referenced this pull request May 27, 2026
Merge latest origin/main into the integration worktree, re-enumerate open contributor PRs, and record merge-tree/worktree conflict results. PR Luce-Org#266 received an actual merge attempt; other pending contributor PRs were classified with merge-tree evidence.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 27, 2026
Update the integration manifest after merging the latest PR Luce-Org#274 head (adaptive anchor radius and PFLASH_COMPRESS env rename). Record a fresh PR Luce-Org#266 worktree conflict attempt and current blocked classifications.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 27, 2026
Merge latest origin/main (d947c70) into the integration stack and record the current PR classification. PR Luce-Org#266 was attempted again in an isolated worktree and remains blocked pending selective harness/server porting; Codex/Claude delegated resolution is unavailable due to auth.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 27, 2026
Integrates PR Luce-Org#266 into the auto-integration stack over easel/auto-integration. Resolves server/ layout conflicts by keeping the current server tree, retaining existing harness adapters, and packaging the PR's metrics parser/session proxy/tests under harness/src/harness for uv workspace imports.

Verification:
- python3 -m py_compile harness/client_test_runner.py harness/clients/session_inject_proxy.py harness/src/harness/metrics_parser.py harness/src/harness/tests/*.py
- uv run --extra dev --package harness pytest harness/src/harness/tests -q (56 passed)
- git diff --check
- conflict marker scan (no conflict markers)
- C++ target build skipped: server/build missing in worktree