feat(lucebox): docker stack + CLI + bench/profile + harness + luce-bench in-tree by easel · Pull Request #285 · Luce-Org/lucebox-hub

easel · 2026-05-27T17:59:21Z

This PR turns Lucebox into a one-command local inference deployment and ships the two tools that operate it: lucebox (the host CLI that runs and tunes the server) and luce-bench (the benchmark + grading framework that measures it). All three ship together so a fresh box goes from nothing to a tuned, benchmarked server with a single install.

The three pieces, what each is, and how to use it:

1. Docker — the server image

A CUDA 12.8 image (ghcr.io/luce-org/lucebox-hub:cuda12) that builds the dflash server and bundles server/, lucebox/, harness/, and luce-bench/. The entrypoint dispatches serve (default), benchmark, any lucebox subcommand, or shell. An in-container autotune fallback picks VRAM-tiered defaults and resolves the draft GGUF by target architecture (gemma4 → gemma drafter, qwen3.6 → dflash-draft-3.6).

Use it directly:

docker run --rm --gpus all -p 8080:8080 \
  -v ~/.local/share/lucebox/models:/opt/lucebox-hub/server/models \
  ghcr.io/luce-org/lucebox-hub:cuda12
# OpenAI + Anthropic-compatible API on :8080
curl -s http://localhost:8080/v1/models

Image tags: :cuda12, :vX.Y.Z-cuda12, :X.Y-cuda12, :sha-<short>-cuda12. Built and pushed by .github/workflows/docker.yml; docker-bake.hcl has a cuda13 slot ready.

2. `lucebox` — the host CLI

lucebox.sh is the host-side wrapper (deps: docker + nvidia-smi only). It probes the host, writes a tuned config.toml, runs the container as a user-systemd service, and delegates provisioning/workloads to the in-container Python CLI (models, autotune, profile, smoke, config, the client drivers).

Stand a server up:

lucebox check            # driver / docker / NVIDIA Container Toolkit / VRAM / systemd / WSL2 probe
lucebox pull             # docker pull the cuda12 image
lucebox models download  # pull target + DFlash draft GGUFs  (verbs: list, download)
lucebox autotune         # VRAM-tiered DFLASH_* defaults → ~/.lucebox/config.toml  (autotune --sweep picks a winner empirically)
lucebox install          # install the user-systemd unit
lucebox start            # bring it up   (enable = start at every login)
lucebox status           # unit state + the server's startup banner
lucebox logs             # follow the journal
lucebox smoke            # props/tools/http/1-token health check

Tune it to the GPU:

lucebox profile          # level1/2/3 sweep over DFLASH_MAX_CTX × DFLASH_BUDGET ×
                         # {KV type, pFlash mode, lazy-draft, prefix-cache slots},
                         # gated on capability + ds4-eval/agentic validation before
                         # the winner merges into config.toml

The running config is observable at GET /props (schema 4), which now reports a host block — kernel, OS, WSL vs native, driver, CPU, RAM, GPU — so a server self-describes its real config and host.

3. `luce-bench` — the benchmark + grading framework

In-tree workspace member (luce-bench/, 0.2.7.dev0) that scores any OpenAI/Anthropic-compatible endpoint and writes versioned, comparable result files. Areas: smoke, ds4-eval (92 reasoning items), gsm8k, truthfulqa-mc1, hellaswag, code, longctx, agent, agent_recorded, forge. Every result stamps a per-area grader_version and a host block (from /props.host, or a clearly-marked client-side fallback for servers without /props).

Run it:

uvx --from 'git+https://github.com/easel/lucebox-hub@feat/lucebox-docker#subdirectory=luce-bench' \
  luce-bench --base-url http://localhost:8080 --model dflash --areas all --no-think

Thinking control is portable. Each request carries three control shapes (chat_template_kwargs.enable_thinking, Anthropic thinking:{type}, reasoning_effort). For servers that ignore the API flags (e.g. OpenRouter), --prompt-thinking-control {auto,on,off} (default auto) injects the model family's in-band token (/no_think, /think); auto fires only when /props shows no server-side enforcement. A post-run verifier records thinking_control_honored so a nothink run that secretly reasoned is flagged, not silently mislabeled.
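As a rough illustration of the "three control shapes" idea, the sketch below builds one request body that carries all three API-level flags and optionally appends the in-band token. The function name `build_request`, the `inject_in_band` parameter, and the `reasoning_effort` values are assumptions made for this example; the PR does not show the actual client code or which values it sends.

```python
import copy

# In-band tokens named in the PR description; which model families honor
# them is outside the scope of this sketch.
IN_BAND = {True: "/think", False: "/no_think"}


def build_request(model, messages, think, inject_in_band=False):
    """Attach all three API-level thinking controls to one request body."""
    msgs = copy.deepcopy(messages)  # never mutate the caller's messages
    if inject_in_band and msgs and msgs[-1]["role"] == "user":
        # Fallback for servers that ignore the API flags: append the token
        # to the last user turn (assumes plain-string content).
        msgs[-1]["content"] = f'{msgs[-1]["content"]} {IN_BAND[think]}'
    return {
        "model": model,
        "messages": msgs,
        "chat_template_kwargs": {"enable_thinking": think},
        "thinking": {"type": "enabled" if think else "disabled"},
        # Assumed mapping; the values luce-bench really sends are not shown.
        "reasoning_effort": "high" if think else "low",
    }
```

A server that understands any one of the three shapes can act on it and ignore the other two, which is what makes a single request portable across backends.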

Comparing results: runs from one grader version are comparable as written. For older snapshots graded by a different version, luce-bench regrade <dirs> re-scores stored outputs at the current pinned grader and refuses to place mismatched-version (or mismatched-host) runs in the same row. report / snapshot / submit-baseline round out the reporting surface.
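The row-placement rule can be pictured as grouping runs by a (grader version, host) key. This is only a sketch under assumed field names: `grader_version` comes from the description above, while `name`, the shape of `host`, and the function `comparable_rows` are hypothetical.

```python
import json
from collections import defaultdict


def comparable_rows(runs):
    """Group run names so that only same-grader, same-host runs share a row."""
    rows = defaultdict(list)
    for run in runs:
        # The host block is a dict, so serialize it into a stable, hashable key.
        host_key = json.dumps(run["host"], sort_keys=True)
        rows[(run["grader_version"], host_key)].append(run["name"])
    return dict(rows)
```

Runs that land under different keys would need a regrade (for a version mismatch) before they could be compared side by side.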

Also in this PR

harness/ — drives real clients (claude_code, codex, opencode, hermes, pi, openclaw) against a running server; lucebox profile delegates bench runs here.
Model-card sidecars — share/model_cards/{qwen3.6-27b,gemma-4-26b-a4b-it,gemma-4-31b-it,laguna-xs.2}.json + _schema.json, so the server resolves sampler defaults, thinking budgets, and the force-close hint per model.
Workspace — pyproject.toml declares all members (server, lucebox, luce-bench, harness, optimizations/{megakernel,pflash}); [tool.uv.sources] luce-bench = { workspace = true } replaces the prior git-tag pin. release-luce-bench.yml publishes to PyPI on luce-bench-v* tags.
Docs — README quick start + hardware/env reference; server/docs/ benchmark-snapshot spec and experiment write-ups.
Removes the obsolete server/scripts/bench_*.py (their work now lives in luce-bench).

Out of scope / follow-ups

Gemma 4 31B backend wiring beyond what its model card ships (validated empirically @ 24 GB, AR-only).
gemma4 MoE expert split.
Multi-Token Prediction (upstream, draft).

Validation

uv sync clean on the workspace; luce-bench test suite passes.
Full --areas all sweeps run end-to-end against bragi (RTX 5090 Laptop), sindri (RTX 3090 Ti), vidar (M2 Ultra / MLX), and OpenRouter, think and nothink, all on one grader version.
/props.host confirmed populated on lucebox servers (bragi + sindri report WSL2); OpenRouter nothink confirmed honored via client-side /no_think injection.

easel · 2026-05-27T19:53:47Z

Some commands to test this... copied from the readme.

Install the lucebox wrapper:

curl -fsSL https://raw.githubusercontent.com/easel/lucebox-hub/feat/lucebox-docker/lucebox.sh \
       -o ~/.local/bin/lucebox.sh && chmod +x ~/.local/bin/lucebox.sh

Run lucebox using the Docker image:

# Override the container image to the temporary build:
export LUCEBOX_IMAGE=ghcr.io/easel/lucebox-hub

# Check your machine for lucebox compatibility
lucebox check

# Start the lucebox server
lucebox serve

Run benchmarks against a local server:

uvx --refresh --from "git+https://github.com/easel/lucebox-hub@feat/lucebox-docker#subdirectory=luce-bench" lucebench --url http://localhost:1236

Run benchmarks against OpenRouter:

uvx --refresh --from "git+https://github.com/easel/lucebox-hub@feat/lucebox-docker#subdirectory=luce-bench" lucebench --base-url https://openrouter.ai/api --model qwen/qwen3.6-27b --auth-env OPENROUTER_API_KEY

…g-42 tail-capture guard

ee7 truncates drafter forward at layer 7 of 28, scoring only those layers. 9.3× drafter wall at 128K (RTX 3090, Qwen3.6-27B-Q4_K_M target + Qwen2.5-0.5B-BF16 drafter). Anchor-transitive cascade rescues multi-hop on bimodal-density prompts (gated, default OFF). Bug Luce-Org#42 fix: tail-capture view-bounds guard at S%4096 in {1..7}. 5 unit tests included. Bench scripts split to follow-up PR.

…de env vars)

At >=32K context the needle text is more likely to straddle multiple chunks (chunk_size=32), and the fixed anchor_radius=2 window (5 chunks ~160 tokens) loses the back half of the needle digits: the model retrieves '...is 4' but truncates/hallucinates the continuation.

Adaptive scaling based on n_chunks:
  <32K context (<1024 chunks): radius=2, max_anchor_hits=8 (unchanged)
  32-64K (1024-2047 chunks): radius=4, max_anchor_hits=16
  >=64K (>=2048 chunks): radius=8, max_anchor_hits=32

Override via PFLASH_COMPRESS_ANCHOR_RADIUS / PFLASH_COMPRESS_MAX_ANCHOR_HITS env vars (legacy DFLASH_COMPRESS_* names still accepted).

Validated at 49K context: NIAH needle 'kowefada 1596346' correctly retrieved (was: '1594' or hallucinated 'is 048394839483' before fix). Resolves the long-standing 'project_64k_quality_cliff' memory entry.
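The tiering described in this commit message can be sketched in a few lines. The thresholds and environment variable names are taken from the message; the function names, the exact legacy spellings (assumed to be the same suffix under DFLASH_COMPRESS_), and the `environ` parameter are assumptions for illustration, and the real logic lives in the C++ server.

```python
import os

CHUNK_SIZE = 32  # tokens per chunk, as stated in the commit message


def _env_int(names, environ):
    """Return the first set variable among `names` as an int, else None."""
    for name in names:
        raw = environ.get(name)
        if raw:
            return int(raw)
    return None


def anchor_params(n_chunks, environ=None):
    """Pick (anchor_radius, max_anchor_hits) from the chunk count."""
    env = os.environ if environ is None else environ
    if n_chunks < 1024:        # < 32K tokens
        radius, hits = 2, 8
    elif n_chunks < 2048:      # 32K to 64K tokens
        radius, hits = 4, 16
    else:                      # >= 64K tokens
        radius, hits = 8, 32
    # New PFLASH_* spelling wins; the legacy DFLASH_* spelling is a fallback.
    r = _env_int(("PFLASH_COMPRESS_ANCHOR_RADIUS",
                  "DFLASH_COMPRESS_ANCHOR_RADIUS"), env)
    h = _env_int(("PFLASH_COMPRESS_MAX_ANCHOR_HITS",
                  "DFLASH_COMPRESS_MAX_ANCHOR_HITS"), env)
    return (radius if r is None else r, hits if h is None else h)


def window_tokens(radius):
    """Tokens covered by an anchor window of `radius` chunks on each side."""
    return (2 * radius + 1) * CHUNK_SIZE
```

With radius=2 the window is 5 chunks, or 160 tokens, which matches the figure quoted above; at radius=4 it grows to 9 chunks, or 288 tokens.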

Mirror the gemma4_backend.cpp:75-104 defensive pattern for the qwen35 target loader and the dflash decode draft loader. After loading weight tensors, derive head_dim / n_head / n_head_kv from wq->ne[1] / wk->ne[1] and compare against GGUF-declared values; set_last_error and return false on mismatch. Makes the 'stale scalar at graph-build time' bug class structurally impossible. Load-time only, no runtime cost. Existing well-formed GGUFs are unaffected (smoke verified).

When pflash compresses, set gen_req.fa_window_override = effective_prompt + 256 so spec-decode verify sees the entire compressed prompt. Pflash already paid compute to pick which tokens matter; verify never throws any of them away. When the override would exceed 2 * cfg_.fa_window (spec-decode's drafter cost stops earning its tok/J), the C2 gate in qwen35_backend's generate() falls back to AR (fa_window=0, full attention). AR sees every kept token at every context; we choose mechanism, not visibility. Zero new CLI flags. --draft remains the only knob for composition; all per-request adaptation is internal.

…scade default-on

Adds backwards-compat fallback wrappers for 6 cascade env vars in both standard and bandit code paths, so harness scripts using either spelling work against this binary. Emits one-time WARN to stderr when the legacy DFLASH_* spelling is honored. Also flips the default for `use_transitive` from `false` to `true` because the gated rare-token bridge improves multi-hop F1 with zero downside in the cascade-already-firing case.

…th drift

Single helper reads all 10 PFLASH_*/DFLASH_* env vars once. Both qwen35_score_and_compress and drafter_score_and_compress call it. Removes two 70-LOC duplicate env-reading blocks and the duplicated anchor-radius comment. Also removes dead force_chunk_neighborhood (no callers) and collapses the 4-overload load_drafter pyramid to one canonical implementation + 3 thin forwarders.

- qwen3_graph.cpp: collapse 18-line alg-note, trim VRAM prose (3 blocks), remove early_exit_n alias (inline early_exit_pre at call site)
- qwen35_backend.cpp: C2 gate 9-line → 2-line + docs ref; do_ar_decode budget-hook 15-line → 4-line + docs ref
- http_server.cpp: Design 1 rationale 13-line → 2-line + docs ref
- model_backend.h: BudgetHook 23-line essay → 3-line + docs ref
- gguf_target_loader.cpp: 4-line prose tail → 1-line
- .gitignore: ignore *.git-head / *.pre-pflash-rename workdir artifacts
- docs/: pflash-compress-cfg.md, pflash-adaptive-composition.md, anchor-transitive.md (consolidated rationale)

…nking is off

The hard-coded renderer appends a closed think prefill when thinking is disabled. Some Qwen3.6 Jinja templates omit that final assistant suffix, leaving the model in the wrong decoding state for tool use. Mirror the hard-coded behavior here when the rendered prompt ends with a bare assistant generation prompt; tolerate trailing-whitespace variants (single \n, double \n\n, trailing space).

Diagnosed by Round 5b D peer-chat showing dflash drafter accept_rate=0.0%: the drafter was distilled with the closed-think suffix in its training distribution; the Unsloth Qwen3-Coder template doesn't emit it, so target and drafter disagree on what comes after <|im_start|>assistant\n.
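A minimal sketch of the post-processing this commit describes, written in Python rather than the server's C++. The suffix string is an assumption based on the stock Qwen3 chat template's non-thinking prefill; the commit does not quote the exact bytes, and `ensure_closed_think` is a hypothetical name.

```python
# Assumed closed-think prefill (what the stock Qwen3 template emits when
# enable_thinking is false); not quoted in the commit message itself.
CLOSED_THINK = "<think>\n\n</think>\n\n"
GEN_PROMPT = "<|im_start|>assistant"


def ensure_closed_think(prompt, thinking_enabled):
    """Append the closed-think prefill after a bare assistant generation prompt."""
    if thinking_enabled:
        return prompt
    # Tolerate "\n", "\n\n", and trailing-space variants after the prompt.
    stripped = prompt.rstrip(" \n")
    if not stripped.endswith(GEN_PROMPT):
        # The template already wrote something after the generation prompt
        # (for example the closed think block), so leave it untouched.
        return prompt
    return stripped + "\n" + CLOSED_THINK
```

Because the check looks for a bare generation prompt at the very end, a template that already closes the think block is left alone, so the suffix is never appended twice. The sketch omits any per-architecture guard.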

… only

The previous commit applied the closed-think suffix to all Jinja-rendered prompts. Add arch_hint (ChatFormat) parameter to render_chat_template_jinja, defaulting to QWEN3, and guard the post-processing block with arch_hint == ChatFormat::QWEN3. Call site in http_server.cpp passes chat_format_ so other archs (Laguna, Gemma4) are unaffected. qwen35moe inherits ChatFormat::QWEN3 by design (matches drafter distillation).

5 unit tests cover: thinking-off appends, thinking-on no-append, non-Qwen3 arch no-append (Laguna + Gemma4), qwen35moe inherits QWEN3, no double-append when template already closes the think block. Diagnosis + verification protocol in docs/pflash-drafter-template-alignment.md.

Extract the C2 spec-decode gate from an inline expression in qwen35_backend.cpp into a pure predicate header c2_gate.h. Zero behavior change. Identical math:

  (fa_window_override == 0) || (fa_window_override <= 2 * fa_window_cfg)

The new header documents the empirically-derived rationale: at compressed KV sizes (pflash compression of long prompts), T_draft/T_target ratio approaches 1, eliminating spec-decode's profit margin over AR. Empirical at D_composition 128K replay: AR=27.5 tok/s vs forced spec-decode=5.74 tok/s. The gate correctly blocks spec-decode when eff_fa_window > 2*fa_window_cfg.

Adds 5 unit tests locking in the predicate's behavior with explicit Round 5 4-arm matrix bench citations.

Files:
- server/src/qwen35/c2_gate.h (new)
- server/src/qwen35/qwen35_backend.cpp (+1 include, inline -> call)
- server/test/test_server_unit.cpp (+60 LOC, 5 tests)
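For readers who want to poke at the predicate, here is the same math transcribed into Python. The real code is the C++ header c2_gate.h; the function names below and the `margin` parameter are illustrative, with the +256 margin taken from the earlier fa_window_override commit in this PR.

```python
def override_for_compressed_prompt(effective_prompt_tokens, margin=256):
    """fa_window_override so verify sees the whole compressed prompt."""
    return effective_prompt_tokens + margin


def c2_gate_allows_spec_decode(fa_window_override, fa_window_cfg):
    """True when spec-decode may run; False means fall back to AR."""
    return fa_window_override == 0 or fa_window_override <= 2 * fa_window_cfg
```

An override of 0 means no per-request override, so the gate always passes; otherwise spec-decode is allowed only while the override stays within twice the configured window.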

…nch in-tree

Squashes 78 commits from feat/lucebox-docker (PR Luce-Org#285) onto origin/main. Net: 189 files changed.

Major workstreams folded in:

* Docker prebuild stack: ghcr.io/easel/lucebox-hub:cuda12 image, multi-stage Dockerfile, docker-bake.hcl, .github/workflows/docker.yml with GHA cache, build identity baked into /opt/lucebox-hub/IMAGE_INFO + /opt/lucebox-hub/HOST_INFO.
* Host wrapper (lucebox.sh): probe_host, smart cmd_serve (INVOCATION_ID guard, container-state preflight), cmd_systemctl_passthrough (already-active short-circuit, restart-loop detection), cmd_update (bootstrap-installer pattern), cmd_completion (bash/zsh/fish), config.toml reader (env > toml > default precedence), shellcheck-clean.
* Bootstrap installer (install.sh): bakes LUCEBOX_INSTALLED_FROM into the installed copy so lucebox update keeps tracking the channel; refuses SHA-pinned URLs without LUCEBOX_INSTALL_CHANNEL.
* In-container Python CLI (lucebox/): sparse config.toml persistence, config get/set/unset sub-app, models list/download sub-app (replaces download-models), autotune with --apply / --json / --sweep, profile collapsed onto luce-bench snapshot (1701 → 183 lines).
* luce-bench: snapshot subcommand + canonical HostInfo schema v2 + levels (level0/1/2/3) + report subcommand + submit-baseline + regrade.
* Server (C++): /props.host block + props_schema=4 + host_info read at startup, /props.build identity, GGUF metadata + sha256 sidecars, model card sidecars.
* Harness: client implementations for claude/codex/opencode/hermes/pi.
* Strict 11-field config.toml allowlist for dflash.* runtime tunables.

Deleted (rolled into new structure):

* server/scripts/bench_agent.py, bench_he.py, bench_llm.py: replaced by luce-bench snapshot + areas.
* lucebox configure, lucebox download-models, lucebox benchmark: replaced by config sub-app, models sub-app, autotune --sweep.
* luce-bench --sweep flag: moved to argv-sniff subcommand dispatch.

Conflict resolution:

* server/scripts/bench_{agent,he,llm}.py: modify/delete kept the deletion (feat/lucebox-docker moved bench machinery into luce-bench).
* README.md: took feat-branch version. origin/main had 19 commits worth of minor README tweaks since the branch base; those need to be folded back in as a follow-up PR.
* docs/specs/openapi-props.yaml + docs/specs/props-endpoint.md: took feat-branch version. origin/main had 1 link-fix commit; feat-branch has the schema-4 + host-block additions that strictly supersede.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`_load_or_build()` returned `config_mod.load()`'s result verbatim when config.toml existed, ignoring `LUCEBOX_*` env vars entirely. That contradicted the precedence lucebox.sh documents (env > toml > default) and bit sindri in production: its config.toml had `[image]` without a `registry` line, so the dataclass default `ghcr.io/luce-org/lucebox-hub` beat the systemd unit's `Environment=LUCEBOX_IMAGE=ghcr.io/easel/...`.

Symptom: `lucebox start` brought up the wrong (stale luce-org) image even after explicit `lucebox install` + `lucebox pull` against easel.

Fix: overlay env on top of whatever `load()` returns (or `live_config()` falls back to). Only the five top-level scalars have env hooks (LUCEBOX_VARIANT/IMAGE/PORT/CONTAINER/MODELS); dflash/host/model intentionally don't.

Adds two regression tests:
- env beats config.toml when toml has no explicit value for that key,
- env still wins when toml is absent (covers the live_config fallback).

102 lucebox tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
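The env > toml > default overlay can be sketched as follows. This is not the lucebox implementation: the `Config` dataclass, its field defaults other than the image registry, and `overlay_env` are stand-ins for illustration; only the five LUCEBOX_* variable names and the luce-org image default come from the commit message.

```python
import os
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Config:
    # Stand-in for the loaded config; defaults other than `image` are guesses.
    variant: str = "cuda12"
    image: str = "ghcr.io/luce-org/lucebox-hub"
    port: int = 8080
    container: str = "lucebox"
    models: str = "~/.local/share/lucebox/models"


# The five top-level scalars that have env hooks, per the commit message.
ENV_HOOKS = {
    "LUCEBOX_VARIANT": "variant",
    "LUCEBOX_IMAGE": "image",
    "LUCEBOX_PORT": "port",
    "LUCEBOX_CONTAINER": "container",
    "LUCEBOX_MODELS": "models",
}


def overlay_env(cfg, environ=None):
    """Return a copy of cfg with any set LUCEBOX_* variables applied on top."""
    env = os.environ if environ is None else environ
    updates = {}
    for var, field_name in ENV_HOOKS.items():
        raw = env.get(var)
        if raw:  # unset or empty leaves the toml/default value in place
            updates[field_name] = int(raw) if field_name == "port" else raw
    return replace(cfg, **updates)
```

Applying the overlay after loading, rather than inside the loader, keeps the precedence rule in one place regardless of whether the config came from config.toml or from defaults.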

…g#285 CI

CI's "Lint Python surfaces touched by lucebox tooling" job ran `ruff check .` and found 11 errors across surfaces this branch touches. Ruff --fix handled 6 (import sorting, unused imports); 5 needed hand-edits:

  luce-bench/src/lucebench/report.py:172  E741  rename `for l in` → `for lineup in`
  lucebox/tests/test_check.py:39, 95  E731  lambda → def stub() for the two HostFacts stubs
  lucebox/tests/test_cli.py:95  E501  wrap the LUCEBOX_HOST_GPU_LIST_CSV setenv
  lucebox/tests/test_sweep.py:174, 177  E501  wrap two CellResult constructors

22 lucebox tests touched still pass; ruff is clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Merge PR Luce-Org#285 after it changed from draft to open during the cron run. Resolve refreshed Docker/lucebox/luce-bench conflicts by taking the PR head for feature files while preserving the server include required by the existing integration stack.

Validation:
- git diff --check
- python3 -m compileall -q lucebox/src lucebox/tests luce-bench/src luce-bench/tests harness/src
- uv run --with pytest python -m pytest lucebox/tests luce-bench/tests/test_report.py luce-bench/tests/test_smoke_area.py luce-bench/tests/test_runner.py -q

Keep the primary checkout clean after integrating PR Luce-Org#285 by ignoring the generated .docker-build/ CMake scratch directory. Update the auto-integration manifest with the final PR Luce-Org#285 merge and validation details.

- test_autotune_candidate_configs.py: sort imports (ruff I001). - download.py: api.repo_info() returns ModelInfo|DatasetInfo|SpaceInfo|KernelInfo and KernelInfo has no .siblings; use api.model_info() which returns ModelInfo (correct — we only query model repos here), resolving the mypy union-attr error. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…Luce-Org#10)

Closes the two validated pieces of the adaptive-keep path (the label-free quality-reward idea was dropped: Momus-confirmed it can't catch confident off-task). Default-OFF; router gates these to agentic-routed requests.

- regime_router.h: two pure helpers (stdlib-only, TDD'd):
  - clamp_keep_to_floor(bandit_keep, router_floor, agentic): agentic effective keep = max(bandit_keep, floor) so the bandit's 0.20 ceiling can no longer silently undercut the router's 0.25 floor.
  - compression_failed(tokens, degenerate_close, agentic_compressed, min=8): true on empty/degenerate output of an agentic compressed turn.
- adaptive_keep_ratio.h: per-session recover_full_next flag (+ set/consume).
- http_server.cpp: floor clamp at keep-apply; at the post-generate update site, on compression_failed → skip the bandit update (failure noise) and set the session to full keep for the next turn (deterministic recovery from the empty-response failure class, e.g. LONG_B t10). PFLASH_GUARD_MIN_TOKENS env (default 8) tunes the guard threshold.
- 59 standalone unit tests, -Werror.

LIVE-VALIDATED on RTX 3090 (server up on :18097, 34K-token prompts):
- type-gate: agentic→keep 0.250/cascade-off, retrieval→cascade-on.
- guard recovery loop: turn1 compression_failed→full-keep-next (resp_tokens=13, bandit update skipped); turn2 same session recover_full_next consumed→keep 1.0.
- floor clamp fired: agentic bandit 0.100 < floor 0.250 → 0.250.

Launch config (24GB): GGML_CUDA_NO_VMM=1 + --max-ctx 49152 (139264 KV OOMs the 3090; that was the pre-existing bad_alloc, not this change). Still default-OFF via PFLASH_ROUTER_ENABLE.
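The two helpers are small enough to mirror in Python for experimentation. The signatures follow the commit message; the body of `compression_failed` is a guess at the semantics ("true on empty/degenerate output of an agentic compressed turn") and may differ from the C++ in regime_router.h.

```python
def clamp_keep_to_floor(bandit_keep, router_floor, agentic):
    """Agentic requests never keep less than the router's floor."""
    return max(bandit_keep, router_floor) if agentic else bandit_keep


def compression_failed(tokens, degenerate_close, agentic_compressed, min_tokens=8):
    """Assumed logic: flag short or degenerate output on agentic compressed turns."""
    if not agentic_compressed:
        return False
    return tokens < min_tokens or degenerate_close
```

For example, `clamp_keep_to_floor(0.10, 0.25, True)` yields the floor of 0.25, matching the "floor clamp fired" line in the live validation, while non-agentic requests keep the bandit's value unchanged.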

The 2026-05-30 gemma full bench scored forge 0/30 cases with ``error_type=ValidationError`` on every row. Two stacked bugs:

1. The recording client called ``TextResponse(text=...)`` but the forge ``TextResponse`` field is named ``content``, so every send() raised a pydantic ValidationError, which surfaced as the per-row error_type. (Independent bug, fixed in one line: text=→content=.)
2. Even with #1 fixed, gemma emits ``call:get_country_info{country: "France"}call:summarize{text: "..."}`` as plain text in a ``text`` content block, not as Anthropic ``tool_use`` structured blocks, so the old client surfaced text-only responses and forge would have nudged forever waiting for a tool call.

This patch scans the assistant text for ``call:<verb>{args}`` invocations, parses the args as relaxed JSON (json.loads first, then a permissive pass that quotes bare keys), and synthesizes ``ToolCall`` entries that forge's WorkflowRunner consumes natively. Malformed args are dropped (per-call, not per-response) so a single mangled invocation doesn't crash the bench.

The forge LLMResponse contract is ``list[ToolCall] | TextResponse`` (forge_eval._forge.core.workflow), so synthesis stays within the existing types; no anthropic.types.Message construction needed.

Why client-side: the server's chat_template / SSE emitter could translate the plain-text shape into Anthropic tool_use blocks upstream (cleaner long-term), but that's a C++ change with broader scope. The client-side path also future-proofs the bench for any other model that uses the same plain-text tool serialization (codex-mini, DDX bead executor, etc.), the same intent already recognized in lucebench.areas.agent's _CALL_INVOCATION pattern.

Tests cover the parsing/synthesis helper in isolation: empty input, single calls, back-to-back calls, snake_case + kebab-case + ns:verb names, nested braces, strings containing } chars, unbalanced braces, and unparseable args.

Full test suite remains green (291 passed, +16 from this change).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
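The "json.loads first, then a permissive pass that quotes bare keys" step might look roughly like this. It is a simplified sketch, not the code shipped in the forge client: `parse_relaxed_args` and the regex are hypothetical, and the regex is not string-aware.

```python
import json
import re

# A bare identifier key directly after "{" or "," and before ":".
_BARE_KEY = re.compile(r'([{,]\s*)([A-Za-z_][A-Za-z0-9_]*)(\s*:)')


def parse_relaxed_args(raw):
    """Parse tool-call args: strict JSON first, then retry with keys quoted.

    Returns the parsed value, or None when both attempts fail so the caller
    can drop just this one invocation.
    """
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    quoted = _BARE_KEY.sub(r'\1"\2"\3', raw)
    try:
        return json.loads(quoted)
    except json.JSONDecodeError:
        return None
```

Since the substitution ignores string boundaries, a value such as `"a, b: c"` paired with a bare key can be rewritten into invalid JSON; in that case the function returns None and the invocation is dropped rather than mis-parsed.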

Update the auto-integration manifest after PR Luce-Org#285 advanced during the cron run. Record the clean merge, draft list change, retained worktree, and luce-bench Forge grader validation.

…squashed)

Brings in the full pflash prefill-compression system as a single revertible commit. Default-OFF behind PFLASH_ROUTER_ENABLE=1; requires Qwen3-0.6B drafter weights to activate.

Key capabilities merged from pflash/ee7:
- ee7 early-exit drafter + anchor-transitive cascade + tail-capture guard
- Adaptive keep-ratio / anchor_radius (eliminates 64K NIAH cliff)
- Adaptive compression-regime router (type-gate: agentic=0.25, retrieval=full)
- Adaptive fa_window composition via per-request override
- PFLASH_*/DFLASH_* dual env-var aliasing with transitive cascade defaults
- Empty-response guard + bandit floor reconciliation
- Closed <think> prefill injection in Jinja renderer for Qwen3 nothink mode
- eval_quality_compare.py for LongBench F1 regression detection
- New test suites: anchor_transitive, drafter regression, regime_router

Conflicts resolved:
- .gitignore: kept both lucebox-hub entries and pflash backup-suffix entries
- chat_template.cpp: merged Qwen3 closed-think suffix injection into our PromptRenderResult return path
- test_server_unit.cpp: kept started_in_thinking regression suite (HEAD) and adapted pflash's 5 Qwen3 closed-think tests to use PromptRenderResult.text

Original 16-commit range: d4546a5..8fc961b (pflash/ee7)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…cebox-docker

# Conflicts: # README.md # server/src/common/model_backend.h # server/src/qwen35/qwen35_backend.cpp

…_unit

- scripts/pflash_session_bench.py: standalone A/B benchmark for pflash using the multi-turn session fixture (8K-131K token cases). Sends the largest case fitting the server's max_ctx and reports wall/decode timing. Use --bucket to select a specific tier.
- Dockerfile: add test_server_unit to cmake build targets so the template-coverage regression suite ships in the image for CI checks.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

cubic-dev-ai

1 issue found across 2 files (changes from recent commits).

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="scripts/pflash_session_bench.py">

<violation number="1">
P2: `decode` TPS is mislabeled: it is computed from total wall time, not decode time.</violation>
</file>

cubic-dev-ai · 2026-05-31T15:25:39Z

@@ -0,0 +1,156 @@
+#!/usr/bin/env python3


P2: decode TPS is mislabeled: it is computed from total wall time, not decode time.

Prompt for AI agents

Check if this issue is valid; if so, understand the root cause and fix it. At scripts/pflash_session_bench.py, line 143:
<comment>`decode` TPS is mislabeled: it is computed from total wall time, not decode time.</comment>
<file context>
@@ -0,0 +1,156 @@
+    wall = result["wall_s"]
+    in_tok = result["prompt_tokens"]
+    out_tok = result["completion_tokens"]
+    tps = out_tok / wall if wall > 0 else 0
+    print(f"  wall={wall:.1f}s  in={in_tok}  out={out_tok}  "
+          f"decode={tps:.1f}tok/s  chars={result['content_chars']}  "
</file context>
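One way to address the review comment is to label the wall-clock figure as overall throughput and report a decode rate only when a decode-only duration is available. The sketch below assumes a hypothetical `decode_s` field in the result dict; the script as shown only has `wall_s`, and `throughput` is an illustrative name.

```python
def throughput(result):
    """Separate overall tokens/s from decode-only tokens/s."""
    wall = result["wall_s"]
    out_tok = result["completion_tokens"]
    stats = {"overall_tps": out_tok / wall if wall > 0 else 0.0}
    # `decode_s` is a hypothetical decode-only timing; when it is absent we
    # do not report a decode rate instead of mislabeling the wall-clock one.
    decode_s = result.get("decode_s")
    if decode_s:
        stats["decode_tps"] = out_tok / decode_s
    return stats
```

On long prompts prefill dominates wall time, so the two numbers can differ by a large factor, which is why the label matters.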

The raw vocab token for Gemma4's thinking channel opener is "<|channel>thought" (id 100), not "<|channel>". The previous equality check `raw == "<|channel>"` never matched, so the token fell through to the <|...|> skip filter but leaked as literal text "thought\n" into code completions, causing HumanEval code=0%.

Fix: change both streaming and non-streaming paths to `raw.starts_with("<|channel>")`. This was tracked as follow-up #3 in docs/experiments/gemma4-26b-thinking-control-2026-05-25.md. Requires image rebuild to take effect.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

cubic-dev-ai · 2026-05-31T15:31:46Z

You're iterating quickly on this pull request. To help protect your rate limits, cubic has paused automatic reviews on new pushes for now—when you're ready for another review, comment @cubic-dev-ai review.

The sanity-check RUN step already verifies test_dflash and dflash_server exist. Add test_server_unit so a failed test-binary build (e.g. a future build target removal) is caught at image-build time rather than silently shipping without the test binary. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Captures the diagnosis (gemma forge 0/30 on 2026-05-30), the proposed sixth detection pattern, the relaxed-JSON arg parser sketch, the unit-test matrix, and codex's review (which forced reordering the new pattern to slot #5 ahead of the bare-JSON sweep to avoid interception of nested name/arguments-shaped args).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds a sixth detection pattern to `parse_tool_calls` that recognizes the plain-text tool invocations gemma emits in chat-completion content (`call:get_country_info{country: "France"}` / `call:execute-bead:read-file{path: "..."}` / etc).

The 2026-05-30 gemma full bench scored forge 0/30 because every row's output carried these `call:<verb>{...}` invocations as text rather than structured `tool_use` content blocks. None of the existing five envelope-shaped detectors (`<tool_call>`, `<function=...>`, `<tool_code>`, bare JSON) match the bare `call:` shape.

The new pattern:
- Anchors on a sentinel character (whitespace, comma, semicolon, open/close bracket, etc.) before `call:` so narrative usages like `narrative.call:foo` don't match.
- Supports namespaced verbs (`execute-bead:read-file`, `default_api:fetch_sales_data`) and strips the namespace before using the verb as the ToolCall name.
- Extracts the args block via a quote- and escape-aware balanced-brace scanner that tolerates `"`, `'`, and `` ` `` string literals and tracks `[]` depth alongside `{}`.
- Parses the args as strict JSON first, then falls back to a relaxed rewrite that quotes bare identifier keys and normalizes single/backtick quoted strings to double-quoted before retrying. Malformed args drop the single invocation without crashing or polluting other calls.
- Runs *before* the bare-JSON sweep so that inner args of the form `call:outer{"name": "inner", "arguments": {}}` aren't hijacked into a spurious `inner` ToolCall by pattern #6.

Downstream the existing wiring takes over: SseEmitter::accumulate already calls parse_tool_calls; a non-empty ToolCall list flips finish_reason to `tool_calls`, which the Anthropic /v1/messages branch maps to `stop_reason="tool_use"` with `tool_use` content blocks (http_server.cpp:2030-2090) and the OpenAI branch maps to `choices[].message.tool_calls`.

The forge client-side workaround `_parse_plain_text_tool_calls` shipping on feat/lucebox-docker (commit deba2fd) becomes redundant once a server with this fix is deployed. It stays in place as defense-in-depth for older deployed servers.

Test plan: 14 new C++ unit cases in test_server_unit.cpp covering single / back-to-back / namespaced / snake- and kebab-case verbs; tool-allowed filtering; mid-prose rejection vs. whitespace-led acceptance; malformed args drop; inner `{}` inside string literals; strict-JSON and relaxed-keys arg parsing; cleaned_text scrubbing; the codex-requested inner `name`/`arguments` interception case; and multi-line nested-array args mirroring the snapshot data. All pass in a standalone driver.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
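To make the "quote- and escape-aware balanced-brace scanner" concrete, here is a small Python rendering of the idea. The server's scanner is C++ and its details are not shown in this PR; `extract_braced` is a hypothetical helper that only demonstrates the technique.

```python
def extract_braced(text, start):
    """Return the balanced {...} block that begins at text[start], else None.

    Brackets inside "...", '...', or `...` literals are ignored, and
    backslash escapes inside a literal are honored.
    """
    if start >= len(text) or text[start] != "{":
        return None
    closers = {"}": "{", "]": "["}
    stack = []        # currently open brackets
    quote = None      # active string delimiter, if any
    escaped = False   # previous char was a backslash inside a string
    for i in range(start, len(text)):
        ch = text[i]
        if quote:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == quote:
                quote = None
            continue
        if ch in "\"'`":
            quote = ch
        elif ch in "{[":
            stack.append(ch)
        elif ch in closers:
            if not stack or stack.pop() != closers[ch]:
                return None  # mismatched closer
            if not stack:
                return text[start:i + 1]
    return None  # ran off the end: unbalanced or unterminated string
```

Tracking `[` alongside `{` on one stack is what lets nested-array args pass while a stray `]` or `}` still causes the single invocation to be rejected.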

Port a narrow slice of PR Luce-Org#135 into the current stack: daemon cache-slot parsing, independent extra TargetCache state, graph/feature-mirror swapping, and cleanup handling. Refresh auto-integration manifest after merging advanced PR Luce-Org#285.

The server binary only accepts these three values; "compress" is silently rejected at startup with pflash falling back to off. Add a caster that raises ValueError immediately on config_set so the error is caught early rather than manifesting as a silent pflash=off at runtime. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…rogress)

Records methodology, baseline speed result (32K session: wall=89.3s, prefill~87s), and corrects prefill_mode="compress"→"auto" bug discovered during setup. PFlash quality and speed legs TBD after server restart.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…cebox-docker

PFlash requires prefix_cache_slots>0 to work. With prefix_cache_slots=0 (current optimal config), all chunks are forced (100%), adding drafter overhead with zero compression benefit. Speed bench result: 1291/1318 chunks forced at 42K tokens → 97.9% kept. Quality benchmark running; expected ≈ baseline (pflash is a no-op). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

std::string::starts_with() is C++20 but CMakeLists.txt requires C++17. Replace with rfind("<|channel>", 0) == 0, idiomatic C++17 equivalent. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Documents the server-side call:<verb>{} tool parser fix (PR Luce-Org#323) and the C++17 compatibility fix for starts_with. Benchmarks running. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Continue the Luce-Org#135 selective-port stack with diagnostic-only SCHED_STEP and SCHED_DRAIN daemon commands. They report request counts and active/per-slot target-cache state without mutating live scheduler state. Refresh the auto-integration manifest and record the latest Luce-Org#285 head merge.

…sults

- Add partial agent_recorded results (2/4 PASS vs prior 3/26 PASS)
- Identify that channel routing fix likely explains agent_recorded improvement, not just the call:verb parser
- Document two distinct fixes in image 1443239

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

PFlash with prefix_cache_slots=0 forces all KV chunks → zero compression. Confirmed by bragi A/B test 2026-05-31. Update bracket comments and module docstring to note both the drafter file AND prefix caching requirements for pflash to be effective. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

easel force-pushed the feat/lucebox-docker branch 3 times, most recently from b5d4cc5 to 3642703 on May 27, 2026 18:15

easel mentioned this pull request May 27, 2026

feat(gemma4): wire prefill/decode timing into GenerateResult #287

Merged

easel force-pushed the feat/lucebox-docker branch 5 times, most recently from f2ddfc4 to 2be3eef on May 27, 2026 18:56

dusterbloom and others added 14 commits May 28, 2026 19:44

refactor(pflash): rename DFLASH_COMPRESS_* → PFLASH_COMPRESS_* (cascade env vars)

94907a4

bench: add eval_quality_compare.py for LongBench F1 regression detection

766e46d

easel force-pushed the feat/lucebox-docker branch from 244257c to f4db35b on May 29, 2026 05:16

easel marked this pull request as ready for review May 29, 2026 05:23

dusterbloom and others added 2 commits May 31, 2026 10:15

easel mentioned this pull request May 31, 2026

fix(forge): synthesize tool_use from call:<verb>{} plain-text emissions #320

Closed

4 tasks

Merge bragi: think vs nothink baseline summary doc

deb5adb

easel and others added 5 commits May 31, 2026 10:05

Merge remote-tracking branch 'easel/feat/lucebox-docker' into feat/lucebox-docker

a45c9fa

Merge easel/feat/lucebox-docker: PFlash batch + chat_template gates

1122d02

Merge origin/main: spec-decode empty fallback + prefix-cache fix + docs

5b15d34

# Conflicts:
#   README.md
#   server/src/common/model_backend.h
#   server/src/qwen35/qwen35_backend.cpp

cubic-dev-ai Bot reviewed May 31, 2026

easel and others added 6 commits May 31, 2026 11:33

Merge PR Luce-Org#323: server-side call:<verb>{} tool parser

5ca695c

Merge bragi: gemma4 channel-token fix + pflash test scripts

c70ebb0

Merge bragi: Dockerfile test_server_unit guard

329f611

easel mentioned this pull request May 31, 2026

fix(server): parse gemma's call:<verb>{} plain-text tool emissions #323

Closed

13 tasks

easel and others added 6 commits May 31, 2026 11:55

Merge remote-tracking branch 'easel/feat/lucebox-docker' into feat/lucebox-docker

d339916

fix(server): replace C++20 starts_with with C++17 rfind

1443239

std::string::starts_with() is C++20 but CMakeLists.txt requires C++17. Replace with rfind("<|channel>", 0) == 0, idiomatic C++17 equivalent. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

docs(experiments): add Gemma4 call:verb parser fix verification doc

aa00f49

Documents the server-side call:<verb>{} tool parser fix (PR Luce-Org#323) and the C++17 compatibility fix for starts_with. Benchmarks running. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

easel and others added 2 commits May 31, 2026 12:53

feat(lucebox): docker stack + CLI + bench/profile + harness + luce-bench in-tree #285
easel wants to merge 72 commits into Luce-Org:main from easel:feat/lucebox-docker
