feat(lucebox): docker stack + CLI + bench/profile + harness + luce-bench in-tree #285

Open
easel wants to merge 72 commits into Luce-Org:main from easel:feat/lucebox-docker

Conversation

Collaborator

@easel easel commented May 27, 2026

This PR turns Lucebox into a one-command local inference deployment and ships the two tools that operate it: lucebox (the host CLI that runs and tunes the server) and luce-bench (the benchmark + grading framework that measures it). All three ship together so a fresh box goes from nothing to a tuned, benchmarked server with a single install.

The three pieces, what each is, and how to use it:


1. Docker — the server image

A CUDA 12.8 image (ghcr.io/luce-org/lucebox-hub:cuda12) that builds the dflash server and bundles server/, lucebox/, harness/, and luce-bench/. The entrypoint dispatches serve (default), benchmark, any lucebox subcommand, or shell. An in-container autotune fallback picks VRAM-tiered defaults and resolves the draft GGUF by target architecture (gemma4 → gemma drafter, qwen3.6 → dflash-draft-3.6).

Use it directly:

docker run --rm --gpus all -p 8080:8080 \
  -v ~/.local/share/lucebox/models:/opt/lucebox-hub/server/models \
  ghcr.io/luce-org/lucebox-hub:cuda12
# OpenAI + Anthropic-compatible API on :8080
curl -s http://localhost:8080/v1/models

Image tags: :cuda12, :vX.Y.Z-cuda12, :X.Y-cuda12, :sha-<short>-cuda12. Built and pushed by .github/workflows/docker.yml; docker-bake.hcl has a cuda13 slot ready.
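
As an illustration of the tag scheme above, here is a minimal sketch that expands one release into the four tag shapes. The helper name, the example version, and the 7-character short SHA are assumptions for illustration, not taken from docker.yml.

```python
def image_tags(version: str, sha: str, variant: str = "cuda12") -> list:
    """Expand one release into the four tag shapes listed in the PR text."""
    major, minor, _patch = version.split(".")
    short_sha = sha[:7]  # assumption: "<short>" is the 7-character git prefix
    return [
        variant,                       # :cuda12
        f"v{version}-{variant}",       # :vX.Y.Z-cuda12
        f"{major}.{minor}-{variant}",  # :X.Y-cuda12
        f"sha-{short_sha}-{variant}",  # :sha-<short>-cuda12
    ]


REPO = "ghcr.io/luce-org/lucebox-hub"
# Hypothetical version and SHA, used only to show the output shape.
for tag in image_tags("1.2.3", "0123456789abcdef"):
    print(f"{REPO}:{tag}")
```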

2. lucebox — the host CLI

lucebox.sh is the host-side wrapper (deps: docker + nvidia-smi only). It probes the host, writes a tuned config.toml, runs the container as a user-systemd service, and delegates provisioning/workloads to the in-container Python CLI (models, autotune, profile, smoke, config, the client drivers).

Stand a server up:

lucebox check            # driver / docker / NVIDIA Container Toolkit / VRAM / systemd / WSL2 probe
lucebox pull             # docker pull the cuda12 image
lucebox models download  # pull target + DFlash draft GGUFs  (verbs: list, download)
lucebox autotune         # VRAM-tiered DFLASH_* defaults → ~/.lucebox/config.toml  (autotune --sweep picks a winner empirically)
lucebox install          # install the user-systemd unit
lucebox start            # bring it up   (enable = start at every login)
lucebox status           # unit state + the server's startup banner
lucebox logs             # follow the journal
lucebox smoke            # props/tools/http/1-token health check

Tune it to the GPU:

lucebox profile          # level1/2/3 sweep over DFLASH_MAX_CTX × DFLASH_BUDGET ×
                         # {KV type, pFlash mode, lazy-draft, prefix-cache slots},
                         # gated on capability + ds4-eval/agentic validation before
                         # the winner merges into config.toml

The running config is observable at GET /props (schema 4), which now reports a host block — kernel, OS, WSL vs native, driver, CPU, RAM, GPU — so a server self-describes its real config and host.
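
A small client-side sketch of reading that block, assuming only that /props returns JSON with an optional top-level `host` object. The key names inside `host` are not spelled out here, so the helper prints whatever keys it finds rather than guessing them; both function names are hypothetical.

```python
import json
from urllib.request import urlopen


def fetch_props(base_url: str = "http://localhost:8080") -> dict:
    """GET /props from a running server and decode the JSON body."""
    with urlopen(f"{base_url}/props", timeout=10) as resp:
        return json.load(resp)


def describe_host(props: dict) -> str:
    """Render the host block as one line, or say that it is missing."""
    host = props.get("host")
    if not host:
        return "no host block reported"
    # Sort keys so the output is stable across servers.
    return ", ".join(f"{key}={value}" for key, value in sorted(host.items()))
```

Against a live server, `print(describe_host(fetch_props()))` would show the reported host on one line; a server without /props.host falls through to the "no host block" message.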

3. luce-bench — the benchmark + grading framework

In-tree workspace member (luce-bench/, 0.2.7.dev0) that scores any OpenAI/Anthropic-compatible endpoint and writes versioned, comparable result files. Areas: smoke, ds4-eval (92 reasoning items), gsm8k, truthfulqa-mc1, hellaswag, code, longctx, agent, agent_recorded, forge. Every result stamps a per-area grader_version and a host block (from /props.host, or a clearly-marked client-side fallback for servers without /props).

Run it:

uvx --from 'git+https://github.com/easel/lucebox-hub@feat/lucebox-docker#subdirectory=luce-bench' \
  luce-bench --base-url http://localhost:8080 --model dflash --areas all --no-think

Thinking control is portable. Each request carries three control shapes (chat_template_kwargs.enable_thinking, Anthropic thinking:{type}, reasoning_effort). For servers that ignore the API flags (e.g. OpenRouter), --prompt-thinking-control {auto,on,off} (default auto) injects the model family's in-band token (/no_think, /think); auto fires only when /props shows no server-side enforcement. A post-run verifier records thinking_control_honored so a nothink run that secretly reasoned is flagged, not silently mislabeled.
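
The decision logic can be sketched roughly as below. This is not the luce-bench implementation: the function name, the choice to append the token to the user message, and the `reasoning_effort` values are placeholders, and `server_enforces` stands in for whatever luce-bench reads from /props.

```python
IN_BAND_TOKEN = {True: "/think", False: "/no_think"}


def build_request(model, user_text, think, prompt_control="auto", server_enforces=False):
    """Build one chat request carrying all three control shapes.

    prompt_control mirrors --prompt-thinking-control {auto,on,off}.
    """
    # "auto" injects the in-band token only when the server does not enforce the flag.
    inject = prompt_control == "on" or (prompt_control == "auto" and not server_enforces)
    content = f"{user_text} {IN_BAND_TOKEN[think]}" if inject else user_text
    return {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        # Shape 1: chat-template flag.
        "chat_template_kwargs": {"enable_thinking": think},
        # Shape 2: Anthropic-style thinking block.
        "thinking": {"type": "enabled" if think else "disabled"},
        # Shape 3: reasoning effort (values here are placeholders).
        "reasoning_effort": "medium" if think else "none",
    }
```

Sending all three shapes on every request means a server only has to understand one of them; the in-band token is the fallback for servers that understand none.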

Comparing results: runs from one grader version are comparable as written. For older snapshots graded by a different version, luce-bench regrade <dirs> re-scores stored outputs at the current pinned grader and refuses to place mismatched-version (or mismatched-host) runs in the same row. report / snapshot / submit-baseline round out the reporting surface.
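
The "same row" rule can be pictured with a sketch like the one below. The record fields (`grader_version`, `host`) and the function name are hypothetical stand-ins, not the luce-bench result schema.

```python
def assert_same_row(runs: list) -> None:
    """Raise if the runs cannot share one comparison row.

    Each run is a dict with hypothetical keys "grader_version" and "host".
    """
    keys = {(run["grader_version"], run["host"]) for run in runs}
    if len(keys) > 1:
        raise ValueError(f"refusing to compare mismatched runs: {sorted(keys)}")
```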


Also in this PR

  • harness/ — drives real clients (claude_code, codex, opencode, hermes, pi, openclaw) against a running server; lucebox profile delegates bench runs here.
  • Model-card sidecars: share/model_cards/{qwen3.6-27b,gemma-4-26b-a4b-it,gemma-4-31b-it,laguna-xs.2}.json + _schema.json, so the server resolves sampler defaults, thinking budgets, and the force-close hint per model.
  • Workspace: pyproject.toml declares all members (server, lucebox, luce-bench, harness, optimizations/{megakernel,pflash}); [tool.uv.sources] luce-bench = { workspace = true } replaces the prior git-tag pin. release-luce-bench.yml publishes to PyPI on luce-bench-v* tags.
  • Docs — README quick start + hardware/env reference; server/docs/ benchmark-snapshot spec and experiment write-ups.
  • Removes the obsolete server/scripts/bench_*.py (their work now lives in luce-bench).

Out of scope / follow-ups

  • Gemma 4 31B backend wiring beyond what its model card ships (validated empirically @ 24 GB, AR-only).
  • gemma4 MoE expert split.
  • Multi-Token Prediction (upstream, draft).

Validation

  • uv sync clean on the workspace; luce-bench test suite passes.
  • Full --areas all sweeps run end-to-end against bragi (RTX 5090 Laptop), sindri (RTX 3090 Ti), vidar (M2 Ultra / MLX), and OpenRouter, think and nothink, all on one grader version.
  • /props.host confirmed populated on lucebox servers (bragi + sindri report WSL2); OpenRouter nothink confirmed honored via client-side /no_think injection.

@easel easel force-pushed the feat/lucebox-docker branch 3 times, most recently from b5d4cc5 to 3642703 Compare May 27, 2026 18:15
@easel easel force-pushed the feat/lucebox-docker branch 5 times, most recently from f2ddfc4 to 2be3eef Compare May 27, 2026 18:56
Collaborator Author

easel commented May 27, 2026

Some commands to test this, copied from the README.

Install the lucebox wrapper:

curl -fsSL https://raw.githubusercontent.com/easel/lucebox-hub/feat/lucebox-docker/lucebox.sh \
       -o ~/.local/bin/lucebox.sh && chmod +x ~/.local/bin/lucebox.sh

Run lucebox using the Docker image:

# Override the container image to the temporary build:
export LUCEBOX_IMAGE=ghcr.io/easel/lucebox-hub

# Check your machine for lucebox compatibility
lucebox check

# Start the lucebox server
lucebox serve

Run benchmarks against a local server:

uvx --refresh --from "git+https://github.com/easel/lucebox-hub@feat/lucebox-docker#subdirectory=luce-bench" lucebench --url http://localhost:1236

Run benchmarks against OpenRouter:

uvx --refresh --from "git+https://github.com/easel/lucebox-hub@feat/lucebox-docker#subdirectory=luce-bench" lucebench --base-url https://openrouter.ai/api --model qwen/qwen3.6-27b --auth-env OPENROUTER_API_KEY

dusterbloom and others added 14 commits May 28, 2026 19:44
…g-42 tail-capture guard

ee7 truncates drafter forward at layer 7 of 28, scoring only those layers.
9.3× drafter wall at 128K (RTX 3090, Qwen3.6-27B-Q4_K_M target + Qwen2.5-0.5B-BF16 drafter).
Anchor-transitive cascade rescues multi-hop on bimodal-density prompts (gated, default OFF).
Bug Luce-Org#42 fix: tail-capture view-bounds guard at S%4096 in {1..7}.

5 unit tests included. Bench scripts split to follow-up PR.

At >=32K context the needle text is more likely to straddle multiple
chunks (chunk_size=32), and the fixed anchor_radius=2 window (5 chunks
~160 tokens) loses the back half of the needle digits — the model
retrieves '...is 4' but truncates/hallucinates the continuation.

Adaptive scaling based on n_chunks:
  <32K  context (<1024 chunks): radius=2,  max_anchor_hits=8   (unchanged)
  32-64K (1024-2047 chunks):    radius=4,  max_anchor_hits=16
  >=64K (>=2048 chunks):        radius=8,  max_anchor_hits=32

Override via PFLASH_COMPRESS_ANCHOR_RADIUS / PFLASH_COMPRESS_MAX_ANCHOR_HITS
env vars (legacy DFLASH_COMPRESS_* names still accepted).
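
The tier table above, written as a sketch in Python (the real code is C++). The thresholds and `chunk_size=32` come from the commit message; the function names and the floor division used to count chunks are assumptions, and the env-var overrides are left out.

```python
CHUNK_SIZE = 32  # tokens per chunk, per the commit message


def anchor_params(n_tokens: int) -> tuple:
    """Return (anchor_radius, max_anchor_hits) for a prompt of n_tokens."""
    n_chunks = n_tokens // CHUNK_SIZE  # assumption: partial chunks are not counted
    if n_chunks >= 2048:   # >=64K context
        return 8, 32
    if n_chunks >= 1024:   # 32-64K context
        return 4, 16
    return 2, 8            # <32K context, unchanged defaults


def window_tokens(radius: int) -> int:
    """Tokens covered by the anchor window: the hit chunk plus radius on each side."""
    return (2 * radius + 1) * CHUNK_SIZE
```

With the fixed radius of 2 the window is 5 chunks, about 160 tokens, which is the figure quoted above; at the 49K validation point the radius becomes 4, widening the window to 9 chunks.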

Validated at 49K context: NIAH needle 'kowefada 1596346' correctly
retrieved (was: '1594' or hallucinated 'is 048394839483' before fix).
Resolves the long-standing 'project_64k_quality_cliff' memory entry.

Mirror the gemma4_backend.cpp:75-104 defensive pattern for the qwen35
target loader and the dflash decode draft loader. After loading weight
tensors, derive head_dim / n_head / n_head_kv from wq->ne[1] /
wk->ne[1] and compare against GGUF-declared values; set_last_error
and return false on mismatch.

Makes the 'stale scalar at graph-build time' bug class structurally
impossible. Load-time only, no runtime cost. Existing well-formed
GGUFs are unaffected (smoke verified).

When pflash compresses, set gen_req.fa_window_override =
effective_prompt + 256 so spec-decode verify sees the entire
compressed prompt. Pflash already paid compute to pick which tokens
matter; verify never throws any of them away.

When the override would exceed 2 * cfg_.fa_window (spec-decode's
drafter cost stops earning its tok/J), the C2 gate in
qwen35_backend's generate() falls back to AR (fa_window=0, full
attention). AR sees every kept token at every context; we choose
mechanism, not visibility.

Zero new CLI flags. --draft remains the only knob for composition;
all per-request adaptation is internal.

…scade default-on

Adds backwards-compat fallback wrappers for 6 cascade env vars in both
standard and bandit code paths, so harness scripts using either spelling
work against this binary. Emits one-time WARN to stderr when the legacy
DFLASH_* spelling is honored.

Also flips the default for `use_transitive` from `false` to `true` because
the gated rare-token bridge improves multi-hop F1 with zero downside in
the cascade-already-firing case.
…th drift

Single helper reads all 10 PFLASH_*/DFLASH_* env vars once. Both
qwen35_score_and_compress and drafter_score_and_compress call it.
Removes two 70-LOC duplicate env-reading blocks and the duplicated
anchor-radius comment. Also removes dead force_chunk_neighborhood
(no callers) and collapses the 4-overload load_drafter pyramid to
one canonical implementation + 3 thin forwarders.

- qwen3_graph.cpp: collapse 18-line alg-note, trim VRAM prose (3 blocks),
  remove early_exit_n alias (inline early_exit_pre at call site)
- qwen35_backend.cpp: C2 gate 9-line → 2-line + docs ref;
  do_ar_decode budget-hook 15-line → 4-line + docs ref
- http_server.cpp: Design 1 rationale 13-line → 2-line + docs ref
- model_backend.h: BudgetHook 23-line essay → 3-line + docs ref
- gguf_target_loader.cpp: 4-line prose tail → 1-line
- .gitignore: ignore *.git-head / *.pre-pflash-rename workdir artifacts
- docs/: pflash-compress-cfg.md, pflash-adaptive-composition.md,
  anchor-transitive.md (consolidated rationale)

…nking is off

The hard-coded renderer appends a closed think prefill when thinking is
disabled. Some Qwen3.6 Jinja templates omit that final assistant suffix,
leaving the model in the wrong decoding state for tool use. Mirror the
hard-coded behavior here when the rendered prompt ends with a bare
assistant generation prompt; tolerate trailing-whitespace variants
(single \n, double \n\n, trailing space).

Diagnosed by Round 5b D peer-chat showing dflash drafter accept_rate=0.0%:
the drafter was distilled with the closed-think suffix in its training
distribution; the Unsloth Qwen3-Coder template doesn't emit it, so target
and drafter disagree on what comes after <|im_start|>assistant\n.

… only

The previous commit applied the closed-think suffix to all Jinja-rendered
prompts. Add arch_hint (ChatFormat) parameter to render_chat_template_jinja,
defaulting to QWEN3, and guard the post-processing block with
arch_hint == ChatFormat::QWEN3. Call site in http_server.cpp passes
chat_format_ so other archs (Laguna, Gemma4) are unaffected. qwen35moe
inherits ChatFormat::QWEN3 by design (matches drafter distillation).

5 unit tests cover: thinking-off appends, thinking-on no-append, non-Qwen3
arch no-append (Laguna + Gemma4), qwen35moe inherits QWEN3, no double-append
when template already closes the think block.

Diagnosis + verification protocol in docs/pflash-drafter-template-alignment.md.
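
The post-processing described in these two commits can be sketched in Python as follows (the real code lives in the C++ Jinja renderer). The function name is hypothetical, and the exact suffix text is an assumption based on the usual Qwen3 empty-think prefill; the enum mirrors the ChatFormat names mentioned above.

```python
from enum import Enum


class ChatFormat(Enum):
    QWEN3 = "qwen3"
    LAGUNA = "laguna"
    GEMMA4 = "gemma4"


ASSISTANT_PROMPT = "<|im_start|>assistant"
CLOSED_THINK = "<think>\n\n</think>\n\n"  # assumption: the closed-think prefill text


def align_nothink_prompt(prompt: str, thinking_enabled: bool,
                         arch: ChatFormat = ChatFormat.QWEN3) -> str:
    """Append the closed-think prefill when a Qwen3 template left it out."""
    if thinking_enabled or arch is not ChatFormat.QWEN3:
        return prompt
    # Tolerate "\n", "\n\n", or a trailing space after the generation prompt.
    stripped = prompt.rstrip(" \n")
    if not stripped.endswith(ASSISTANT_PROMPT):
        # The template already emitted something (for example a closed think
        # block), so leave the prompt alone to avoid a double append.
        return prompt
    return stripped + "\n" + CLOSED_THINK
```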
Extract the C2 spec-decode gate from an inline expression in
qwen35_backend.cpp into a pure predicate header c2_gate.h.

Zero behavior change. Identical math:
  (fa_window_override == 0) || (fa_window_override <= 2 * fa_window_cfg)

The new header documents the empirically-derived rationale: at
compressed KV sizes (pflash compression of long prompts), T_draft/T_target
ratio approaches 1, eliminating spec-decode's profit margin over AR.
Empirical at D_composition 128K replay: AR=27.5 tok/s vs forced
spec-decode=5.74 tok/s. The gate correctly blocks spec-decode when
eff_fa_window > 2*fa_window_cfg.

Adds 5 unit tests locking in the predicate's behavior with explicit
Round 5 4-arm matrix bench citations.

Files:
- server/src/qwen35/c2_gate.h (new)
- server/src/qwen35/qwen35_backend.cpp (+1 include, inline -> call)
- server/test/test_server_unit.cpp (+60 LOC, 5 tests)
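
For readers skimming the thread, the predicate and the override it consumes look like this when transcribed to Python. The expression is the one quoted above and the +256 margin comes from the earlier fa_window commit; the function names and the 2048 window used in the example are illustrative only.

```python
VERIFY_MARGIN = 256  # slack added on top of the compressed prompt length


def override_for_compressed_prompt(effective_prompt: int) -> int:
    """fa_window_override set when pflash has compressed the prompt."""
    return effective_prompt + VERIFY_MARGIN


def c2_allows_spec_decode(fa_window_override: int, fa_window_cfg: int) -> bool:
    """True when spec-decode may run; False means fall back to AR."""
    return fa_window_override == 0 or fa_window_override <= 2 * fa_window_cfg


# Example with a hypothetical configured window of 2048 tokens.
for prompt_len in (1000, 3840, 3841):
    override = override_for_compressed_prompt(prompt_len)
    print(prompt_len, override, c2_allows_spec_decode(override, 2048))
```
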
…nch in-tree

Squashes 78 commits from feat/lucebox-docker (PR Luce-Org#285) onto origin/main.
Net: 189 files changed.

Major workstreams folded in:

* Docker prebuild stack: ghcr.io/easel/lucebox-hub:cuda12 image, multi-stage
  Dockerfile, docker-bake.hcl, .github/workflows/docker.yml with GHA cache,
  build identity baked into /opt/lucebox-hub/IMAGE_INFO + /opt/lucebox-hub/HOST_INFO.
* Host wrapper (lucebox.sh): probe_host, smart cmd_serve (INVOCATION_ID
  guard, container-state preflight), cmd_systemctl_passthrough (already-
  active short-circuit, restart-loop detection), cmd_update (bootstrap-
  installer pattern), cmd_completion (bash/zsh/fish), config.toml reader
  (env > toml > default precedence), shellcheck-clean.
* Bootstrap installer (install.sh): bakes LUCEBOX_INSTALLED_FROM into the
  installed copy so lucebox update keeps tracking the channel; refuses
  SHA-pinned URLs without LUCEBOX_INSTALL_CHANNEL.
* In-container Python CLI (lucebox/): sparse config.toml persistence,
  config get/set/unset sub-app, models list/download sub-app (replaces
  download-models), autotune with --apply / --json / --sweep, profile
  collapsed onto luce-bench snapshot (1701 → 183 lines).
* luce-bench: snapshot subcommand + canonical HostInfo schema v2 +
  levels (level0/1/2/3) + report subcommand + submit-baseline + regrade.
* Server (C++): /props.host block + props_schema=4 + host_info read at
  startup, /props.build identity, GGUF metadata + sha256 sidecars,
  model card sidecars.
* Harness: client implementations for claude/codex/opencode/hermes/pi.
* Strict 11-field config.toml allowlist for dflash.* runtime tunables.

Deleted (rolled into new structure):
* server/scripts/bench_agent.py, bench_he.py, bench_llm.py — replaced by
  luce-bench snapshot + areas.
* lucebox configure, lucebox download-models, lucebox benchmark — replaced
  by config sub-app, models sub-app, autotune --sweep.
* luce-bench --sweep flag — moved to argv-sniff subcommand dispatch.

Conflict resolution:
* server/scripts/bench_{agent,he,llm}.py — modify/delete kept the deletion
  (feat/lucebox-docker moved bench machinery into luce-bench).
* README.md — took feat-branch version. origin/main had 19 commits worth
  of minor README tweaks since the branch base; those need to be folded
  back in as a follow-up PR.
* docs/specs/openapi-props.yaml + docs/specs/props-endpoint.md — took
  feat-branch version. origin/main had 1 link-fix commit; feat-branch
  has the schema-4 + host-block additions that strictly supersede.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`_load_or_build()` returned `config_mod.load()`'s result verbatim when
config.toml existed, ignoring `LUCEBOX_*` env vars entirely. That
contradicted the precedence lucebox.sh documents (env > toml > default)
and bit sindri in production: its config.toml had `[image]` without a
`registry` line, so the dataclass default `ghcr.io/luce-org/lucebox-hub`
beat the systemd unit's `Environment=LUCEBOX_IMAGE=ghcr.io/easel/...`.
Symptom: `lucebox start` brought up the wrong (stale luce-org) image
even after explicit `lucebox install` + `lucebox pull` against easel.

Fix: overlay env on top of whatever `load()` returns (or `live_config()`
falls back to). Only the five top-level scalars have env hooks
(LUCEBOX_VARIANT/IMAGE/PORT/CONTAINER/MODELS) — dflash/host/model
intentionally don't.

Adds two regression tests:
- env beats config.toml when toml has no explicit value for that key,
- env still wins when toml is absent (covers the live_config fallback).

102 lucebox tests pass.
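
The overlay can be sketched as below. This is not the lucebox code: the dataclass, its field names, and the container default are stand-ins; only the five env-var names, the image default, and the env > toml > default order come from the commit message.

```python
from dataclasses import dataclass, replace
from typing import Mapping

# The five top-level scalars that have env hooks, mapped to hypothetical field names.
ENV_HOOKS = {
    "LUCEBOX_VARIANT": "variant",
    "LUCEBOX_IMAGE": "image",
    "LUCEBOX_PORT": "port",
    "LUCEBOX_CONTAINER": "container",
    "LUCEBOX_MODELS": "models",
}


@dataclass(frozen=True)
class Config:
    variant: str = "cuda12"
    image: str = "ghcr.io/luce-org/lucebox-hub"
    port: int = 8080
    container: str = "lucebox"  # placeholder default
    models: str = "~/.local/share/lucebox/models"


def overlay_env(cfg: Config, env: Mapping[str, str]) -> Config:
    """Apply env on top of a loaded (or default) config: env > toml > default."""
    updates = {}
    for var, field_name in ENV_HOOKS.items():
        raw = env.get(var)
        if raw:  # unset or empty values leave the config untouched
            updates[field_name] = int(raw) if field_name == "port" else raw
    return replace(cfg, **updates)
```

In the sindri case, `cfg` would be the result of loading a config.toml with no `registry` line, so `image` holds the dataclass default until the overlay replaces it with the value from the systemd unit's environment.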

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@easel easel force-pushed the feat/lucebox-docker branch from 244257c to f4db35b Compare May 29, 2026 05:16
…g#285 CI

CI's "Lint Python surfaces touched by lucebox tooling" job ran
`ruff check .` and found 11 errors across surfaces this branch touches.
Ruff --fix handled 6 (import sorting, unused imports); 5 needed
hand-edits:

  luce-bench/src/lucebench/report.py:172  E741  rename `for l in` → `for lineup in`
  lucebox/tests/test_check.py:39, 95      E731  lambda → def stub() for the two HostFacts stubs
  lucebox/tests/test_cli.py:95            E501  wrap the LUCEBOX_HOST_GPU_LIST_CSV setenv
  lucebox/tests/test_sweep.py:174, 177    E501  wrap two CellResult constructors

22 lucebox tests touched still pass; ruff is clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@easel easel marked this pull request as ready for review May 29, 2026 05:23
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 29, 2026
Merge PR Luce-Org#285 after it changed from draft to open during the cron run. Resolve refreshed Docker/lucebox/luce-bench conflicts by taking the PR head for feature files while preserving the server include required by the existing integration stack.

Validation:
- git diff --check
- python3 -m compileall -q lucebox/src lucebox/tests luce-bench/src luce-bench/tests harness/src
- uv run --with pytest python -m pytest lucebox/tests luce-bench/tests/test_report.py luce-bench/tests/test_smoke_area.py luce-bench/tests/test_runner.py -q
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 29, 2026
Keep the primary checkout clean after integrating PR Luce-Org#285 by ignoring the generated .docker-build/ CMake scratch directory. Update the auto-integration manifest with the final PR Luce-Org#285 merge and validation details.

- test_autotune_candidate_configs.py: sort imports (ruff I001).
- download.py: api.repo_info() returns ModelInfo|DatasetInfo|SpaceInfo|KernelInfo
  and KernelInfo has no .siblings; use api.model_info() which returns ModelInfo
  (correct — we only query model repos here), resolving the mypy union-attr error.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
dusterbloom and others added 2 commits May 31, 2026 10:15
…Luce-Org#10)

Closes the two validated pieces of the adaptive-keep path (the label-free
quality-reward idea was dropped — Momus-confirmed it can't catch confident
off-task). Default-OFF; router gates these to agentic-routed requests.

- regime_router.h: two pure helpers (stdlib-only, TDD'd) —
  clamp_keep_to_floor(bandit_keep, router_floor, agentic): agentic effective
    keep = max(bandit_keep, floor) so the bandit's 0.20 ceiling can no longer
    silently undercut the router's 0.25 floor.
  compression_failed(tokens, degenerate_close, agentic_compressed, min=8):
    true on empty/degenerate output of an agentic compressed turn.
- adaptive_keep_ratio.h: per-session recover_full_next flag (+ set/consume).
- http_server.cpp: floor clamp at keep-apply; at the post-generate update site,
  on compression_failed → skip the bandit update (failure noise) and set the
  session to full keep for the next turn (deterministic recovery from the
  empty-response failure class, e.g. LONG_B t10). PFLASH_GUARD_MIN_TOKENS env
  (default 8) tunes the guard threshold.
- 59 standalone unit tests, -Werror.

LIVE-VALIDATED on RTX 3090 (server up on :18097, 34K-token prompts):
- type-gate: agentic→keep 0.250/cascade-off, retrieval→cascade-on.
- guard recovery loop: turn1 compression_failed→full-keep-next (resp_tokens=13,
  bandit update skipped); turn2 same session recover_full_next consumed→keep 1.0.
- floor clamp fired: agentic bandit 0.100 < floor 0.250 → 0.250.
Launch config (24GB): GGML_CUDA_NO_VMM=1 + --max-ctx 49152 (139264 KV OOMs the
3090 — that was the pre-existing bad_alloc, not this change). Still default-OFF
via PFLASH_ROUTER_ENABLE.
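
A Python transcription of the two helpers and the recovery flag, as a sketch (the real ones are C++ in regime_router.h and adaptive_keep_ratio.h). The signatures follow the commit message; the bodies are a reading of the description, in particular that non-agentic requests pass through unclamped and that "failed" means fewer than `min_tokens` tokens or a degenerate close.

```python
def clamp_keep_to_floor(bandit_keep: float, router_floor: float, agentic: bool) -> float:
    """Agentic requests never go below the router's floor."""
    return max(bandit_keep, router_floor) if agentic else bandit_keep


def compression_failed(tokens: int, degenerate_close: bool,
                       agentic_compressed: bool, min_tokens: int = 8) -> bool:
    """True on empty or degenerate output of an agentic compressed turn."""
    if not agentic_compressed:
        return False
    return tokens < min_tokens or degenerate_close


class Session:
    """Per-session recover_full_next flag with set/consume semantics."""

    def __init__(self) -> None:
        self.recover_full_next = False

    def next_keep(self, keep: float) -> float:
        # Consume the flag: one full-keep turn, then back to the normal ratio.
        if self.recover_full_next:
            self.recover_full_next = False
            return 1.0
        return keep
```

The floor-clamp case from the live validation corresponds to `clamp_keep_to_floor(0.100, 0.250, True)`, which yields 0.250.
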
The 2026-05-30 gemma full bench scored forge 0/30 cases with
``error_type=ValidationError`` on every row. Two stacked bugs:

1. The recording client called ``TextResponse(text=...)`` but the
   forge ``TextResponse`` field is named ``content`` — every send()
   raised a pydantic ValidationError, which surfaced as the per-row
   error_type. (Independent bug, fixed in one line: text=→content=.)

2. Even with #1 fixed, gemma emits ``call:get_country_info{country:
   "France"}call:summarize{text: "..."}`` as plain text in a ``text``
   content block — not as Anthropic ``tool_use`` structured blocks —
   so the old client surfaced text-only responses and forge would
   have nudged forever waiting for a tool call.

This patch scans the assistant text for ``call:<verb>{args}``
invocations, parses the args as relaxed JSON (json.loads first, then
a permissive pass that quotes bare keys), and synthesizes
``ToolCall`` entries that forge's WorkflowRunner consumes natively.
Malformed args are dropped (per-call, not per-response) so a single
mangled invocation doesn't crash the bench.

The forge LLMResponse contract is ``list[ToolCall] | TextResponse``
(forge_eval._forge.core.workflow), so synthesis stays within the
existing types — no anthropic.types.Message construction needed.

Why client-side: the server's chat_template / SSE emitter could
translate the plain-text shape into Anthropic tool_use blocks
upstream (cleaner long-term), but that's a C++ change with broader
scope. The client-side path also future-proofs the bench for any
other model that uses the same plain-text tool serialization
(codex-mini, DDX bead executor, etc.) — same intent already
recognized in lucebench.areas.agent's _CALL_INVOCATION pattern.

Tests cover the parsing/synthesis helper in isolation: empty input,
single calls, back-to-back calls, snake_case + kebab-case + ns:verb
names, nested braces, strings containing } chars, unbalanced
braces, and unparseable args. Full test suite remains green (291
passed, +16 from this change).
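
A minimal sketch of the scan-and-parse approach described above, not the actual `_parse_plain_text_tool_calls` helper. It returns plain dicts instead of forge `ToolCall` objects, handles only double-quoted strings, and its bare-key rewrite is a simple regex that can misfire on key-like text inside string values (in which case the call is dropped).

```python
import json
import re

_CALL = re.compile(r"call:([A-Za-z0-9_:\-]+)\{")
_BARE_KEY = re.compile(r"([{,]\s*)([A-Za-z_][A-Za-z0-9_]*)(\s*:)")


def _args_end(text: str, start: int) -> int:
    """Index just past the '}' matching the '{' at text[start], or -1."""
    depth, in_str, i = 0, False, start
    while i < len(text):
        ch = text[i]
        if in_str:
            if ch == "\\":
                i += 1          # skip the escaped character
            elif ch == '"':
                in_str = False
        elif ch == '"':
            in_str = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return i + 1
        i += 1
    return -1                   # unbalanced braces


def _relaxed_loads(raw: str):
    """Strict JSON first, then retry with bare keys quoted; None if both fail."""
    for candidate in (raw, _BARE_KEY.sub(r'\1"\2"\3', raw)):
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue
    return None


def parse_plain_text_tool_calls(text: str) -> list:
    calls, pos = [], 0
    while True:
        match = _CALL.search(text, pos)
        if match is None:
            break
        brace = match.end() - 1
        end = _args_end(text, brace)
        if end < 0:
            break
        args = _relaxed_loads(text[brace:end])
        if isinstance(args, dict):   # malformed args drop only this call
            calls.append({"name": match.group(1), "arguments": args})
        pos = end
    return calls
```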

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 31, 2026
Update the auto-integration manifest after PR Luce-Org#285 advanced during the cron run. Record the clean merge, draft list change, retained worktree, and luce-bench Forge grader validation.
easel and others added 5 commits May 31, 2026 10:05
…squashed)

Brings in the full pflash prefill-compression system as a single revertible
commit. Default-OFF behind PFLASH_ROUTER_ENABLE=1; requires Qwen3-0.6B
drafter weights to activate.

Key capabilities merged from pflash/ee7:
- ee7 early-exit drafter + anchor-transitive cascade + tail-capture guard
- Adaptive keep-ratio / anchor_radius (eliminates 64K NIAH cliff)
- Adaptive compression-regime router (type-gate: agentic=0.25, retrieval=full)
- Adaptive fa_window composition via per-request override
- PFLASH_*/DFLASH_* dual env-var aliasing with transitive cascade defaults
- Empty-response guard + bandit floor reconciliation
- Closed <think> prefill injection in Jinja renderer for Qwen3 nothink mode
- eval_quality_compare.py for LongBench F1 regression detection
- New test suites: anchor_transitive, drafter regression, regime_router

Conflicts resolved:
- .gitignore: kept both lucebox-hub entries and pflash backup-suffix entries
- chat_template.cpp: merged Qwen3 closed-think suffix injection into our
  PromptRenderResult return path
- test_server_unit.cpp: kept started_in_thinking regression suite (HEAD) and
  adapted pflash's 5 Qwen3 closed-think tests to use PromptRenderResult.text

Original 16-commit range: d4546a5..8fc961b (pflash/ee7)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

# Conflicts:
#	README.md
#	server/src/common/model_backend.h
#	server/src/qwen35/qwen35_backend.cpp
…_unit

- scripts/pflash_session_bench.py: standalone A/B benchmark for pflash
  using the multi-turn session fixture (8K-131K token cases). Sends
  the largest case fitting the server's max_ctx and reports wall/decode
  timing. Use --bucket to select a specific tier.
- Dockerfile: add test_server_unit to cmake build targets so the
  template-coverage regression suite ships in the image for CI checks.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

1 issue found across 2 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="scripts/pflash_session_bench.py">

<violation number="1">
P2: `decode` TPS is mislabeled: it is computed from total wall time, not decode time.</violation>
</file>

@@ -0,0 +1,156 @@
#!/usr/bin/env python3
P2: decode TPS is mislabeled: it is computed from total wall time, not decode time.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At scripts/pflash_session_bench.py, line 143:

<comment>`decode` TPS is mislabeled: it is computed from total wall time, not decode time.</comment>

<file context>
@@ -0,0 +1,156 @@
+            wall = result["wall_s"]
+            in_tok = result["prompt_tokens"]
+            out_tok = result["completion_tokens"]
+            tps = out_tok / wall if wall > 0 else 0
+            print(f"  wall={wall:.1f}s  in={in_tok}  out={out_tok}  "
+                  f"decode={tps:.1f}tok/s  chars={result['content_chars']}  "
</file context>
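
One way to address the labeling issue, sketched under an assumption: the result dict shown in the context only exposes `wall_s`, so a true decode rate needs a separate decode-time field. The `decode_s` key below is hypothetical; without it the sketch reports the number as end-to-end throughput instead of calling it decode.

```python
def throughput(result: dict) -> dict:
    """Return tok/s figures with labels that match what was measured."""
    wall = result["wall_s"]
    out_tok = result["completion_tokens"]
    # Wall time includes prefill, so this is end-to-end, not decode, throughput.
    rates = {"end_to_end_tps": out_tok / wall if wall > 0 else 0.0}
    decode_s = result.get("decode_s")  # hypothetical field, not in the script today
    if decode_s:
        rates["decode_tps"] = out_tok / decode_s
    return rates
```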

The raw vocab token for Gemma4's thinking channel opener is
"<|channel>thought" (id 100), not "<|channel>". The previous equality
check `raw == "<|channel>"` never matched, so the token fell through
to the <|...|> skip filter but leaked as literal text "thought\n" into
code completions, causing HumanEval code=0%.

Fix: change both streaming and non-streaming paths to
`raw.starts_with("<|channel>")`.

This was tracked as follow-up #3 in
docs/experiments/gemma4-26b-thinking-control-2026-05-25.md.
Requires image rebuild to take effect.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Contributor

cubic-dev-ai Bot commented May 31, 2026

You're iterating quickly on this pull request. To help protect your rate limits, cubic has paused automatic reviews on new pushes for now—when you're ready for another review, comment @cubic-dev-ai review.

easel and others added 6 commits May 31, 2026 11:33
The sanity-check RUN step already verifies test_dflash and dflash_server
exist. Add test_server_unit so a failed test-binary build (e.g. a future
build target removal) is caught at image-build time rather than silently
shipping without the test binary.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Captures the diagnosis (gemma forge 0/30 on 2026-05-30), the proposed
sixth detection pattern, the relaxed-JSON arg parser sketch, the
unit-test matrix, and codex's review (which forced reordering the new
pattern to slot #5 ahead of the bare-JSON sweep to avoid interception
of nested name/arguments-shaped args).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds a sixth detection pattern to `parse_tool_calls` that recognizes
the plain-text tool invocations gemma emits in chat-completion content
(`call:get_country_info{country: "France"}` /
`call:execute-bead:read-file{path: "..."}` / etc).

The 2026-05-30 gemma full bench scored forge 0/30 because every row's
output carried these `call:<verb>{...}` invocations as text rather
than structured `tool_use` content blocks. None of the existing five
envelope-shaped detectors (`<tool_call>`, `<function=...>`,
`<tool_code>`, bare JSON) match the bare `call:` shape.

The new pattern:
- Anchors on a sentinel character (whitespace, comma, semicolon,
  open/close bracket, etc.) before `call:` so narrative usages like
  `narrative.call:foo` don't match.
- Supports namespaced verbs (`execute-bead:read-file`,
  `default_api:fetch_sales_data`) and strips the namespace before
  using the verb as the ToolCall name.
- Extracts the args block via a quote- and escape-aware balanced-brace
  scanner that tolerates `"`, `'`, and `` ` `` string literals and
  tracks `[]` depth alongside `{}`.
- Parses the args as strict JSON first, then falls back to a relaxed
  rewrite that quotes bare identifier keys and normalizes single/
  backtick quoted strings to double-quoted before retrying. Malformed
  args drop the single invocation without crashing or polluting other
  calls.
- Runs *before* the bare-JSON sweep so that inner args of the form
  `call:outer{"name": "inner", "arguments": {}}` aren't hijacked into
  a spurious `inner` ToolCall by pattern #6.

Downstream the existing wiring takes over: SseEmitter::accumulate
already calls parse_tool_calls; a non-empty ToolCall list flips
finish_reason to `tool_calls`, which the Anthropic /v1/messages
branch maps to `stop_reason="tool_use"` with `tool_use` content
blocks (http_server.cpp:2030-2090) and the OpenAI branch maps to
`choices[].message.tool_calls`.

The forge client-side workaround `_parse_plain_text_tool_calls`
shipping on feat/lucebox-docker (commit deba2fd) becomes redundant
once a server with this fix is deployed. It stays in place as
defense-in-depth for older deployed servers.

Test plan: 14 new C++ unit cases in test_server_unit.cpp covering
single / back-to-back / namespaced / snake- and kebab-case verbs;
tool-allowed filtering; mid-prose rejection vs. whitespace-led
acceptance; malformed args drop; inner `{}` inside string literals;
strict-JSON and relaxed-keys arg parsing; cleaned_text scrubbing;
the codex-requested inner `name`/`arguments` interception case; and
multi-line nested-array args mirroring the snapshot data. All pass
in a standalone driver.
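
The sentinel anchoring and namespace stripping can be illustrated with a short Python regex sketch (the server-side parser is C++ and does more, including the balanced-brace scan). The exact sentinel set and the function name are assumptions; the sketch also does not skip matches that occur inside string literals.

```python
import re

# "call:" must sit at the start of the text or right after a sentinel character.
_CALL_HEAD = re.compile(
    r"(?:^|(?<=[\s,;()\[\]{}]))"                    # sentinel before call:
    r"call:([A-Za-z0-9_\-]+(?::[A-Za-z0-9_\-]+)*)"  # verb, optionally namespaced
    r"(?=\{)"                                       # args block must follow
)


def call_verbs(text: str) -> list:
    """Return tool names with any namespace prefix stripped."""
    return [m.group(1).rsplit(":", 1)[-1] for m in _CALL_HEAD.finditer(text)]


print(call_verbs('call:execute-bead:read-file{path: "..."}'))
print(call_verbs("narrative.call:foo{}"))
```
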

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 31, 2026
Port a narrow slice of PR Luce-Org#135 into the current stack: daemon cache-slot parsing, independent extra TargetCache state, graph/feature-mirror swapping, and cleanup handling. Refresh auto-integration manifest after merging advanced PR Luce-Org#285.
easel and others added 6 commits May 31, 2026 11:55
The server binary only accepts these three values; "compress" is silently
rejected at startup with pflash falling back to off. Add a caster that
raises ValueError immediately on config_set so the error is caught early
rather than manifesting as a silent pflash=off at runtime.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…rogress)

Records methodology, baseline speed result (32K session: wall=89.3s,
prefill~87s), and corrects prefill_mode="compress"→"auto" bug discovered
during setup. PFlash quality and speed legs TBD after server restart.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

PFlash requires prefix_cache_slots>0 to work. With prefix_cache_slots=0
(current optimal config), all chunks are forced (100%), adding drafter
overhead with zero compression benefit.

Speed bench result: 1291/1318 chunks forced at 42K tokens → 97.9% kept.
Quality benchmark running; expected ≈ baseline (pflash is a no-op).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

std::string::starts_with() is C++20, but CMakeLists.txt requires C++17.
Replace it with rfind("<|channel>", 0) == 0, the idiomatic C++17 equivalent.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Documents the server-side call:<verb>{} tool parser fix (PR Luce-Org#323) and
the C++17 compatibility fix for starts_with. Benchmarks running.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 31, 2026
Continue the Luce-Org#135 selective-port stack with diagnostic-only SCHED_STEP and SCHED_DRAIN daemon commands. They report request counts and active/per-slot target-cache state without mutating live scheduler state. Refresh the auto-integration manifest and record the latest Luce-Org#285 head merge.
easel and others added 2 commits May 31, 2026 12:53
…sults

- Add partial agent_recorded results (2/4 PASS vs prior 3/26 PASS)
- Identify that channel routing fix likely explains agent_recorded
  improvement, not just the call:verb parser
- Document two distinct fixes in image 1443239

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

PFlash with prefix_cache_slots=0 forces all KV chunks → zero compression.
Confirmed by bragi A/B test 2026-05-31. Update bracket comments and
module docstring to note both the drafter file AND prefix caching
requirements for pflash to be effective.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2 participants