feat(lucebox): docker stack + CLI + bench/profile + harness + luce-bench in-tree#285
Open
easel wants to merge 72 commits into
b5d4cc5 to
3642703
Compare
f2ddfc4 to
2be3eef
Compare
Collaborator
Author
Some commands to test this, copied from the README.

Install the lucebox wrapper:

    curl -fsSL https://raw.githubusercontent.com/easel/lucebox-hub/feat/lucebox-docker/lucebox.sh \
      -o ~/.local/bin/lucebox.sh && chmod +x ~/.local/bin/lucebox.sh

Run lucebox using the Docker image:

    # Override the container image to the temporary build:
    export LUCEBOX_IMAGE=ghcr.io/easel/lucebox-hub
    # Check your machine for lucebox compatibility
    lucebox check
    # Start the lucebox server
    lucebox serve

Run benchmarks against a local server:

    uvx --refresh --from "git+https://github.com/easel/lucebox-hub@feat/lucebox-docker#subdirectory=luce-bench" lucebench --url http://localhost:1236

Run benchmarks against OpenRouter:

    uvx --refresh --from "git+https://github.com/easel/lucebox-hub@feat/lucebox-docker#subdirectory=luce-bench" lucebench --base-url https://openrouter.ai/api --model qwen/qwen3.6-27b --auth-env OPENROUTER_API_KEY
…g-42 tail-capture guard ee7 truncates drafter forward at layer 7 of 28, scoring only those layers. 9.3× drafter wall at 128K (RTX 3090, Qwen3.6-27B-Q4_K_M target + Qwen2.5-0.5B-BF16 drafter). Anchor-transitive cascade rescues multi-hop on bimodal-density prompts (gated, default OFF). Bug Luce-Org#42 fix: tail-capture view-bounds guard at S%4096 in {1..7}. 5 unit tests included. Bench scripts split to follow-up PR.
At >=32K context the needle text is more likely to straddle multiple chunks (chunk_size=32), and the fixed anchor_radius=2 window (5 chunks, ~160 tokens) loses the back half of the needle digits: the model retrieves '...is 4' but truncates or hallucinates the continuation.

Adaptive scaling based on n_chunks:
- <32K context (<1024 chunks): radius=2, max_anchor_hits=8 (unchanged)
- 32-64K (1024-2047 chunks): radius=4, max_anchor_hits=16
- >=64K (>=2048 chunks): radius=8, max_anchor_hits=32

Override via PFLASH_COMPRESS_ANCHOR_RADIUS / PFLASH_COMPRESS_MAX_ANCHOR_HITS env vars (legacy DFLASH_COMPRESS_* names still accepted).

Validated at 49K context: NIAH needle 'kowefada 1596346' correctly retrieved (was: '1594' or hallucinated 'is 048394839483' before fix). Resolves the long-standing 'project_64k_quality_cliff' memory entry.
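The tiering described in this commit message can be sketched in a few lines of Python. This is only an illustration of the stated thresholds, not the C++ implementation: the function names `anchor_params` and `window_tokens` are hypothetical, and the assumption that an explicit env override always beats the adaptive tier is mine.

```python
import os

CHUNK_SIZE = 32  # tokens per chunk, as stated in the commit message


def window_tokens(radius):
    """Tokens covered by an anchor window of `radius` chunks on each side."""
    return (2 * radius + 1) * CHUNK_SIZE


def anchor_params(n_tokens, env=None):
    """Return (anchor_radius, max_anchor_hits) for a prompt of n_tokens.

    Hypothetical helper: mirrors the three tiers listed above.
    """
    env = os.environ if env is None else env
    n_chunks = n_tokens // CHUNK_SIZE
    if n_chunks < 1024:      # < 32K context
        radius, hits = 2, 8
    elif n_chunks < 2048:    # 32K-64K context
        radius, hits = 4, 16
    else:                    # >= 64K context
        radius, hits = 8, 32
    # Assumption: an explicit override wins over the adaptive tier.
    if "PFLASH_COMPRESS_ANCHOR_RADIUS" in env:
        radius = int(env["PFLASH_COMPRESS_ANCHOR_RADIUS"])
    if "PFLASH_COMPRESS_MAX_ANCHOR_HITS" in env:
        hits = int(env["PFLASH_COMPRESS_MAX_ANCHOR_HITS"])
    return radius, hits
```

With these definitions, a 49K-token prompt (1568 chunks) lands in the middle tier, and the original radius of 2 corresponds to the 5-chunk, 160-token window mentioned above.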
Mirror the gemma4_backend.cpp:75-104 defensive pattern for the qwen35 target loader and the dflash decode draft loader. After loading weight tensors, derive head_dim / n_head / n_head_kv from wq->ne[1] / wk->ne[1] and compare against GGUF-declared values; set_last_error and return false on mismatch. Makes the 'stale scalar at graph-build time' bug class structurally impossible. Load-time only, no runtime cost. Existing well-formed GGUFs are unaffected (smoke verified).
When pflash compresses, set gen_req.fa_window_override = effective_prompt + 256 so spec-decode verify sees the entire compressed prompt. Pflash already paid compute to pick which tokens matter; verify never throws any of them away. When the override would exceed 2 * cfg_.fa_window (spec-decode's drafter cost stops earning its tok/J), the C2 gate in qwen35_backend's generate() falls back to AR (fa_window=0, full attention). AR sees every kept token at every context; we choose mechanism, not visibility. Zero new CLI flags. --draft remains the only knob for composition; all per-request adaptation is internal.
…scade default-on Adds backwards-compat fallback wrappers for 6 cascade env vars in both standard and bandit code paths, so harness scripts using either spelling work against this binary. Emits one-time WARN to stderr when the legacy DFLASH_* spelling is honored. Also flips the default for `use_transitive` from `false` to `true` because the gated rare-token bridge improves multi-hop F1 with zero downside in the cascade-already-firing case.
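A minimal Python sketch of the aliasing behaviour described here, for readers who want the shape of it without reading the C++. The helper name `getenv_compat`, the per-variable (rather than global) one-time warning, and the rule that the new spelling wins when both are set are all assumptions; the commit message only states that the legacy spelling is honored with a one-time WARN on stderr.

```python
import os
import sys

_warned = set()  # legacy names we have already warned about


def _warn_stderr(message):
    print(message, file=sys.stderr)


def getenv_compat(suffix, default=None, env=None, warn=_warn_stderr):
    """Read PFLASH_<suffix>, falling back to the legacy DFLASH_<suffix>.

    Hypothetical helper; not the project's actual API.
    """
    env = os.environ if env is None else env
    new_name = "PFLASH_" + suffix
    old_name = "DFLASH_" + suffix
    if new_name in env:          # assumption: new spelling takes priority
        return env[new_name]
    if old_name in env:
        if old_name not in _warned:
            _warned.add(old_name)
            warn(f"WARN: legacy env var {old_name} honored; prefer {new_name}")
        return env[old_name]
    return default
```

Injecting `env` and `warn` keeps the helper testable without touching the real process environment.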
…th drift Single helper reads all 10 PFLASH_*/DFLASH_* env vars once. Both qwen35_score_and_compress and drafter_score_and_compress call it. Removes two 70-LOC duplicate env-reading blocks and the duplicated anchor-radius comment. Also removes dead force_chunk_neighborhood (no callers) and collapses the 4-overload load_drafter pyramid to one canonical implementation + 3 thin forwarders.
- qwen3_graph.cpp: collapse 18-line alg-note, trim VRAM prose (3 blocks), remove early_exit_n alias (inline early_exit_pre at call site)
- qwen35_backend.cpp: C2 gate 9-line → 2-line + docs ref; do_ar_decode budget-hook 15-line → 4-line + docs ref
- http_server.cpp: Design 1 rationale 13-line → 2-line + docs ref
- model_backend.h: BudgetHook 23-line essay → 3-line + docs ref
- gguf_target_loader.cpp: 4-line prose tail → 1-line
- .gitignore: ignore *.git-head / *.pre-pflash-rename workdir artifacts
- docs/: pflash-compress-cfg.md, pflash-adaptive-composition.md, anchor-transitive.md (consolidated rationale)
…nking is off The hard-coded renderer appends a closed think prefill when thinking is disabled. Some Qwen3.6 Jinja templates omit that final assistant suffix, leaving the model in the wrong decoding state for tool use. Mirror the hard-coded behavior here when the rendered prompt ends with a bare assistant generation prompt; tolerate trailing-whitespace variants (single \n, double \n\n, trailing space). Diagnosed by Round 5b D peer-chat showing dflash drafter accept_rate=0.0%: the drafter was distilled with the closed-think suffix in its training distribution; the Unsloth Qwen3-Coder template doesn't emit it, so target and drafter disagree on what comes after <|im_start|>assistant\n.
… only The previous commit applied the closed-think suffix to all Jinja-rendered prompts. Add arch_hint (ChatFormat) parameter to render_chat_template_jinja, defaulting to QWEN3, and guard the post-processing block with arch_hint == ChatFormat::QWEN3. Call site in http_server.cpp passes chat_format_ so other archs (Laguna, Gemma4) are unaffected. qwen35moe inherits ChatFormat::QWEN3 by design (matches drafter distillation). 5 unit tests cover: thinking-off appends, thinking-on no-append, non-Qwen3 arch no-append (Laguna + Gemma4), qwen35moe inherits QWEN3, no double-append when template already closes the think block. Diagnosis + verification protocol in docs/pflash-drafter-template-alignment.md.
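The post-processing rule from these two commits can be illustrated with a short Python sketch. It is not the C++ code in `render_chat_template_jinja`: the function name `align_prompt` and the string arch tags are hypothetical, and the exact suffix `<think>\n\n</think>\n\n` is an assumption based on the usual Qwen3 no-think prefill, since the commit messages only call it a "closed think prefill".

```python
GEN_PROMPT = "<|im_start|>assistant"
CLOSED_THINK = "<think>\n\n</think>\n\n"  # assumption: Qwen3 no-think prefill


def align_prompt(rendered, arch, thinking_enabled):
    """Append the closed-think suffix to a bare Qwen3 generation prompt.

    Hypothetical helper mirroring the guard described above.
    """
    if arch != "qwen3" or thinking_enabled:
        return rendered  # other archs and thinking-on are left untouched
    stripped = rendered.rstrip(" \n")  # tolerate "\n", "\n\n", trailing space
    if not stripped.endswith(GEN_PROMPT):
        return rendered  # e.g. the template already closed the think block
    return stripped + "\n" + CLOSED_THINK
```

Normalising the trailing whitespace before the check is what makes the rule idempotent: a prompt that already ends with the closed think block no longer ends with the bare generation prompt, so nothing is appended twice.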
Extract the C2 spec-decode gate from an inline expression in qwen35_backend.cpp into a pure predicate header c2_gate.h. Zero behavior change. Identical math:

    (fa_window_override == 0) || (fa_window_override <= 2 * fa_window_cfg)

The new header documents the empirically-derived rationale: at compressed KV sizes (pflash compression of long prompts), T_draft/T_target ratio approaches 1, eliminating spec-decode's profit margin over AR. Empirical at D_composition 128K replay: AR=27.5 tok/s vs forced spec-decode=5.74 tok/s. The gate correctly blocks spec-decode when eff_fa_window > 2*fa_window_cfg.

Adds 5 unit tests locking in the predicate's behavior with explicit Round 5 4-arm matrix bench citations.

Files:
- server/src/qwen35/c2_gate.h (new)
- server/src/qwen35/qwen35_backend.cpp (+1 include, inline -> call)
- server/test/test_server_unit.cpp (+60 LOC, 5 tests)
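For readers skimming the thread, the gate and the per-request override it consumes can be restated in Python. This is a sketch of the predicate quoted above plus the `effective_prompt + 256` rule from the earlier composition commit; the function names are hypothetical, and the window size of 2048 in the usage note is an arbitrary example value, not a project default.

```python
def fa_window_override_for(effective_prompt_tokens):
    """Override used when pflash compressed the prompt (prompt + 256 tokens)."""
    return effective_prompt_tokens + 256


def c2_allows_spec_decode(fa_window_override, fa_window_cfg):
    """Pure predicate: same math as the inline expression it replaces."""
    return fa_window_override == 0 or fa_window_override <= 2 * fa_window_cfg


def decode_path(fa_window_override, fa_window_cfg):
    """Return "spec" when the gate passes, else "ar" (full attention)."""
    if c2_allows_spec_decode(fa_window_override, fa_window_cfg):
        return "spec"
    return "ar"
```

With a configured window of 2048, an override of 4096 still takes the spec-decode path, while 4097 falls back to AR; an override of 0 (no compression) always passes the gate.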
…nch in-tree
Squashes 78 commits from feat/lucebox-docker (PR Luce-Org#285) onto origin/main. Net: 189 files changed.

Major workstreams folded in:
* Docker prebuild stack: ghcr.io/easel/lucebox-hub:cuda12 image, multi-stage Dockerfile, docker-bake.hcl, .github/workflows/docker.yml with GHA cache, build identity baked into /opt/lucebox-hub/IMAGE_INFO + /opt/lucebox-hub/HOST_INFO.
* Host wrapper (lucebox.sh): probe_host, smart cmd_serve (INVOCATION_ID guard, container-state preflight), cmd_systemctl_passthrough (already-active short-circuit, restart-loop detection), cmd_update (bootstrap-installer pattern), cmd_completion (bash/zsh/fish), config.toml reader (env > toml > default precedence), shellcheck-clean.
* Bootstrap installer (install.sh): bakes LUCEBOX_INSTALLED_FROM into the installed copy so lucebox update keeps tracking the channel; refuses SHA-pinned URLs without LUCEBOX_INSTALL_CHANNEL.
* In-container Python CLI (lucebox/): sparse config.toml persistence, config get/set/unset sub-app, models list/download sub-app (replaces download-models), autotune with --apply / --json / --sweep, profile collapsed onto luce-bench snapshot (1701 → 183 lines).
* luce-bench: snapshot subcommand + canonical HostInfo schema v2 + levels (level0/1/2/3) + report subcommand + submit-baseline + regrade.
* Server (C++): /props.host block + props_schema=4 + host_info read at startup, /props.build identity, GGUF metadata + sha256 sidecars, model card sidecars.
* Harness: client implementations for claude/codex/opencode/hermes/pi.
* Strict 11-field config.toml allowlist for dflash.* runtime tunables.

Deleted (rolled into new structure):
* server/scripts/bench_agent.py, bench_he.py, bench_llm.py: replaced by luce-bench snapshot + areas.
* lucebox configure, lucebox download-models, lucebox benchmark: replaced by config sub-app, models sub-app, autotune --sweep.
* luce-bench --sweep flag: moved to argv-sniff subcommand dispatch.
Conflict resolution:
* server/scripts/bench_{agent,he,llm}.py: modify/delete kept the deletion (feat/lucebox-docker moved bench machinery into luce-bench).
* README.md: took feat-branch version. origin/main had 19 commits' worth of minor README tweaks since the branch base; those need to be folded back in as a follow-up PR.
* docs/specs/openapi-props.yaml + docs/specs/props-endpoint.md: took feat-branch version. origin/main had 1 link-fix commit; feat-branch has the schema-4 + host-block additions that strictly supersede.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`_load_or_build()` returned `config_mod.load()`'s result verbatim when config.toml existed, ignoring `LUCEBOX_*` env vars entirely. That contradicted the precedence lucebox.sh documents (env > toml > default) and bit sindri in production: its config.toml had `[image]` without a `registry` line, so the dataclass default `ghcr.io/luce-org/lucebox-hub` beat the systemd unit's `Environment=LUCEBOX_IMAGE=ghcr.io/easel/...`. Symptom: `lucebox start` brought up the wrong (stale luce-org) image even after explicit `lucebox install` + `lucebox pull` against easel. Fix: overlay env on top of whatever `load()` returns (or `live_config()` falls back to). Only the five top-level scalars have env hooks (LUCEBOX_VARIANT/IMAGE/PORT/CONTAINER/MODELS) — dflash/host/model intentionally don't. Adds two regression tests: - env beats config.toml when toml has no explicit value for that key, - env still wins when toml is absent (covers the live_config fallback). 102 lucebox tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
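The precedence being restored here (env > toml > default) is easy to state as code. The sketch below is not the actual `_load_or_build()`; `resolve`, `DEFAULTS`, and `ENV_HOOKS` are hypothetical names, and every default except the `ghcr.io/luce-org/lucebox-hub` image is a made-up placeholder. Only the five env variable names come from the commit message.

```python
# Placeholder defaults; only the image value is taken from the commit message.
DEFAULTS = {
    "variant": "cuda12",
    "image": "ghcr.io/luce-org/lucebox-hub",
    "port": 8080,
    "container": "lucebox",
    "models": "~/models",
}

# The five top-level scalars that have env hooks.
ENV_HOOKS = {
    "variant": "LUCEBOX_VARIANT",
    "image": "LUCEBOX_IMAGE",
    "port": "LUCEBOX_PORT",
    "container": "LUCEBOX_CONTAINER",
    "models": "LUCEBOX_MODELS",
}


def resolve(toml_values, env):
    """Layer defaults, then config.toml values, then env overrides."""
    cfg = dict(DEFAULTS)
    cfg.update({k: v for k, v in toml_values.items() if k in DEFAULTS})
    for key, var in ENV_HOOKS.items():
        if var in env:
            # Coerce to the default's type so LUCEBOX_PORT becomes an int.
            cfg[key] = type(DEFAULTS[key])(env[var])
    return cfg
```

The failure described above corresponds to skipping the final loop whenever a config file exists: a toml with no image key then resolves to the dataclass default even though `LUCEBOX_IMAGE` is set.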
244257c to
f4db35b
Compare
…g#285 CI
CI's "Lint Python surfaces touched by lucebox tooling" job ran `ruff check .` and found 11 errors across surfaces this branch touches. Ruff --fix handled 6 (import sorting, unused imports); 5 needed hand-edits:
- luce-bench/src/lucebench/report.py:172 (E741): rename `for l in` → `for lineup in`
- lucebox/tests/test_check.py:39, 95 (E731): lambda → def stub() for the two HostFacts stubs
- lucebox/tests/test_cli.py:95 (E501): wrap the LUCEBOX_HOST_GPU_LIST_CSV setenv
- lucebox/tests/test_sweep.py:174, 177 (E501): wrap two CellResult constructors

22 lucebox tests touched still pass; ruff is clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
May 29, 2026
Merge PR Luce-Org#285 after it changed from draft to open during the cron run. Resolve refreshed Docker/lucebox/luce-bench conflicts by taking the PR head for feature files while preserving the server include required by the existing integration stack.

Validation:
- git diff --check
- python3 -m compileall -q lucebox/src lucebox/tests luce-bench/src luce-bench/tests harness/src
- uv run --with pytest python -m pytest lucebox/tests luce-bench/tests/test_report.py luce-bench/tests/test_smoke_area.py luce-bench/tests/test_runner.py -q
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
May 29, 2026
Keep the primary checkout clean after integrating PR Luce-Org#285 by ignoring the generated .docker-build/ CMake scratch directory. Update the auto-integration manifest with the final PR Luce-Org#285 merge and validation details.
- test_autotune_candidate_configs.py: sort imports (ruff I001). - download.py: api.repo_info() returns ModelInfo|DatasetInfo|SpaceInfo|KernelInfo and KernelInfo has no .siblings; use api.model_info() which returns ModelInfo (correct — we only query model repos here), resolving the mypy union-attr error. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…Luce-Org#10)
Closes the two validated pieces of the adaptive-keep path (the label-free quality-reward idea was dropped; Momus-confirmed it can't catch confident off-task). Default-OFF; router gates these to agentic-routed requests.
- regime_router.h: two pure helpers (stdlib-only, TDD'd):
  - clamp_keep_to_floor(bandit_keep, router_floor, agentic): agentic effective keep = max(bandit_keep, floor), so the bandit's 0.20 ceiling can no longer silently undercut the router's 0.25 floor.
  - compression_failed(tokens, degenerate_close, agentic_compressed, min=8): true on empty/degenerate output of an agentic compressed turn.
- adaptive_keep_ratio.h: per-session recover_full_next flag (+ set/consume).
- http_server.cpp: floor clamp at keep-apply; at the post-generate update site, on compression_failed → skip the bandit update (failure noise) and set the session to full keep for the next turn (deterministic recovery from the empty-response failure class, e.g. LONG_B t10). PFLASH_GUARD_MIN_TOKENS env (default 8) tunes the guard threshold.
- 59 standalone unit tests, -Werror.

LIVE-VALIDATED on RTX 3090 (server up on :18097, 34K-token prompts):
- type-gate: agentic→keep 0.250/cascade-off, retrieval→cascade-on.
- guard recovery loop: turn1 compression_failed→full-keep-next (resp_tokens=13, bandit update skipped); turn2 same session recover_full_next consumed→keep 1.0.
- floor clamp fired: agentic bandit 0.100 < floor 0.250 → 0.250.

Launch config (24GB): GGML_CUDA_NO_VMM=1 + --max-ctx 49152 (139264 KV OOMs the 3090; that was the pre-existing bad_alloc, not this change). Still default-OFF via PFLASH_ROUTER_ENABLE.
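The two helpers named in this commit are small enough to restate in Python. The signatures follow the commit message, but the bodies are a guess at the semantics: in particular, treating "empty/degenerate" as "fewer than `min_tokens` tokens, or a degenerate close" is an assumption, not something the message spells out.

```python
def clamp_keep_to_floor(bandit_keep, router_floor, agentic):
    """Agentic requests never keep less than the router's floor."""
    return max(bandit_keep, router_floor) if agentic else bandit_keep


def compression_failed(tokens, degenerate_close, agentic_compressed, min_tokens=8):
    """True when an agentic compressed turn produced empty/degenerate output.

    Assumed semantics: too few response tokens, or a degenerate close.
    """
    if not agentic_compressed:
        return False  # the guard only applies to agentic compressed turns
    return tokens < min_tokens or degenerate_close
```

The first helper reproduces the live-validation line above: a bandit keep of 0.100 against a floor of 0.250 yields 0.250 for an agentic request and stays at 0.100 otherwise.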
The 2026-05-30 gemma full bench scored forge 0/30 cases with ``error_type=ValidationError`` on every row. Two stacked bugs:

1. The recording client called ``TextResponse(text=...)`` but the forge ``TextResponse`` field is named ``content``, so every send() raised a pydantic ValidationError, which surfaced as the per-row error_type. (Independent bug, fixed in one line: text= → content=.)
2. Even with bug 1 fixed, gemma emits ``call:get_country_info{country: "France"}call:summarize{text: "..."}`` as plain text in a ``text`` content block, not as Anthropic ``tool_use`` structured blocks, so the old client surfaced text-only responses and forge would have nudged forever waiting for a tool call.

This patch scans the assistant text for ``call:<verb>{args}`` invocations, parses the args as relaxed JSON (json.loads first, then a permissive pass that quotes bare keys), and synthesizes ``ToolCall`` entries that forge's WorkflowRunner consumes natively. Malformed args are dropped (per-call, not per-response) so a single mangled invocation doesn't crash the bench. The forge LLMResponse contract is ``list[ToolCall] | TextResponse`` (forge_eval._forge.core.workflow), so synthesis stays within the existing types; no anthropic.types.Message construction is needed.

Why client-side: the server's chat_template / SSE emitter could translate the plain-text shape into Anthropic tool_use blocks upstream (cleaner long-term), but that's a C++ change with broader scope. The client-side path also future-proofs the bench for any other model that uses the same plain-text tool serialization (codex-mini, DDX bead executor, etc.); the same intent is already recognized in lucebench.areas.agent's _CALL_INVOCATION pattern.

Tests cover the parsing/synthesis helper in isolation: empty input, single calls, back-to-back calls, snake_case + kebab-case + ns:verb names, nested braces, strings containing } chars, unbalanced braces, and unparseable args.
Full test suite remains green (291 passed, +16 from this change). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
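The "strict first, then relaxed" argument parsing described in this commit can be sketched as follows. This is a simplified stand-in, not the helper that ships in the forge client: `parse_relaxed_args` and the `_BARE_KEY` pattern are hypothetical, and the regex-based key quoting is deliberately naive.

```python
import json
import re

# Hypothetical pattern: a bare identifier key right after "{" or ",".
_BARE_KEY = re.compile(r'([{,]\s*)([A-Za-z_][A-Za-z0-9_]*)(\s*:)')


def parse_relaxed_args(text):
    """Return the args as a dict, or None when they cannot be parsed."""
    # Pass 1 is the text as-is (strict JSON); pass 2 quotes bare keys.
    for candidate in (text, _BARE_KEY.sub(r'\1"\2"\3', text)):
        try:
            value = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(value, dict):
            return value
    return None  # caller drops this one invocation, not the whole response
```

Trying strict JSON first means well-formed args are never touched by the rewrite. The naive regex can still corrupt a string value that itself contains `, word:`; in that case the second pass fails to parse and the call is dropped, which matches the per-call drop behaviour described above but is stricter than a real tokenizer-based rewrite would be.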
4 tasks
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
May 31, 2026
Update the auto-integration manifest after PR Luce-Org#285 advanced during the cron run. Record the clean merge, draft list change, retained worktree, and luce-bench Forge grader validation.
…squashed)
Brings in the full pflash prefill-compression system as a single revertible commit. Default-OFF behind PFLASH_ROUTER_ENABLE=1; requires Qwen3-0.6B drafter weights to activate.

Key capabilities merged from pflash/ee7:
- ee7 early-exit drafter + anchor-transitive cascade + tail-capture guard
- Adaptive keep-ratio / anchor_radius (eliminates 64K NIAH cliff)
- Adaptive compression-regime router (type-gate: agentic=0.25, retrieval=full)
- Adaptive fa_window composition via per-request override
- PFLASH_*/DFLASH_* dual env-var aliasing with transitive cascade defaults
- Empty-response guard + bandit floor reconciliation
- Closed <think> prefill injection in Jinja renderer for Qwen3 nothink mode
- eval_quality_compare.py for LongBench F1 regression detection
- New test suites: anchor_transitive, drafter regression, regime_router

Conflicts resolved:
- .gitignore: kept both lucebox-hub entries and pflash backup-suffix entries
- chat_template.cpp: merged Qwen3 closed-think suffix injection into our PromptRenderResult return path
- test_server_unit.cpp: kept started_in_thinking regression suite (HEAD) and adapted pflash's 5 Qwen3 closed-think tests to use PromptRenderResult.text

Original 16-commit range: d4546a5..8fc961b (pflash/ee7)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
# Conflicts:
#   README.md
#   server/src/common/model_backend.h
#   server/src/qwen35/qwen35_backend.cpp
…_unit
- scripts/pflash_session_bench.py: standalone A/B benchmark for pflash using the multi-turn session fixture (8K-131K token cases). Sends the largest case fitting the server's max_ctx and reports wall/decode timing. Use --bucket to select a specific tier.
- Dockerfile: add test_server_unit to cmake build targets so the template-coverage regression suite ships in the image for CI checks.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Contributor
1 issue found across 2 files (changes from recent commits).
P2: `decode` TPS is mislabeled: it is computed from total wall time, not decode time (scripts/pflash_session_bench.py, line 143).
<file context>
@@ -0,0 +1,156 @@
+ wall = result["wall_s"]
+ in_tok = result["prompt_tokens"]
+ out_tok = result["completion_tokens"]
+ tps = out_tok / wall if wall > 0 else 0
+ print(f" wall={wall:.1f}s in={in_tok} out={out_tok} "
+ f"decode={tps:.1f}tok/s chars={result['content_chars']} "
</file context>
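One way to address the review comment is to report the wall-time figure under an honest label and only print a decode rate when a decode-only duration is available. The sketch below assumes a hypothetical `decode_s` field that the benchmark would have to measure itself (for example, time from first streamed token to completion); the quoted script only shows `wall_s`, `prompt_tokens`, and `completion_tokens`.

```python
def throughput(result):
    """Split end-to-end and decode-only tokens/sec for one bench result."""
    wall = result["wall_s"]
    out_tok = result["completion_tokens"]
    decode_s = result.get("decode_s")  # hypothetical field, may be absent
    return {
        # Output tokens over total wall time: includes prefill, so not "decode".
        "e2e_tps": out_tok / wall if wall > 0 else 0.0,
        # Only meaningful when a decode-only duration was measured.
        "decode_tps": out_tok / decode_s if decode_s else None,
    }
```

For a 32K-token session where prefill dominates the wall time, the two numbers differ by an order of magnitude, which is why the label matters.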
The raw vocab token for Gemma4's thinking channel opener is
"<|channel>thought" (id 100), not "<|channel>". The previous equality
check `raw == "<|channel>"` never matched, so the token fell through
to the <|...|> skip filter but leaked as literal text "thought\n" into
code completions, causing HumanEval code=0%.
Fix: change both streaming and non-streaming paths to
`raw.starts_with("<|channel>")`.
This was tracked as follow-up #3 in
docs/experiments/gemma4-26b-thinking-control-2026-05-25.md.
Requires image rebuild to take effect.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The sanity-check RUN step already verifies test_dflash and dflash_server exist. Add test_server_unit so a failed test-binary build (e.g. a future build target removal) is caught at image-build time rather than silently shipping without the test binary. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Captures the diagnosis (gemma forge 0/30 on 2026-05-30), the proposed sixth detection pattern, the relaxed-JSON arg parser sketch, the unit-test matrix, and codex's review (which forced reordering the new pattern to slot #5, ahead of the bare-JSON sweep, to avoid interception of nested name/arguments-shaped args). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a sixth detection pattern to `parse_tool_calls` that recognizes
the plain-text tool invocations gemma emits in chat-completion content
(`call:get_country_info{country: "France"}` /
`call:execute-bead:read-file{path: "..."}` / etc).
The 2026-05-30 gemma full bench scored forge 0/30 because every row's
output carried these `call:<verb>{...}` invocations as text rather
than structured `tool_use` content blocks. None of the existing five
envelope-shaped detectors (`<tool_call>`, `<function=...>`,
`<tool_code>`, bare JSON) match the bare `call:` shape.
The new pattern:
- Anchors on a sentinel character (whitespace, comma, semicolon,
open/close bracket, etc.) before `call:` so narrative usages like
`narrative.call:foo` don't match.
- Supports namespaced verbs (`execute-bead:read-file`,
`default_api:fetch_sales_data`) and strips the namespace before
using the verb as the ToolCall name.
- Extracts the args block via a quote- and escape-aware balanced-brace
scanner that tolerates `"`, `'`, and `` ` `` string literals and
tracks `[]` depth alongside `{}`.
- Parses the args as strict JSON first, then falls back to a relaxed
rewrite that quotes bare identifier keys and normalizes single/
backtick quoted strings to double-quoted before retrying. Malformed
args drop the single invocation without crashing or polluting other
calls.
- Runs *before* the bare-JSON sweep so that inner args of the form
`call:outer{"name": "inner", "arguments": {}}` aren't hijacked into
a spurious `inner` ToolCall by pattern #6.
Downstream the existing wiring takes over: SseEmitter::accumulate
already calls parse_tool_calls; a non-empty ToolCall list flips
finish_reason to `tool_calls`, which the Anthropic /v1/messages
branch maps to `stop_reason="tool_use"` with `tool_use` content
blocks (http_server.cpp:2030-2090) and the OpenAI branch maps to
`choices[].message.tool_calls`.
The forge client-side workaround `_parse_plain_text_tool_calls`
shipping on feat/lucebox-docker (commit deba2fd) becomes redundant
once a server with this fix is deployed. It stays in place as
defense-in-depth for older deployed servers.
Test plan: 14 new C++ unit cases in test_server_unit.cpp covering
single / back-to-back / namespaced / snake- and kebab-case verbs;
tool-allowed filtering; mid-prose rejection vs. whitespace-led
acceptance; malformed args drop; inner `{}` inside string literals;
strict-JSON and relaxed-keys arg parsing; cleaned_text scrubbing;
the codex-requested inner `name`/`arguments` interception case; and
multi-line nested-array args mirroring the snapshot data. All pass
in a standalone driver.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
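The detection steps in this commit message (sentinel anchoring, namespace stripping, and a quote-aware balanced scan) can be sketched in Python. This is an illustration of the technique, not a port of `parse_tool_calls`: the names `find_calls`, `scan_balanced`, and `_CALL`, and the exact sentinel character set, are assumptions.

```python
import re

PAIRS = {"}": "{", "]": "["}

# "call:" must sit at the start of the text or right after a sentinel character.
_CALL = re.compile(r"(?:^|(?<=[\s,;\[\](){}]))call:([A-Za-z_][\w:\-]*)\{")


def scan_balanced(text, start):
    """Index just past the brace closing text[start] == "{", or -1."""
    stack, quote, i = [], None, start
    while i < len(text):
        ch = text[i]
        if quote:
            if ch == "\\":
                i += 1                 # skip the escaped character
            elif ch == quote:
                quote = None           # string literal closed
        elif ch in "\"'`":
            quote = ch                 # string literal opened
        elif ch in "{[":
            stack.append(ch)
        elif ch in "}]":
            if not stack or stack.pop() != PAIRS[ch]:
                return -1              # mismatched bracket
            if not stack:
                return i + 1
        i += 1
    return -1                          # ran out of text while still open


def find_calls(text):
    """Return [(verb, raw_args)] for each well-formed call:<verb>{...}."""
    calls, last_end = [], 0
    for m in _CALL.finditer(text):
        if m.start() < last_end:
            continue                   # sits inside the previous call's args
        end = scan_balanced(text, m.end() - 1)
        if end == -1:
            continue                   # malformed args drop only this call
        verb = m.group(1).rsplit(":", 1)[-1]  # strip any namespace prefix
        calls.append((verb, text[m.end() - 1:end]))
        last_end = end
    return calls
```

Skipping matches that start before `last_end` is the piece that keeps a `call:` appearing inside another call's string argument from being picked up as its own invocation.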
13 tasks
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
May 31, 2026
Port a narrow slice of PR Luce-Org#135 into the current stack: daemon cache-slot parsing, independent extra TargetCache state, graph/feature-mirror swapping, and cleanup handling. Refresh auto-integration manifest after merging advanced PR Luce-Org#285.
The server binary only accepts these three values; "compress" is silently rejected at startup with pflash falling back to off. Add a caster that raises ValueError immediately on config_set so the error is caught early rather than manifesting as a silent pflash=off at runtime. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rogress) Records methodology, baseline speed result (32K session: wall=89.3s, prefill~87s), and corrects prefill_mode="compress"→"auto" bug discovered during setup. PFlash quality and speed legs TBD after server restart. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PFlash requires prefix_cache_slots>0 to work. With prefix_cache_slots=0 (current optimal config), all chunks are forced (100%), adding drafter overhead with zero compression benefit. Speed bench result: 1291/1318 chunks forced at 42K tokens → 97.9% kept. Quality benchmark running; expected ≈ baseline (pflash is a no-op). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
std::string::starts_with() is C++20 but CMakeLists.txt requires C++17.
Replace with rfind("<|channel>", 0) == 0, idiomatic C++17 equivalent.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Documents the server-side call:<verb>{} tool parser fix (PR Luce-Org#323) and
the C++17 compatibility fix for starts_with. Benchmarks running.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
May 31, 2026
Continue the Luce-Org#135 selective-port stack with diagnostic-only SCHED_STEP and SCHED_DRAIN daemon commands. They report request counts and active/per-slot target-cache state without mutating live scheduler state. Refresh the auto-integration manifest and record the latest Luce-Org#285 head merge.
…sults
- Add partial agent_recorded results (2/4 PASS vs prior 3/26 PASS)
- Identify that channel routing fix likely explains agent_recorded improvement, not just the call:verb parser
- Document two distinct fixes in image 1443239

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PFlash with prefix_cache_slots=0 forces all KV chunks → zero compression. Confirmed by bragi A/B test 2026-05-31. Update bracket comments and module docstring to note both the drafter file AND prefix caching requirements for pflash to be effective. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This PR turns Lucebox into a one-command local inference deployment and ships the two tools that operate it: `lucebox` (the host CLI that runs and tunes the server) and `luce-bench` (the benchmark + grading framework that measures it). All three ship together so a fresh box goes from nothing to a tuned, benchmarked server with a single install.

The three pieces, what each is, and how to use it:
1. Docker: the server image
A CUDA 12.8 image (`ghcr.io/luce-org/lucebox-hub:cuda12`) that builds the dflash server and bundles `server/`, `lucebox/`, `harness/`, and `luce-bench/`. The entrypoint dispatches `serve` (default), `benchmark`, any `lucebox` subcommand, or `shell`. An in-container autotune fallback picks VRAM-tiered defaults and resolves the draft GGUF by target architecture (gemma4 → gemma drafter, qwen3.6 → dflash-draft-3.6).
Use it directly:
Image tags: `:cuda12`, `:vX.Y.Z-cuda12`, `:X.Y-cuda12`, `:sha-<short>-cuda12`. Built and pushed by `.github/workflows/docker.yml`; `docker-bake.hcl` has a `cuda13` slot ready.
2. lucebox: the host CLI
`lucebox.sh` is the host-side wrapper (deps: `docker` + `nvidia-smi` only). It probes the host, writes a tuned `config.toml`, runs the container as a user-systemd service, and delegates provisioning/workloads to the in-container Python CLI (`models`, `autotune`, `profile`, `smoke`, `config`, the client drivers).
Stand a server up:
Tune it to the GPU:
The running config is observable at `GET /props` (schema 4), which now reports a `host` block (kernel, OS, WSL vs native, driver, CPU, RAM, GPU), so a server self-describes its real config and host.
3. luce-bench: the benchmark + grading framework
In-tree workspace member (`luce-bench/`, `0.2.7.dev0`) that scores any OpenAI/Anthropic-compatible endpoint and writes versioned, comparable result files. Areas: `smoke`, `ds4-eval` (92 reasoning items), `gsm8k`, `truthfulqa-mc1`, `hellaswag`, `code`, `longctx`, `agent`, `agent_recorded`, `forge`. Every result stamps a per-area `grader_version` and a `host` block (from `/props.host`, or a clearly-marked client-side fallback for servers without `/props`).
Run it:

    uvx --from 'git+https://github.com/easel/lucebox-hub@feat/lucebox-docker#subdirectory=luce-bench' \
      luce-bench --base-url http://localhost:8080 --model dflash --areas all --no-think

Thinking control is portable. Each request carries three control shapes (`chat_template_kwargs.enable_thinking`, Anthropic `thinking:{type}`, `reasoning_effort`). For servers that ignore the API flags (e.g. OpenRouter), `--prompt-thinking-control {auto,on,off}` (default `auto`) injects the model family's in-band token (`/no_think`, `/think`); `auto` fires only when `/props` shows no server-side enforcement. A post-run verifier records `thinking_control_honored` so a nothink run that secretly reasoned is flagged, not silently mislabeled.
Comparing results: runs from one grader version are comparable as written. For older snapshots graded by a different version, `luce-bench regrade <dirs>` re-scores stored outputs at the current pinned grader and refuses to place mismatched-version (or mismatched-host) runs in the same row. `report` / `snapshot` / `submit-baseline` round out the reporting surface.
Also in this PR
- `harness/`: drives real clients (`claude_code`, `codex`, `opencode`, `hermes`, `pi`, `openclaw`) against a running server; `lucebox profile` delegates bench runs here.
- `share/model_cards/{qwen3.6-27b,gemma-4-26b-a4b-it,gemma-4-31b-it,laguna-xs.2}.json` + `_schema.json`, so the server resolves sampler defaults, thinking budgets, and the force-close hint per model.
- `pyproject.toml` declares all members (`server`, `lucebox`, `luce-bench`, `harness`, `optimizations/{megakernel,pflash}`); `[tool.uv.sources] luce-bench = { workspace = true }` replaces the prior git-tag pin.
- `release-luce-bench.yml` publishes to PyPI on `luce-bench-v*` tags.
- `server/docs/` benchmark-snapshot spec and experiment write-ups.
- `server/scripts/bench_*.py` (their work now lives in `luce-bench`).
Out of scope / follow-ups
Validation
- `uv sync` clean on the workspace; luce-bench test suite passes.
- `--areas all` sweeps run end-to-end against bragi (RTX 5090 Laptop), sindri (RTX 3090 Ti), vidar (M2 Ultra / MLX), and OpenRouter, think and nothink, all on one grader version.
- `/props.host` confirmed populated on lucebox servers (bragi + sindri report WSL2); OpenRouter nothink confirmed honored via client-side `/no_think` injection.