Skip to content

feat(dFlash): MTP foundation#237

Open
dusterbloom wants to merge 13 commits into
Luce-Org:mainfrom
dusterbloom:feat/dflash-mtp-foundation
Open

feat(dFlash): MTP foundation#237
dusterbloom wants to merge 13 commits into
Luce-Org:mainfrom
dusterbloom:feat/dflash-mtp-foundation

Conversation

@dusterbloom
Copy link
Copy Markdown
Collaborator

@dusterbloom dusterbloom commented May 20, 2026

Summary

Adds Qwen3.5/3.6 NextN MTP support: IMtpModule interface + chain runner in common/, Qwen35MtpModule concrete impl, qwen35 backend wiring, and --mtp-gguf flags in dflash_server.

Works with both GGUF layouts:

  • Separate files: backbone GGUF as primary + MTP-head GGUF via --mtp-gguf
  • Unsloth single-file combined GGUF (Qwen3.6-27B-Q4_K_M-mtp.gguf): pass the same file as both the primary and --mtp-gguf; the loader reads backbone tensors and MTP head tensors from it. Verified greedy-equivalent + same 1.82× speedup as the separate-files setup.

Validation

Single-prompt HTTP (/v1/chat/completions, temp=0)

HumanEval-10 via harness/benchmarks/generation_benchmark.py on Qwen3.6-27B Q4_K_M + MTP-Q4_K_M (RTX 3090):

Mode mean tok/s speedup accept rate
AR 32.46 1.00x
chain γ=2 59.24 1.82x 0.93–1.00

All 10 prompts pass functional checks. Greedy text byte-identical on 8/10; remaining 2 within the AR-vs-AR non-determinism baseline (verified via control run: same prompts produced 9/10 identical text across two AR-only runs of the same server).

Confirmed equivalent between separate-files and unsloth single-file layouts: 1.82× → 1.82× (statistically identical), 8/10 byte-identical text in both, same 95–100% accept rate range.

Agentic-CLI smoke (OpenAI Responses API)

pi (v0.55.3) via harness/clients/run_pi.sh: server starts with MTP, SSE streaming delivers tokens, [mtp_decode] iters=2 proposed=4 accepted=4 accept_rate=1.00, marker received exact-match. Proves the full HTTP → Responses API → SSE → MTP chain → tokenizer → client round-trip.

Long-prompt resilience

25K-token prompt under --mtp-gguf + --max-ctx 65536: server stays alive, MTP fires (iters=16, 50% accept), real response returned. Earlier silent SIGKILL on long prompts (DFLASH27B_MTP_CTX default 8192 not tracking max_ctx) fixed in `ada7bb0`; warm_head_kv overflow now triggers graceful AR fallback.

Unit tests

test_common_mtp_orchestrator: 20/20 pass — chain runner state machine (γ propagation, EOS termination, partial-accept rollback, n_gen termination), orchestrator lifecycle, Qwen35MtpModule input-validation guards, T20 ExternalDrafter partial-accept hidden threading (asserts the set_capture_row + consume_captured_hidden invariant the `74f708a` fix added). Mock-based, no GGUF required.

Scope

  • IMtpModule + INativeMtp + IExternalDrafterMtp mixins
  • Qwen35MtpModule (INativeMtp impl) + GGUF loader + step graph
  • MtpChainRunner + MtpOrchestrator
  • dflash_server: --mtp-gguf, --mtp-gamma, --mtp-draft-source, --mtp-draft-topk

Not in this PR (follow-ups)

  • Gemma4 ExternalDrafter MTP module (uses the IExternalDrafterMtp interface declared here)
  • Per-request DFlash/MTP/auto dispatcher (commit `3b65ac1` on `feat/mtp-prefix-warm-ghost`)
  • IDrafter abstraction unifying DFlash + MTP under one runner
  • Tree-mode runner + entropy-adaptive selector
  • Prefix-cache WARM hit for MTP

Copy link
Copy Markdown
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 issues found across 34 files

Reply with feedback, questions, or to request a fix.

Re-trigger cubic

Comment thread dflash/src/common/mtp_chain_runner.cpp Outdated
Comment thread dflash/src/common/mtp_orchestrator.cpp
@dusterbloom
Copy link
Copy Markdown
Collaborator Author

Addressed cubic review in commit 74f708a:

  • F1 (P2): add missing io.emit(-1) on chain-runner failure path in warm_and_decode (mtp_orchestrator.cpp:152)
  • F2 (P1): call set_capture_row(accept_n) + consume_captured_hidden() before threading running_hidden on partial accept for ExternalDrafter flavor; NativeHeads path is unaffected (next_hidden is always empty for that flavor, new branch not entered)

Build clean. All 19 unit tests pass. T7 (partial_accept_rollback) exercises the NativeHeads rollback path only — the ExternalDrafter hidden-sync invariant is not covered by the existing suite (noted for follow-up).

E2E HumanEval-10 AR-vs-MTP same-text rate: 8/10 post-fix (unchanged from 8/10 pre-fix). F2 is confined to ExternalDrafter and does not affect the Qwen3.6/NativeHeads path tested here. he_01/he_02 divergences remain — consistent with prior AR-vs-AR control showing GPU non-determinism on those two prompts.

@dusterbloom
Copy link
Copy Markdown
Collaborator Author

Coverage gap closed in bf2d3e9: T20 now exercises the ExternalDrafter partial-accept path that the 74f708a fix addressed (T7 used a NativeHeads stub, so it never reached the fixed branch). 20/20 unit tests pass.

@dusterbloom dusterbloom marked this pull request as draft May 20, 2026 19:36
@dusterbloom
Copy link
Copy Markdown
Collaborator Author

Two more fixes from real-workload hermes testing (commit ada7bb0):

A. Silent server death on long-prompt MTP: DFLASH27B_MTP_CTX defaulted to 8192 but the server never propagated --max-ctx to that env var. With --max-ctx 65536, the backbone KV was allocated at 65536 slots while the MTP head_kv stayed at 8192 — VRAM mismatch caused OOM on the next prefill. (The warm_head_kv overflow guard already returned false cleanly — the bug was upstream at init time.) Fix: 5-line propagation in server_main.cpp using setenv(..., overwrite=0) so user env still wins.

B. --draft + --mtp-gguf conflict: server now logs a WARNING and clears draft_path when both are set (MTP wins).

Verified: 25K-token prompt → server stays alive, MTP ran (iters=16 proposed=32 accepted=16 accept_rate=0.50), real response returned. Conflict invocation prints the warning and starts cleanly with draft = (none) in the config banner.

@dusterbloom dusterbloom marked this pull request as ready for review May 20, 2026 19:53
Copy link
Copy Markdown
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

6 issues found across 34 files

Reply with feedback, questions, or to request a fix.

Re-trigger cubic

Comment thread dflash/src/common/gguf_mmap.h
Comment thread dflash/src/common/gguf_mmap.h
Comment thread dflash/src/qwen35/qwen35_mtp.cpp Outdated
Comment thread dflash/test/test_dflash.cpp
Comment thread dflash/test/test_common_mtp_orchestrator.cpp
Comment thread dflash/test/test_common_mtp_orchestrator.cpp Outdated
@dusterbloom
Copy link
Copy Markdown
Collaborator Author

Addressed cubic re-review in commit f6e8e94:

P1:

  • gguf_mmap.h open() now idempotent (releases prior mapping first via placement-destroy+reconstruct; leaves object in default state on any failure path)
  • gguf_mmap.h Windows release() now calls CloseHandle() before clearing handle_ (fix by inspection — Linux build cannot exercise the Windows path)
  • qwen35_mtp.cpp GPU warmup off-by-one (> -> >=): a prompt of exactly n_ctx tokens would have written slot n_ctx (out-of-bounds; valid range is [0, n_ctx-1] with slot_start=1). Now declines MTP cleanly and chain runner falls through to AR.

P2:

  • test_dflash.cpp missing hidden_seq capture now clears all_prefill_hidden and breaks out of the prefill loop instead of running warm_head_kv on an undefined buffer
  • T7 now asserts restore_kv_at_chain_calls >= 1 — rollback path is confirmed to fire on partial accept

P3:

  • T6 EOS termination assertion tightened from total_emitted <= 10 to total_emitted == 1 (EOS is the first emitted token; runner must stop immediately)

Build clean, 20/20 unit tests pass. HumanEval-10 re-bench: 8/10 byte-identical.

Comment thread dflash/src/common/backend_factory.h Outdated

// MTP (Multi-Token Prediction) speculator — mutually exclusive with --draft.
// When mtp_gguf_path is set, the backend ignores draft_path.
const char * mtp_gguf_path = nullptr;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we use https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF which contains MTP in the main model?

Comment thread dflash/src/common/backend_factory.h Outdated
// When mtp_gguf_path is set, the backend ignores draft_path.
const char * mtp_gguf_path = nullptr;
int mtp_gamma = 0; // 0 = MTP loaded but not active; >0 = chain depth
const char * mtp_draft_source = nullptr; // "chain" (default) | "mtp_topk"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not a enum?

@@ -0,0 +1,219 @@
// common/gguf_mmap.h — RAII wrapper for platform-conditional mmap of GGUF files.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this worth a separate PR

// supports_mtp() returns true. Default returns nullptr.
// Forward-declared to avoid a header dependency from model_backend.h
// on mtp_interface.h — backends include both as needed.
virtual mtp::IMtpModule * mtp() { return nullptr; }
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not return nullptr indicate this model doesn't support mtp?

// matching mixin (IExternalDrafterMtp / INativeMtp).
enum class MtpFlavor {
ExternalDrafter, // Gemma4-style: separate drafter, h_prev chain
NativeHeads, // Qwen3.6-style: MTP heads in the backbone
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should hide into ModelBackend.

const int n = std::min(prefill_ubatch, prompt_len - start);
std::vector<int32_t> chunk(req.prompt.begin() + start,
req.prompt.begin() + start + n);
if (!target->verify_batch(chunk, start, last_tok, nullptr)) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mtp should only change how we generate the batch (v.s. draft), can we normalize the code with the existing logic?

@@ -0,0 +1,133 @@
// qwen35_mtp_graph.h — CUDA cgraph for Qwen3.6 MTP head step forward.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should use the existing graph API not a new API unless we want to put MTP header into another ggml backend.

@@ -0,0 +1,225 @@
// qwen35_mtp_loader.cpp — Discovery loader for Qwen3.6 -MTP-GGUF files.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check the previous comment about GGUF loading.

@weicj
Copy link
Copy Markdown
Collaborator

weicj commented May 21, 2026

I’m working on a shared hardware/backend/runtime placement foundation for the cpp native server path. #236

Could we align the MTP backend/device selection with that shared placement layer, instead of keeping a separate MTP-specific selection path?

This would avoid duplicated CUDA/HIP/device handling across MTP,DFlash, PFlash, and future speculative paths.

dusterbloom added a commit to dusterbloom/lucebox-hub that referenced this pull request May 21, 2026
Mechanical rename per weicj's review on PR Luce-Org#237 — the legacy namespace
name baked the first backend (Qwen3-27B) into shared code. Renaming to
dflash::common removes the backend leak from the substrate so future
backends plug into a neutral namespace.

Scope:
- namespace dflash27b → namespace dflash::common
- dflash27b::* → dflash::common::*
- CMake static lib dflash27b → dflash_common
- CMake project(dflash27b) → project(dflash)
- Private CMake vars _dflash27b_* → _dflash_*
- Stale comment references

Out of scope (deferred to a follow-up):
- Public C header dflash/include/dflash27b.h
- C symbol dflash27b_last_error()
- Preprocessor macros DFLASH27B_*
- Env vars DFLASH27B_*

Build note: CUDA 12.6 + GCC 13.3 has a known _Float128 conflict during
CUDA host-compiler ID detection. Workaround:
  -DCMAKE_CUDA_HOST_COMPILER=/usr/bin/g++-11

No behavior change. Clean build green. Symbol mangling confirmed as
N6dflash6common* via nm; zero residual dflash27b symbols outside the
deferred public C ABI.
dusterbloom added a commit to dusterbloom/lucebox-hub that referenced this pull request May 21, 2026
Mechanical rename per @weicj's review on Luce-Org#237. Renames the legacy
backend-baked namespace to a neutral one so future backends plug into
shared code without name leakage.

Scope:
- namespace dflash27b → namespace dflash::common (sources + tests)
- dflash27b::* → dflash::common::*
- CMake static lib dflash27b → dflash_common
- CMake project(dflash27b) → project(dflash)
- Private CMake vars _dflash27b_* → _dflash_*
- Stale comment references

Out of scope (deferred to a follow-up):
- Public C header dflash/include/dflash27b.h
- C symbol dflash27b_last_error()
- Preprocessor macros DFLASH27B_*
- Env vars DFLASH27B_*

Build note: CUDA 12.6 + GCC 13.3 has a known _Float128 conflict during
CUDA host-compiler ID detection. Workaround:
  -DCMAKE_CUDA_HOST_COMPILER=/usr/bin/g++-11

No behavior change. Symbol mangling confirmed via nm as N6dflash6common*;
no residual dflash27b mangled symbols outside the deferred public C ABI.
Copy link
Copy Markdown
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 8 files (changes from recent commits).

Tip: Review your code locally with the cubic CLI to iterate faster.

Re-trigger cubic

Comment thread dflash/src/server/server_main.cpp Outdated
dusterbloom added a commit to dusterbloom/lucebox-hub that referenced this pull request May 21, 2026
Per @howard0su's review on Luce-Org#237: gguf_mmap.h/.cpp are platform
abstraction (POSIX mmap / Windows MapViewOfFile) that will be reused
by multiple loaders. Extracting into a standalone PR ahead of the
loader refactor (target/draft/MTP heads all benefit).

Includes the cubic P1 fixes from f6e8e94:
- open() is now idempotent (releases prior mapping before re-opening,
  leaves object in default state on any failure path)
- Windows release() now calls CloseHandle() before clearing handle_

New: test_gguf_mmap unit test covers open/close, idempotency,
missing-file path, RAII destructor.

Stacked on Luce-Org#241 (uses dflash::common namespace from the start).
dusterbloom added a commit to dusterbloom/lucebox-hub that referenced this pull request May 21, 2026
Per cubic P1 on PR Luce-Org#237: --mtp-source=none was silently overridden by
the legacy-flag inference because MtpSource::None served both as the
default and as the explicit 'none' value.

Adds an internal MtpSource::Unset sentinel as the new default.
Legacy-flag inference only fires when the field is still Unset. After
inference (or if no legacy flag matched), Unset is resolved to None
before any backend code sees it.

User-visible CLI surface unchanged: --mtp-source still accepts exactly
{none, native, external, auto}. Unset is internal-only and never
escapes arg parsing. Defensive assert in create_backend() enforces this.
dusterbloom added a commit to dusterbloom/lucebox-hub that referenced this pull request May 21, 2026
Per @howard0su's review on Luce-Org#237 (lines 57, 59):
- 57: 'shall we use the unsloth single-file MTP-in-target GGUF?'
- 59: 'why not a enum?'

Replaces:
  const char * mtp_gguf_path = nullptr;
  const char * mtp_draft_source = nullptr;  // "chain" | "mtp_topk"
with:
  enum class MtpSource { None, Native, ExternalDrafter, Auto };
  MtpSource    mtp_source     = MtpSource::None;
  const char * mtp_gguf_path  = nullptr;  // only for ExternalDrafter
  bool         mtp_use_topk   = false;    // false=chain, true=mtp_topk

Adds gguf_contains_mtp_tensors() probe (keyed on qwen35.nextn_predict_layers
metadata) so --mtp-gguf becomes optional when the primary GGUF embeds
MTP tensors (unsloth single-file case).

Stacked on Luce-Org#237. dflash_server arg parsing updated to:
- --mtp-source [none|native|external|auto] (new explicit flag)
- --mtp-gguf PATH (now optional; only needed for ExternalDrafter)
- Old --mtp-draft-source string flag warns + ignored (migration aid)
- --mtp-gamma alone triggers Auto detection

All test_common_mtp_orchestrator tests still pass (mock-based,
unaffected by the config-surface change).
dusterbloom added a commit to dusterbloom/lucebox-hub that referenced this pull request May 21, 2026
Per cubic P1 on PR Luce-Org#237: --mtp-source=none was silently overridden by
the legacy-flag inference because MtpSource::None served both as the
default and as the explicit 'none' value.

Adds an internal MtpSource::Unset sentinel as the new default.
Legacy-flag inference only fires when the field is still Unset. After
inference (or if no legacy flag matched), Unset is resolved to None
before any backend code sees it.

User-visible CLI surface unchanged: --mtp-source still accepts exactly
{none, native, external, auto}. Unset is internal-only and never
escapes arg parsing. Defensive assert in create_backend() enforces this.
@dusterbloom dusterbloom force-pushed the feat/dflash-mtp-foundation branch from 1051e71 to 8ec9a95 Compare May 21, 2026 20:10
dusterbloom added a commit to dusterbloom/lucebox-hub that referenced this pull request May 21, 2026
- server_main.cpp: gate --draft suppression and DFLASH27B_MTP_CTX on
  concrete MtpSource (Native|ExternalDrafter) instead of !None. Auto
  defers to backend GGUF probe; preserves --draft as fallback when
  Auto resolves to None.

- mtp_orchestrator.cpp: wrap io with req.on_token via
  DaemonIO::with_token_callback (mirrors laguna_backend.cpp:151 and
  gemma4_backend.cpp:172). MTP requests now get streaming disconnect
  cancellation and per-token callbacks.

20/20 MTP test suite still passes.
@davide221 davide221 requested a review from howard0su May 22, 2026 09:07
Ports the Qwen3.6 MTP head onto the qwen35 backbone (same arch, NextN
block at layer n_layer-1). Speculation runs through a new common chain
runner; the existing DFlashTarget adapter handles verify/snapshot/restore.

- common/mtp_interface.h: flavor-tagged IMtpModule + INativeMtp /
  IExternalDrafterMtp mixins. Future Gemma4 drafter plugs in via
  IExternalDrafterMtp without touching the chain runner.
- common/mtp_chain_runner.{h,cpp}: γ-chain propose/verify/accept loop,
  hoisted out of the backend. Three KV-reconciliation paths
  (accept-all / fast rollback / recommit) share a single post-iter
  invariant so AR equivalence holds under recommit.
- common/mtp_orchestrator.{h,cpp}: chunked prefill + warm + dispatch
  to chain runner. Owns only control flow; all compute lives in
  DFlashTarget::verify_batch and INativeMtp::step_batch graphs on the
  backend device.
- qwen36/qwen36_mtp.{h,cpp,_graph.cpp,_loader.cpp}: GGUF tensor
  inventory for Qwen3.6 -MTP-GGUF, GPU warm graph, GPU step graph
  cached on (head_idx, fa_window, fused_lm_head, topk_k). γ is bound
  at attach time as the single source of truth.
- qwen35: supports_mtp()/mtp() exposed through ModelBackend;
  generate() delegates to common::mtp::warm_and_decode when MTP is
  configured. Cache sized for max(γ+1, ddtree_budget+1) verify tokens.
- server.py: --mtp-gguf and --mtp-gamma flags routed through; daemon
  command surface unchanged.

Tests: 4/4 test_common_mtp_orchestrator. Full build green; harness probe
7/7 (claude_code, codex, opencode, openwebui, pi, hermes, openclaw) at
--max-ctx 65536; MTP decode reports accept_rate 0.43-0.88 on short
agentic prompts.
… paths

Adds unit tests exercising MtpChainRunner state machine (gamma propagation,
EOS termination, partial-accept rollback, n_gen termination), MtpOrchestrator
lifecycle (reset_chain ordering, set_initial_hidden plumbing, gamma
derivation), and Qwen36MtpModule input-validation guards (attach(nullptr),
out-of-range gamma, pre-attach calls, shutdown idempotency).

All new tests are mock-based and require no GGUF.
Both qwen35_backend and the test_dflash harness called the 4-arg overload of
Qwen36MtpModule::init() with backend.tensor_context() — the BACKBONE ggml_ctx.
That ctx contains only blk.0..63; MTP head tensors live in blk.64.* in the
separate MTP GGUF, so the loader could never resolve them and init failed
silently before chain decode.

Switched both call sites to the 3-arg overload init(gguf_path, target, err)
which calls load_gguf_tensor_context(mtp_gguf_path) internally to build the
correct context for tensor lookup.

Caught by E2E sweep on real Qwen3.6-27B Q4_K_M + MTP-Q4_K_M GGUFs.
… is qwen35)

The GGUF general.architecture field on both 'Qwen3.5-0.8B-Q4_K_M.gguf' and
'Qwen3.6-27B-MTP-Q4_K_M.gguf' is 'qwen35'. The 'Qwen3.6' label lives only in
general.name (model display name). The MTP module is a NextN extension of the
qwen35 architecture, not a separate arch.

Aligns naming with Qwen35Backend (the backbone) and the loader's arch check
in gguf_target_loader.cpp:271. Pure rename — no functional change.

Files renamed (git mv, blame preserved):
  src/qwen36/qwen36_mtp.{h,cpp}         → src/qwen35/qwen35_mtp.{h,cpp}
  src/qwen36/qwen36_mtp_graph.{h,cpp}   → src/qwen35/qwen35_mtp_graph.{h,cpp}
  src/qwen36/qwen36_mtp_loader.cpp      → src/qwen35/qwen35_mtp_loader.cpp
  dflash/src/qwen36/ directory removed

E2E re-verified on Qwen3.6-27B Q4_K_M + MTP-Q4_K_M (RTX 3090):
  AR (γ=0):   35.6 tok/s
  Chain γ=2: 57.8 tok/s = 1.62× over AR  (accept_rate=0.83)
  Chain γ=4: 50.9 tok/s = 1.43× over AR  (accept_rate=0.58)
Greedy token output byte-identical across all three modes.
Mechanical fixes from style review:
- qwen35_mtp_loader.cpp: rewrite header from stale 'PR 2 skeleton'
  narrative to current behavior
- mtp_chain_runner.cpp: remove dead if-block at the post-propose
  size check (comment-only body)
- Qwen35DaemonArgs::n renamed to Qwen35DaemonArgs::mtp_draft_source
  to match Qwen35Config::mtp_draft_source
- mtp_orchestrator namespace: drop unused 'common' layer
  (dflash27b::common::mtp -> dflash27b::mtp)
- mtp_orchestrator: dynamic_cast -> static_cast (flavor() already
  discriminates the concrete type)
- qwen35_mtp.cpp: drop redundant 'static' on TU-local helpers inside
  anonymous namespace
- qwen35_mtp.{h,cpp}: convert Phase-A/B roadmap comments to TODOs or
  remove (completed phases)
- qwen35_backend.h: label ensure_decode_cache/tensor_context as
  test-only (they remain public for harness access)
- dflash_target.h: forward-declare DDTree; move ddtree.h include to
  .cpp files that use it concretely
- mtp_chain_runner.h: clean stale PR-internal cross-reference
- qwen35_backend.h: terminology consistency (MTP module)
- mtp_orchestrator.cpp: comment that enable_hidden_seq_capture is
  no-op for non-MTP targets
- test_common_mtp_orchestrator.cpp: remove PR-review narrative
- qwen35_backend.cpp: add fflush after [mtp] loaded printf
- No functional change. E2E sweep re-verified greedy byte-identical
  output across AR / chain γ=2 / chain γ=4.
Adds --mtp-gguf, --mtp-gamma, --mtp-draft-source, --mtp-draft-topk to
the native C++ HTTP server (dflash/src/server/dflash_server). Threads
them through BackendArgs into Qwen35Config so the existing Qwen35Backend
MTP path (already wired) is reachable through the production HTTP
endpoint, not just the test_dflash harness.

Mirror of the existing flags in dflash/scripts/server.py.

Verified via HTTP /v1/chat/completions: AR baseline and chain γ=2
produce greedy-identical text under temperature=0 on Qwen3.6-27B
Q4_K_M + MTP-Q4_K_M. Also confirmed with unsloth single-file combined
GGUF (Qwen3.6-27B-Q4_K_M-mtp.gguf) passed as --mtp-gguf — identical
output, greedy equivalent.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 29, 2026
Record the latest unattended PR probe pass, including fresh direct-merge conflict checks for the remaining non-integrated PRs and a usable tmux/Codex feasibility report for PR Luce-Org#237. No code changes.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 29, 2026
Record the 2026-05-29 19:36 cron preflight, fresh direct-merge probes for the remaining non-integrated PRs, and a tmux/Codex feasibility report for PR Luce-Org#237. No code changes were made.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 30, 2026
Record the 2026-05-29 20:14 unattended integration pass, including fresh conflict probes for unresolved old-layout PRs and the latest tmux Claude/Codex delegation results for PR Luce-Org#237.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 30, 2026
Record the 2026-05-29 21:56 cron refresh, fresh direct-merge probes for unresolved old-layout PRs, and the latest tmux-driven Luce-Org#237 Claude/Codex delegation outcome. No code changes are included.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 30, 2026
Record the 2026-05-29 22:17 unattended integration probe pass. Note current included PR heads, retained conflicted worktrees, and the latest tmux-driven Luce-Org#237 salvage delegation results.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 30, 2026
Record 2026-05-30 01:26 cron probe results, including fresh Codex review for PR Luce-Org#237 and retained probe artifact paths.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 30, 2026
Record the 2026-05-30 02:03 UTC-4 unattended run, fresh direct-merge probes for remaining old-layout PRs, and a tmux-driven Codex feasibility review for PR Luce-Org#237.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 30, 2026
Record the 2026-05-30 03:30 unattended reconciliation pass, including fresh conflicting probe worktrees, Codex Luce-Org#237 salvage review output, and validation results.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 30, 2026
Record the May 30 04:31 unattended reconciliation, refreshed conflict probes, and a read-only Codex assessment for PR Luce-Org#237's current-layout selective port path.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 30, 2026
Record the 2026-05-30 04:49 unattended reconciliation, fresh conflict probes, and renewed Luce-Org#237 Claude/Codex assessments. No stack code changes were required because current auto-integration already contains the mergeable non-draft PR heads.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 30, 2026
Reconfirm open PR containment against origin/main and easel/auto-integration. Record fresh worktree probes for non-contained PRs plus read-only Luce-Org#237 Claude/Codex delegation attempts.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 30, 2026
Record the 2026-05-30 08:35 unattended integration pass, including fresh direct-merge probes for non-contained PRs and the read-only Luce-Org#237 Claude/Codex delegation results.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 30, 2026
Record the 2026-05-30 09:14 cron reconciliation pass, fresh conflict probes for the remaining non-contained PRs, and the Luce-Org#237 Claude/Codex delegation results.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 30, 2026
Record the 2026-05-30 11:16 EDT unattended integration pass: refreshed PR-head containment, fresh conflicted probe worktrees, and Codex feasibility reports for the Luce-Org#305 and Luce-Org#237 selective-port candidates. No product-code stack changes were made.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 30, 2026
Record the 2026-05-30 13:21 cron probe results, exact PR-head containment, retained worktrees, and the Codex Luce-Org#237 selective-port recommendation.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 30, 2026
Record the 2026-05-30 15:09 cron pass, repeated conflict probes for remaining non-integrated PRs, and the tmux-delegated Luce-Org#237 selective-port assessment.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 30, 2026
Record the 2026-05-30 15:57 cron pass, refreshed PR-head containment, fresh conflict probes for the remaining non-integrated PRs, and the tmux-driven Codex Luce-Org#237 feasibility report.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 30, 2026
Record the 2026-05-30 17:00 EDT unattended reconciliation pass, including refreshed PR-head containment, conflict probes, and the bounded Claude rerun outcome for PR Luce-Org#237.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 30, 2026
Record the 2026-05-30 17:38 unattended refresh, exact PR-head containment, repeated conflicted probe results, and the tmux-driven Luce-Org#237 selective-port feasibility audit.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 30, 2026
Record the latest unattended PR enumeration, direct-merge probe results, and the tmux-driven Luce-Org#237 feasibility audit. No product code changed in this refresh.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 31, 2026
Rerun direct merge probes for remaining non-integrated PRs and record the tmux-driven Codex audit for Luce-Org#237/Luce-Org#305. No product-code changes were integrated.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 31, 2026
Record the 2026-05-30 23:50 cron pass, fresh conflict probes, and the tmux-driven Luce-Org#237 audit results.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 31, 2026
Record the 2026-05-31 00:29 EDT unattended reconciliation run, current open PR containment, repeated conflict probes, and the Luce-Org#237 delegated audit outcomes.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 31, 2026
Record the 2026-05-31 02:27 UTC-4 unattended run, fresh conflict probes, and the Codex-derived Luce-Org#237 selective MTP salvage plan.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 31, 2026
Record 2026-05-31 03:14 cron preflight, current PR containment, fresh conflict probes for the remaining non-ancestor PRs, and the Luce-Org#237 Codex salvage audit.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 31, 2026
Record the 2026-05-31 03:55 EDT unattended reconciliation run, including current PR-head containment, repeated conflict probes for the six remaining non-ancestor PRs, and the fresh Luce-Org#237/Luce-Org#135 delegated audit attempts.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 31, 2026
Record 2026-05-31 04:10 cron reconciliation: no new PR heads, fresh conflict probes for Luce-Org#305/Luce-Org#237/Luce-Org#221/Luce-Org#154/Luce-Org#153/Luce-Org#135, and failed read-only delegation attempts for Luce-Org#237.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 31, 2026
Record the 2026-05-31 cron preflight, exact PR containment check, repeated conflict probes, and the new Codex feasibility summary for PR Luce-Org#237's current-layout MTP port.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 31, 2026
Port the safe current-layout common MTP interface, chain runner, and orchestrator slice from PR Luce-Org#237 without Qwen-specific runtime wiring. Add a common orchestrator regression test and CMake wiring.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 31, 2026
Record the Luce-Org#237 common MTP foundation selective port, validation, remaining PR classification, and retained worktrees from the unattended integration run.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants