uv run --directory dflash python scripts/run.py --prompt "def fibonacci(n):"
[run] prompt 14 tokens, streaming up to 256 tokens, max_ctx=512
[cfg] seq_verify=0 fast_rollback=1 ddtree=1 budget=22 temp=1.00 chain_seed=1 fa_window=2048 draft_swa=0 draft_ctx_max=4096 draft_feature_mirror=0 peer_access=0 target_gpu=0 draft_gpu=0
[loader] eos_id=248046 eos_chat_id=-1
[target] target loaded: layers [0,64) output=1, 850 tensors on GPU 14.99 GiB, tok_embd 682 MiB CPU-only (q4_K)
[draft] loaded
[prompt] 14 tokens
[prefill] token-seg ubatch=256
[prefill] 14 tokens in 0.10 s, last_tok=0
[migrate] 242.89 ms
[dbg sib step 0] N=23 accept=1 walked_sib=0
walk: 0
[step 0] committed=14 last_tok=0 tree_N=23 accept=1 next=-1
[timing] per-step averages over 0 steps (ms):
  draft_build        0.27
  draft_copyfeat     0.07
  draft_set          0.02
  draft_compute     10.16
  draft_bridge       0.01
  draft_logits       9.47
  snapshot_ssm       0.00
  verify_build       1.68
  verify_set         0.25
  verify_compute    93.74
  verify_logits      0.00
  accept             0.04
  restore_ssm        0.00
  replay_build       0.00
  replay_set         0.00
  replay_compute     0.00
  replay_logits      0.00
  mirror_sync        0.00
  ----- sum        115.71
[dflash] generated 0 tokens in 0.116 s -> 0.00 tok/s
[dflash] 0 draft steps, accepted=0/0 (0.0% per step), avg commit/step=0.00
[dflash] output tail: 248045 846 198 727 73111 1393 1590 248046 198 248045 74455 198 248068 198 0
!ggml_cuda_init: found 1 CUDA devices (Total VRAM: 32501 MiB):
Device 0: Tesla V100-PCIE-32GB, compute capability 7.0, VMM: yes, VRAM: 32501 MiB
[run] generated 1 tokens
uv run --directory dflash python scripts/bench_llm.py
[bench] target = /media/per/work/tmp/lucebox-hub/dflash/models/Qwen3.6-27B-Q4_K_M.gguf
[bench] draft = /media/per/work/tmp/lucebox-hub/dflash/models/draft/dflash-draft-3.6-q8_0.gguf
[bench] ar bin = /media/per/work/tmp/lucebox-hub/dflash/build/test_generate
[bench] df bin = /media/per/work/tmp/lucebox-hub/dflash/build/test_dflash
[bench] tokenizer = Qwen/Qwen3.5-27B
[bench] budget = 22
[bench] ==== HumanEval (n=10, n_gen=256) ====
[01/10] n_tok= 92 AR= 30.00 DFlash= 0.00 AL= 0.00