feat(gemma4): wire prefill/decode timing into GenerateResult by easel · Pull Request #287 · Luce-Org/lucebox-hub

easel · 2026-05-27T18:16:30Z

Mirrors qwen35_backend.cpp commit 3b80fa8.

Before this, every gemma4 /v1/chat/completions response reported
usage.timings.prefill_ms = 0 and decode_tokens_per_sec = 0, so
benchmarks couldn't measure prefill/decode wall time on gemma4 targets.
qwen35_backend.cpp got this instrumentation when 3b80fa8 landed;
gemma4 was missed in that pass.

What's in this PR

server/src/gemma4/gemma4_backend.cpp (+18 lines):
- generate(): wrap do_prefill in steady_clock and stamp
  result.prefill_s. Same for the decode loop → result.decode_s.
- restore_and_generate(): same for the delta-prefill + decode path.
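
The wrapping pattern described above can be sketched roughly as follows. This is a minimal illustration, not the actual diff: the `GenerateResult` struct here is a hypothetical stand-in that models only the two timing fields named in this PR, and `timed_seconds` / `generate_timed` are helper names invented for the example.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical stand-in for the backend's result struct; only the two
// timing fields mentioned in the PR description are modeled.
struct GenerateResult {
    double prefill_s = 0.0;
    double decode_s  = 0.0;
};

// Run `fn` once and return its wall time in seconds, measured with steady_clock.
template <typename Fn>
double timed_seconds(Fn&& fn) {
    const auto t0 = std::chrono::steady_clock::now();
    fn();
    const auto t1 = std::chrono::steady_clock::now();
    const double elapsed = std::chrono::duration<double>(t1 - t0).count();
    assert(elapsed >= 0.0);  // steady_clock is monotonic
    return elapsed;
}

// Sketch of a generate()-style entry point: time the prefill call and the
// whole decode block separately, then stamp both onto the result.
// `do_prefill` and `decode_loop` are caller-supplied callables (assumed).
template <typename Prefill, typename Decode>
GenerateResult generate_timed(Prefill&& do_prefill, Decode&& decode_loop) {
    GenerateResult result;
    result.prefill_s = timed_seconds(do_prefill);
    result.decode_s  = timed_seconds(decode_loop);
    return result;
}
```

Timing the entire decode block as one interval, rather than each step, keeps the overhead to two clock reads per phase and covers whichever decode path actually runs.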

http_server.cpp already reads result.prefill_s / result.decode_s
via the GenTimings struct and surfaces them in usage.timings; no
changes to the HTTP layer are needed.
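
To make the data flow concrete, here is a hedged sketch of how second-valued timings could map onto the `usage.timings` fields. The struct layouts and the `to_usage_timings` function are assumptions for illustration; the real `GenTimings` definition and the conversion in http_server.cpp may differ.

```cpp
#include <cassert>

// Hypothetical layout; the real GenTimings struct may carry more fields.
struct GenTimings {
    double prefill_s = 0.0;
    double decode_s  = 0.0;
};

// Mirrors the usage.timings field names quoted in this PR.
struct UsageTimings {
    double prefill_ms = 0.0;
    double decode_ms = 0.0;
    double decode_tokens_per_sec = 0.0;
};

// Convert seconds to milliseconds and derive the decode rate.
// A zero decode_s yields a rate of 0 instead of dividing by zero, which is
// also what an uninstrumented backend would end up reporting.
UsageTimings to_usage_timings(const GenTimings& t, int completion_tokens) {
    assert(completion_tokens >= 0);
    UsageTimings u;
    u.prefill_ms = t.prefill_s * 1000.0;
    u.decode_ms  = t.decode_s * 1000.0;
    u.decode_tokens_per_sec =
        t.decode_s > 0.0 ? completion_tokens / t.decode_s : 0.0;
    return u;
}
```

Under this sketch, a backend that leaves both fields at their zero defaults produces all-zero timings, which matches the symptom described above.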

Validation

Sindri gemma-4-26b ds4-eval bench (PR #285 territory) now reports
non-zero decode_tokens_per_sec per case (typical: 30–80 tok/s with
spec-decode, 12–20 tok/s AR-only).

… qwen35 3b80fa8)

Gemma4Backend's generate() and restore_and_generate() never populated
result.prefill_s or result.decode_s, so usage.timings.{prefill_ms,
decode_ms, decode_tokens_per_sec} surfaced as 0.0 for every gemma4
request. Bench tooling that aggregates per-case decode rates was
silently falling back to wall-time math, conflating prefill + HTTP
overhead with decode.

Add steady_clock measurements around the do_prefill call and around
the entire decode block (both spec-decode and AR fallback paths) in
both entry points. Same shape as qwen35's instrumentation from commit
3b80fa8.

Verified on bragi (RTX 5090 Laptop) with gemma-4-31b-it Q4_K_M:

  prompt=30 comp=400 wall=19.66s
  timings: prefill_ms=117.4 decode_ms=19534.5 decode_tps=20.5

(The earlier "2.5 tok/s" number for 31b was a request queued behind an
in-flight sweep; wall-based math gave a wrong rate. The server's
ar_decode log line already showed ~19 tok/s consistently; this commit
makes the same number flow through to the API response.)
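
The difference between the two rate calculations can be sketched as below. The function names and the `queue_wait_s` parameter are illustrative assumptions, not taken from the bench tooling; only the 400-token, 19.66 s and 19534.5 ms figures come from the run reported above.

```cpp
#include <cassert>

// Tokens per second over an interval; returns 0 when the interval is not positive.
double tokens_per_sec(int tokens, double seconds) {
    assert(tokens >= 0);
    return seconds > 0.0 ? tokens / seconds : 0.0;
}

// Wall-based fallback: the denominator is the whole request duration, so it
// absorbs prefill, HTTP overhead, and any time spent queued behind other
// requests. `queue_wait_s` is a hypothetical parameter for illustration.
double wall_based_rate(int tokens, double wall_s, double queue_wait_s) {
    return tokens_per_sec(tokens, wall_s + queue_wait_s);
}
```

With the reported run, `tokens_per_sec(400, 19.5345)` comes out at about 20.5, while `wall_based_rate(400, 19.66, 0.0)` is slightly lower at about 20.3. Any non-zero queue wait pushes the wall-based figure down further, while the decode-only rate is unaffected.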

cubic-dev-ai

No issues found across 1 file

howard0su

LGTM

cubic-dev-ai Bot reviewed May 27, 2026

howard0su approved these changes May 27, 2026

davide221 merged commit 0ed3526 into Luce-Org:main May 27, 2026
3 checks passed

davide221 merged 1 commit into Luce-Org:main from easel:feat/gemma4-timings

3 participants
