feat(gemma4): wire prefill/decode timing into GenerateResult #287
Merged
… qwen35 3b80fa8)

Gemma4Backend's generate() and restore_and_generate() never populated result.prefill_s or result.decode_s, so usage.timings.{prefill_ms, decode_ms, decode_tokens_per_sec} surfaced as 0.0 for every gemma4 request. Bench tooling that aggregates per-case decode rates was silently falling back to wall-time math, conflating prefill + HTTP overhead with decode.

Add steady_clock measurements around the do_prefill call and around the entire decode block (both spec-decode and AR fallback paths) in both entry points. Same shape as qwen35's instrumentation from commit 3b80fa8.

Verified on bragi (RTX 5090 Laptop) with gemma-4-31b-it Q4_K_M:

  prompt=30 comp=400 wall=19.66s
  timings: prefill_ms=117.4 decode_ms=19534.5 decode_tps=20.5

(The earlier "2.5 tok/s" number for 31b came from a request queued behind an in-flight sweep, so wall-based math gave a wrong rate. The server's ar_decode log line already showed ~19 tok/s consistently; this commit makes the same number flow through to the API response.)
No issues found across 1 file
Mirrors qwen35_backend.cpp commit 3b80fa8. Before this, every gemma4 /v1/chat/completions response reported usage.timings.prefill_ms = 0 and decode_tokens_per_sec = 0, so benchmarks couldn't measure prefill/decode wall time on gemma4 targets. qwen35_backend.cpp got this instrumentation when 3b80fa8 landed; gemma4 was missed in that pass.
What's in this PR
server/src/gemma4/gemma4_backend.cpp (+18 lines):
- generate(): wrap do_prefill in steady_clock and stamp result.prefill_s. Same for the decode loop → result.decode_s.
- restore_and_generate(): same for the delta-prefill + decode path.

http_server.cpp already reads result.prefill_s / result.decode_s via the GenTimings struct and surfaces them in usage.timings; no server-side changes needed.
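The instrumentation pattern described above can be sketched roughly as follows. This is an illustrative sketch, not the code from gemma4_backend.cpp: the GenerateResult struct here models only the fields mentioned in this PR plus a hypothetical n_decoded counter, and timed_seconds and generate_timed are hypothetical helpers that take the prefill and decode steps as callables.

```cpp
#include <chrono>
#include <functional>

// Hypothetical stand-in for the real GenerateResult; only the two timing
// fields named in this PR are modeled, plus an illustrative token count.
struct GenerateResult {
    double prefill_s = 0.0;
    double decode_s = 0.0;
    int n_decoded = 0;
};

// Run fn and return its elapsed time in seconds. steady_clock is monotonic,
// so the measurement is unaffected by system clock adjustments.
template <typename Fn>
double timed_seconds(Fn&& fn) {
    const auto t0 = std::chrono::steady_clock::now();
    fn();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

// Hypothetical generate(): time the prefill call and the entire decode
// block separately, then stamp both durations on the result.
GenerateResult generate_timed(const std::function<void()>& do_prefill,
                              const std::function<int()>& decode_all) {
    GenerateResult result;
    result.prefill_s = timed_seconds(do_prefill);
    // One timer covers the whole decode block, whichever path
    // (spec-decode or AR fallback) decode_all takes internally.
    result.decode_s = timed_seconds([&] { result.n_decoded = decode_all(); });
    return result;
}
```

Timing the decode block as a whole, rather than each path separately, keeps a single decode_s value regardless of which decode path a request ends up on.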
Validation
Sindri gemma-4-26b ds4-eval bench (PR #285 territory) now reports non-zero decode_tokens_per_sec per case (typical: 30–80 tok/s with spec-decode, 12–20 tok/s AR-only).