feat(gemma4): wire prefill/decode timing into GenerateResult #287

Merged
davide221 merged 1 commit into Luce-Org:main from easel:feat/gemma4-timings on May 27, 2026

@easel (Collaborator) commented May 27, 2026

Mirrors qwen35_backend.cpp commit 3b80fa8.

Before this, every gemma4 /v1/chat/completions response reported
usage.timings.prefill_ms = 0 and decode_tokens_per_sec = 0, so
benchmarks couldn't measure prefill/decode wall time on gemma4 targets.
qwen35_backend.cpp got this instrumentation when 3b80fa8 landed;
gemma4 was missed in that pass.

What's in this PR

  • server/src/gemma4/gemma4_backend.cpp (+18 lines):
    • generate(): wrap do_prefill in steady_clock and stamp
      result.prefill_s. Same for the decode loop → result.decode_s.
    • restore_and_generate(): same for the delta-prefill + decode path.
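
The instrumentation pattern described above can be sketched as follows. This is a minimal illustration, not the code from gemma4_backend.cpp: the `GenerateResult` struct here models only the two timing fields named in this PR, and `timed_seconds` and `generate_sketch` are hypothetical helpers, with sleeps standing in for `do_prefill` and the decode loop.

```cpp
#include <chrono>
#include <thread>
#include <utility>

// Hypothetical stand-in for the backend's result type; only the two
// timing fields named in this PR are modeled here.
struct GenerateResult {
    double prefill_s = 0.0;
    double decode_s  = 0.0;
};

// Run `fn` and return how long it took, in seconds.
// steady_clock is monotonic, so the measured interval is not affected
// by system clock adjustments that happen during a request.
template <typename Fn>
double timed_seconds(Fn&& fn) {
    const auto t0 = std::chrono::steady_clock::now();
    std::forward<Fn>(fn)();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

// Hypothetical generate(): each phase is wrapped separately so that
// prefill time and decode time are reported independently.
GenerateResult generate_sketch() {
    GenerateResult result;
    result.prefill_s = timed_seconds([] {
        // placeholder for the do_prefill call
        std::this_thread::sleep_for(std::chrono::milliseconds(5));
    });
    result.decode_s = timed_seconds([] {
        // placeholder for the whole decode block (spec-decode or AR)
        std::this_thread::sleep_for(std::chrono::milliseconds(20));
    });
    return result;
}
```

Timing the entire decode block, rather than each step, keeps the overhead to two clock reads per phase and covers both decode paths with a single measurement.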

http_server.cpp already reads result.prefill_s / result.decode_s
via the GenTimings struct and surfaces them in usage.timings; no
server-side changes needed.

Validation

Sindri gemma-4-26b ds4-eval bench (PR #285 territory) now reports
non-zero decode_tokens_per_sec per case (typical: 30–80 tok/s with
spec-decode, 12–20 tok/s AR-only).

… qwen35 3b80fa8)

Gemma4Backend's generate() and restore_and_generate() never populated
result.prefill_s or result.decode_s, so usage.timings.{prefill_ms,
decode_ms, decode_tokens_per_sec} surfaced as 0.0 for every gemma4
request. Bench tooling that aggregates per-case decode rates was
silently falling back to wall-time math, conflating prefill + HTTP
overhead with decode.

Add steady_clock measurements around the do_prefill call and around
the entire decode block (both spec-decode and AR fallback paths) in
both entry points. Same shape as qwen35's instrumentation from
commit 3b80fa8.

Verified on bragi (RTX 5090 Laptop) with gemma-4-31b-it Q4_K_M:
  prompt=30 comp=400 wall=19.66s
  timings: prefill_ms=117.4 decode_ms=19534.5 decode_tps=20.5

(The earlier "2.5 tok/s" number for 31b was a request queued behind an
in-flight sweep — wall-based math gave a wrong rate. The server's
ar_decode log line already showed ~19 tok/s consistently; this commit
makes the same number flow through to the API response.)

@cubic-dev-ai (Bot, Contributor) left a comment

No issues found across 1 file


@howard0su (Contributor) left a comment

LGTM

@davide221 davide221 merged commit 0ed3526 into Luce-Org:main May 27, 2026
3 checks passed

3 participants