Skip to content

Add speed profiler#431

Open
officialgr wants to merge 18 commits into
Luce-Org:mainfrom
officialgr:add-speed-profiler
Open

Add speed profiler#431
officialgr wants to merge 18 commits into
Luce-Org:mainfrom
officialgr:add-speed-profiler

Conversation

@officialgr

@officialgr officialgr commented Jun 20, 2026

Copy link
Copy Markdown

Summary

Adds a report-only speed profiling workflow for Lucebox inference changes.

Changes

  • Adds server/scripts/profile.py
  • Adds .github/workflows/speed-profile.yml
  • Adds profiler documentation at server/docs/speed-profile.md

Notes

  • Workflow is report-only / non-blocking.
  • Intended for the self-hosted RTX 3090 GPU runner.
  • Produces profile.json, profile.md, and nsys artifacts when available.

Review in cubic

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5 issues found across 3 files

Reply with feedback, questions, or to request a fix.

Re-trigger cubic

Comment thread server/scripts/profile.py Outdated
Comment thread .github/workflows/speed-profile.yml
Comment thread server/scripts/profile.py Outdated
Comment thread server/scripts/profile.py Outdated
Comment thread .github/workflows/speed-profile.yml Outdated
- Losslessness gate now aggregates across all prompts (was only
  checking the first). A gate that can silently miss divergences
  on prompts 2+ is worse than no gate.
- Temp dir changed to mkdtemp() per run (unique path, no shared
  /tmp/lucebox_profile). Low real-world risk since CI serializes
  via the concurrency group, but a clean fix regardless.
- nsys per-token metrics now normalise by actual generated token
  count rather than requested n_gen. Note: per-token figures
  remain directional until the NVTX decode-window refactor (the
  nsys pass still covers load + prefill + decode together).

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 1 file (changes from recent commits).

Reply with feedback, questions, or to request a fix.

Re-trigger cubic

Comment thread server/scripts/profile.py
- Added "Reset GPU clocks" step (nvidia-smi -rgc, if: always())
  so the lock set at job start never persists on the shared runner
  between PRs.
- Guarded `cat server/profile.md` with a file-existence check so
  the summary step doesn't error when the profiler run fails.
The unique per-run temp dir (added to fix the earlier race) was never
removed, so repeated CI runs would grow /tmp unbounded. Register an
atexit cleanup that runs on success or crash; the kept artifacts
(json/md/nsys-rep) are written outside the temp dir before cleanup.
@davide221

Copy link
Copy Markdown
Contributor

@officialgr thanks for the profiler work, can you fix the CI error?

The speed-profile job built the binaries but invoked profile.py with the
bare system python3, which lacks `transformers` (used to tokenize prompts).
Install it into an isolated venv so the shared runner's Python isn't
polluted and the full project env (torch, etc.) isn't dragged in for a
tokenizer-only dependency.

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 1 file (changes from recent commits).

Tip: Review your code locally with the cubic CLI to iterate faster.

Re-trigger cubic

Comment thread .github/workflows/speed-profile.yml Outdated
@officialgr

Copy link
Copy Markdown
Author

@davide221 Fixed — the job was running profile.py with the system python3, which doesn't have transformers (used for tokenization). It now installs it into an isolated venv before running.

@officialgr

Copy link
Copy Markdown
Author

The failure was caused by the speed-profile workflow looking for the wrong speculative draft model path. The workflow now defaults to the repo-documented Qwen3.6 draft GGUF path and skips cleanly with diagnostics if runner-local weights are missing.

@davide221

Copy link
Copy Markdown
Contributor

Nice progress, it now runs end-to-end on the 3090 and is already useful. Two things to look at in your own passing run before this is mergeable:

  1. The lossless gate reports FAIL (2/3 prompts diverge) but the job is still green. Is that a real correctness bug or too-strict a check?
  2. The nsys section prints 0.0 everywhere with an empty table

A couple of things would make this much more valuable: a baseline + threshold so it can actually flag a regression and a thought on whether we should also profile dflash_server (the shipping path), since test_dflash can't see the prefix cache / batching.

Updated parameters and CI details for speed profiling, including losslessness gate and local run instructions.

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 issues found across 3 files (changes from recent commits).

Tip: Review your code locally with the cubic CLI to iterate faster.

Re-trigger cubic

Comment thread .github/workflows/speed-profile.yml Outdated
Comment thread server/scripts/profile.py Outdated
Build the optional baseline arguments as a Bash array and expand them with "${baseline_arg[@]}" so baseline paths containing spaces or shell metacharacters are passed to the profiler as intended.
Treat missing kernel-level nsys stats as a profiling error even when other nsys sections parse successfully, preventing zero kernel metrics from being rendered as real data. Surface non-critical nsys section failures as Markdown warnings while preserving valid kernel metrics.

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 2 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="server/scripts/profile.py">

<violation number="1" location="server/scripts/profile.py:446">
P2: NSYS warnings are silently dropped from markdown when a kernel error exists because `render_md` only renders warnings inside the `else` branch, but `summarize_nsys` can populate both `error` and `warnings` simultaneously.</violation>
</file>

Tip: Review your code locally with the cubic CLI to iterate faster.

Re-trigger cubic

Comment thread server/scripts/profile.py
…d-profile-scripts

Improve nsys diagnostics and warnings; fix Actions baseline argument expansion
@officialgr

Copy link
Copy Markdown
Author

@davide221
Ran it end-to-end on the 3090 at the shipping defaults (Qwen3.6-27B, budget 22, n_gen 48).

1. Lossless — gate is no longer "too strict." Added an AR-vs-AR determinism control (reuses the existing AR passes, no extra GPU cost): a prompt only FAILs when AR is self-deterministic and spec-decode still differs; intrinsically-nondeterministic prompts become inconclusive. On my run it flags a divergence at token #44 (2/3 prompts). I checked whether it's a tail artifact by re-running at n_gen 96 — it stayed at #44, so it's reproducible/structural, not a length effect. I'm deliberately not over-calling it: it's ruled out as run-to-run noise, but it could still be batched-verify FP (verify scores draft tokens as a batch vs AR one-at-a-time). The definitive bug-vs-FP test is the logit gap at #44 (near-tie = FP, clear gap = bug) — the binaries don't emit per-token logits yet, so I'm flagging it for the engine team rather than claiming a bug.

2. nsys — fixed, now fully populated. Root cause was report-name drift across nsys versions (cuda_gpu_kern_sum ↔ legacy gpukernsum) plus a swallowed error. It now tries both name sets and, if stats still won't parse, prints the nsys error + version instead of silent zeros. On Qwen3.6: 568.8 ms GPU kernel time, 674.6 launches/token, top kernel mul_mat_q (187 ms / 2880 launches), 75.7 ms/token in host↔device copies.

3. Baseline + threshold — added. --baseline + --regress-pct (default 10%, wide enough for run-to-run variance), direction-aware (tok/s & AL up = good; ms/token & TTFT down = good). The report flags ⚠️ REGRESSION and CI emits a non-blocking warning. Verified the delta path on a self-vs-self run (✅ within threshold). Seeded server/scripts/speed-baseline.json from the green run.

4. dflash_server — recommendation (no code, since you asked for "a thought"). Keep the binaries as the per-PR signal (deterministic, low-noise, isolates compute). The shipping path's prefix-cache + batching is a different question (throughput under load), so I'd add it as a separate, higher-variance Phase 2 — replay a request trace at fixed concurrency; TTFT/p50/p99 + cache-hit rate — on a schedule/label, not every PR. Written up in docs/specs/speed-profile.md.

Final note on n_gen. n_gen=48 is good for CI speed, but it is too short to give the most representative throughput signal — in my runs it stays around ~0.98 and is more sensitive to setup/early-step overhead. The 128–256 range works much better and gives more stable speed numbers. I’d recommend moving the default to 128 for a balanced PR profile, or 256 if we want the profile to favor benchmark quality over runtime.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants