Add speed profiler#431
Conversation
There was a problem hiding this comment.
5 issues found across 3 files
Reply with feedback, questions, or to request a fix.
Re-trigger cubic
- Losslessness gate now aggregates across all prompts (was only checking the first). A gate that can silently miss divergences on prompts 2+ is worse than no gate. - Temp dir changed to mkdtemp() per run (unique path, no shared /tmp/lucebox_profile). Low real-world risk since CI serializes via the concurrency group, but a clean fix regardless. - nsys per-token metrics now normalise by actual generated token count rather than requested n_gen. Note: per-token figures remain directional until the NVTX decode-window refactor (the nsys pass still covers load + prefill + decode together).
There was a problem hiding this comment.
1 issue found across 1 file (changes from recent commits).
Reply with feedback, questions, or to request a fix.
Re-trigger cubic
- Added "Reset GPU clocks" step (nvidia-smi -rgc, if: always()) so the lock set at job start never persists on the shared runner between PRs. - Guarded `cat server/profile.md` with a file-existence check so the summary step doesn't error when the profiler run fails.
The unique per-run temp dir (added to fix the earlier race) was never removed, so repeated CI runs would grow /tmp unbounded. Register an atexit cleanup that runs on success or crash; the kept artifacts (json/md/nsys-rep) are written outside the temp dir before cleanup.
|
@officialgr thanks for the profiler work, can you fix the CI error? |
The speed-profile job built the binaries but invoked profile.py with the bare system python3, which lacks `transformers` (used to tokenize prompts). Install it into an isolated venv so the shared runner's Python isn't polluted and the full project env (torch, etc.) isn't dragged in for a tokenizer-only dependency.
There was a problem hiding this comment.
1 issue found across 1 file (changes from recent commits).
Tip: Review your code locally with the cubic CLI to iterate faster.
Re-trigger cubic
|
@davide221 Fixed — the job was running profile.py with the system python3, which doesn't have transformers (used for tokenization). It now installs it into an isolated venv before running. |
…le.yml Pin profiler deps and add model validation to speed-profile workflow
|
The failure was caused by the speed-profile workflow looking for the wrong speculative draft model path. The workflow now defaults to the repo-documented Qwen3.6 draft GGUF path and skips cleanly with diagnostics if runner-local weights are missing. |
|
Nice progress, it now runs end-to-end on the 3090 and is already useful. Two things to look at in your own passing run before this is mergeable:
A couple of things would make this much more valuable: a baseline + threshold so it can actually flag a regression and a thought on whether we should also profile dflash_server (the shipping path), since test_dflash can't see the prefix cache / batching. |
Updated parameters and CI details for speed profiling, including losslessness gate and local run instructions.
There was a problem hiding this comment.
2 issues found across 3 files (changes from recent commits).
Tip: Review your code locally with the cubic CLI to iterate faster.
Re-trigger cubic
Build the optional baseline arguments as a Bash array and expand them with "${baseline_arg[@]}" so baseline paths containing spaces or shell metacharacters are passed to the profiler as intended.
Treat missing kernel-level nsys stats as a profiling error even when other nsys sections parse successfully, preventing zero kernel metrics from being rendered as real data. Surface non-critical nsys section failures as Markdown warnings while preserving valid kernel metrics.
There was a problem hiding this comment.
1 issue found across 2 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="server/scripts/profile.py">
<violation number="1" location="server/scripts/profile.py:446">
P2: NSYS warnings are silently dropped from markdown when a kernel error exists because `render_md` only renders warnings inside the `else` branch, but `summarize_nsys` can populate both `error` and `warnings` simultaneously.</violation>
</file>
Tip: Review your code locally with the cubic CLI to iterate faster.
Re-trigger cubic
…d-profile-scripts Improve nsys diagnostics and warnings; fix Actions baseline argument expansion
|
@davide221 1. Lossless — gate is no longer "too strict." Added an AR-vs-AR determinism control (reuses the existing AR passes, no extra GPU cost): a prompt only FAILs when AR is self-deterministic and spec-decode still differs; intrinsically-nondeterministic prompts become inconclusive. On my run it flags a divergence at token #44 (2/3 prompts). I checked whether it's a tail artifact by re-running at n_gen 96 — it stayed at #44, so it's reproducible/structural, not a length effect. I'm deliberately not over-calling it: it's ruled out as run-to-run noise, but it could still be batched-verify FP (verify scores draft tokens as a batch vs AR one-at-a-time). The definitive bug-vs-FP test is the logit gap at #44 (near-tie = FP, clear gap = bug) — the binaries don't emit per-token logits yet, so I'm flagging it for the engine team rather than claiming a bug. 2. nsys — fixed, now fully populated. Root cause was report-name drift across nsys versions ( 3. Baseline + threshold — added. 4. dflash_server — recommendation (no code, since you asked for "a thought"). Keep the binaries as the per-PR signal (deterministic, low-noise, isolates compute). The shipping path's prefix-cache + batching is a different question (throughput under load), so I'd add it as a separate, higher-variance Phase 2 — replay a request trace at fixed concurrency; TTFT/p50/p99 + cache-hit rate — on a schedule/label, not every PR. Written up in Final note on |
Summary
Adds a report-only speed profiling workflow for Lucebox inference changes.
Changes
server/scripts/profile.py.github/workflows/speed-profile.ymlserver/docs/speed-profile.mdNotes
profile.json,profile.md, and nsys artifacts when available.