Add speed profiler by officialgr · Pull Request #431 · Luce-Org/lucebox-hub

officialgr · 2026-06-20T16:23:30Z

Summary

Adds a report-only speed profiling workflow for Lucebox inference changes.

Changes

Adds server/scripts/profile.py
Adds .github/workflows/speed-profile.yml
Adds profiler documentation at server/docs/speed-profile.md

Notes

Workflow is report-only / non-blocking.
Intended for the self-hosted RTX 3090 GPU runner.
Produces profile.json, profile.md, and nsys artifacts when available.

cubic-dev-ai

5 issues found across 3 files

_{Reply with feedback, questions, or to request a fix.

Re-trigger cubic}

- Losslessness gate now aggregates across all prompts (was only checking the first). A gate that can silently miss divergences on prompts 2+ is worse than no gate. - Temp dir changed to mkdtemp() per run (unique path, no shared /tmp/lucebox_profile). Low real-world risk since CI serializes via the concurrency group, but a clean fix regardless. - nsys per-token metrics now normalise by actual generated token count rather than requested n_gen. Note: per-token figures remain directional until the NVTX decode-window refactor (the nsys pass still covers load + prefill + decode together).

cubic-dev-ai

1 issue found across 1 file (changes from recent commits).

_{Reply with feedback, questions, or to request a fix.

Re-trigger cubic}

- Added "Reset GPU clocks" step (nvidia-smi -rgc, if: always()) so the lock set at job start never persists on the shared runner between PRs. - Guarded `cat server/profile.md` with a file-existence check so the summary step doesn't error when the profiler run fails.

The unique per-run temp dir (added to fix the earlier race) was never removed, so repeated CI runs would grow /tmp unbounded. Register an atexit cleanup that runs on success or crash; the kept artifacts (json/md/nsys-rep) are written outside the temp dir before cleanup.

davide221 · 2026-06-22T09:27:57Z

@officialgr thanks for the profiler work, can you fix the CI error?

The speed-profile job built the binaries but invoked profile.py with the bare system python3, which lacks `transformers` (used to tokenize prompts). Install it into an isolated venv so the shared runner's Python isn't polluted and the full project env (torch, etc.) isn't dragged in for a tokenizer-only dependency.

cubic-dev-ai

1 issue found across 1 file (changes from recent commits).

_{Tip: Review your code locally with the cubic CLI to iterate faster.

Re-trigger cubic}

officialgr · 2026-06-22T10:12:21Z

@davide221 Fixed — the job was running profile.py with the system python3, which doesn't have transformers (used for tokenization). It now installs it into an isolated venv before running.

…le.yml Pin profiler deps and add model validation to speed-profile workflow

officialgr · 2026-06-22T12:13:51Z

The failure was caused by the speed-profile workflow looking for the wrong speculative draft model path. The workflow now defaults to the repo-documented Qwen3.6 draft GGUF path and skips cleanly with diagnostics if runner-local weights are missing.

davide221 · 2026-06-22T12:37:09Z

Nice progress, it now runs end-to-end on the 3090 and is already useful. Two things to look at in your own passing run before this is mergeable:

The lossless gate reports FAIL (2/3 prompts diverge) but the job is still green. Is that a real correctness bug or too-strict a check?
The nsys section prints 0.0 everywhere with an empty table

A couple of things would make this much more valuable: a baseline + threshold so it can actually flag a regression and a thought on whether we should also profile dflash_server (the shipping path), since test_dflash can't see the prefix cache / batching.

Updated parameters and CI details for speed profiling, including losslessness gate and local run instructions.

cubic-dev-ai

2 issues found across 3 files (changes from recent commits).

_{Tip: Review your code locally with the cubic CLI to iterate faster.

Re-trigger cubic}

Build the optional baseline arguments as a Bash array and expand them with "${baseline_arg[@]}" so baseline paths containing spaces or shell metacharacters are passed to the profiler as intended.

Treat missing kernel-level nsys stats as a profiling error even when other nsys sections parse successfully, preventing zero kernel metrics from being rendered as real data. Surface non-critical nsys section failures as Markdown warnings while preserving valid kernel metrics.

cubic-dev-ai

1 issue found across 2 files (changes from recent commits).

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="server/scripts/profile.py">

<violation number="1" location="server/scripts/profile.py:446">
P2: NSYS warnings are silently dropped from markdown when a kernel error exists because `render_md` only renders warnings inside the `else` branch, but `summarize_nsys` can populate both `error` and `warnings` simultaneously.</violation>
</file>

_{Tip: Review your code locally with the cubic CLI to iterate faster.

Re-trigger cubic}

…d-profile-scripts Improve nsys diagnostics and warnings; fix Actions baseline argument expansion

officialgr · 2026-06-23T00:46:05Z

@davide221
Ran it end-to-end on the 3090 at the shipping defaults (Qwen3.6-27B, budget 22, n_gen 48).

1. Lossless — gate is no longer "too strict." Added an AR-vs-AR determinism control (reuses the existing AR passes, no extra GPU cost): a prompt only FAILs when AR is self-deterministic and spec-decode still differs; intrinsically-nondeterministic prompts become inconclusive. On my run it flags a divergence at token #44 (2/3 prompts). I checked whether it's a tail artifact by re-running at n_gen 96 — it stayed at #44, so it's reproducible/structural, not a length effect. I'm deliberately not over-calling it: it's ruled out as run-to-run noise, but it could still be batched-verify FP (verify scores draft tokens as a batch vs AR one-at-a-time). The definitive bug-vs-FP test is the logit gap at #44 (near-tie = FP, clear gap = bug) — the binaries don't emit per-token logits yet, so I'm flagging it for the engine team rather than claiming a bug.

2. nsys — fixed, now fully populated. Root cause was report-name drift across nsys versions (cuda_gpu_kern_sum ↔ legacy gpukernsum) plus a swallowed error. It now tries both name sets and, if stats still won't parse, prints the nsys error + version instead of silent zeros. On Qwen3.6: 568.8 ms GPU kernel time, 674.6 launches/token, top kernel mul_mat_q (187 ms / 2880 launches), 75.7 ms/token in host↔device copies.

3. Baseline + threshold — added. --baseline + --regress-pct (default 10%, wide enough for run-to-run variance), direction-aware (tok/s & AL up = good; ms/token & TTFT down = good). The report flags ⚠️ REGRESSION and CI emits a non-blocking warning. Verified the delta path on a self-vs-self run (✅ within threshold). Seeded server/scripts/speed-baseline.json from the green run.

4. dflash_server — recommendation (no code, since you asked for "a thought"). Keep the binaries as the per-PR signal (deterministic, low-noise, isolates compute). The shipping path's prefix-cache + batching is a different question (throughput under load), so I'd add it as a separate, higher-variance Phase 2 — replay a request trace at fixed concurrency; TTFT/p50/p99 + cache-hit rate — on a schedule/label, not every PR. Written up in docs/specs/speed-profile.md.

Final note on n_gen. n_gen=48 is good for CI speed, but it is too short to give the most representative throughput signal — in my runs it stays around ~0.98 and is more sensitive to setup/early-step overhead. The 128–256 range works much better and gives more stable speed numbers. I’d recommend moving the default to 128 for a balanced PR profile, or 256 if we want the profile to favor benchmark quality over runtime.

officialgr added 3 commits June 20, 2026 18:14

Add report-only GPU speed profiling workflow

7679e76

Add Lucebox speed profiler

e6631b5

Document speed profiler workflow

0f2b2a2

cubic-dev-ai Bot reviewed Jun 20, 2026

View reviewed changes

Comment thread server/scripts/profile.py Outdated

Comment thread .github/workflows/speed-profile.yml

Comment thread server/scripts/profile.py Outdated

Comment thread server/scripts/profile.py Outdated

Comment thread .github/workflows/speed-profile.yml Outdated

cubic-dev-ai Bot reviewed Jun 20, 2026

View reviewed changes

Comment thread server/scripts/profile.py

officialgr added 2 commits June 20, 2026 19:50

cubic-dev-ai Bot reviewed Jun 22, 2026

View reviewed changes

Comment thread .github/workflows/speed-profile.yml Outdated

officialgr added 3 commits June 22, 2026 12:50

Fix speed profile model and pin deps

1ff1b8b

Merge pull request #1 from officialgr/codex/fix-issues-in-speed-profi…

0868bc2

…le.yml Pin profiler deps and add model validation to speed-profile workflow

Update speed profile workflow for model paths and checks

8ab8539

officialgr added 3 commits June 23, 2026 01:47

Update profile.py

85db576

Revise speed profile documentation with new parameters

18a5c26

Updated parameters and CI details for speed profiling, including losslessness gate and local run instructions.

Enhance speed profiling workflow with regression checks

af9dd44

cubic-dev-ai Bot reviewed Jun 22, 2026

View reviewed changes

Comment thread .github/workflows/speed-profile.yml Outdated

Comment thread server/scripts/profile.py Outdated

officialgr added 2 commits June 23, 2026 02:20

Quote speed profile baseline args safely

2f7a9d0

Build the optional baseline arguments as a Bash array and expand them with "${baseline_arg[@]}" so baseline paths containing spaces or shell metacharacters are passed to the profiler as intended.

cubic-dev-ai Bot reviewed Jun 23, 2026

View reviewed changes

Comment thread server/scripts/profile.py

officialgr added 3 commits June 23, 2026 02:30

Render nsys warnings with errors

286328a

Merge pull request #2 from officialgr/codex/fix-issues-in-workflow-an…

8428631

…d-profile-scripts Improve nsys diagnostics and warnings; fix Actions baseline argument expansion

Fix formatting in speed-profile.md

c91b30e

Conversation

officialgr commented Jun 20, 2026 • edited by cubic-dev-ai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Notes

Uh oh!

cubic-dev-ai Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

cubic-dev-ai Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

davide221 commented Jun 22, 2026

Uh oh!

cubic-dev-ai Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

officialgr commented Jun 22, 2026

Uh oh!

officialgr commented Jun 22, 2026

Uh oh!

davide221 commented Jun 22, 2026

Uh oh!

cubic-dev-ai Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

officialgr commented Jun 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

officialgr commented Jun 20, 2026 •

edited by cubic-dev-ai Bot

Loading

cubic-dev-ai Bot left a comment •

edited

Loading

cubic-dev-ai Bot left a comment •

edited

Loading

cubic-dev-ai Bot left a comment •

edited

Loading

cubic-dev-ai Bot left a comment •

edited

Loading