Two cleanups on the Qwen3 decode path, no intended change to production serving.
Before
The split-KV decode config (chunk-size formula, per-request cap, and a split_kv_* label) was spread across runtime constants, a duplicated formula, and a hardcoded split_kv_256x64 label at several call sites, which could silently desync.
qwen3_model_report timed decode projection GEMMs in a fresh untuned context, so under the default Tuned policy it measured plain GemmEx, not the algo production runs.
After
Split-KV config is single-sourced: SplitKvConfig (formula/label/parse) lives in the qwen3 crate (src/split_kv.rs) and openinfer-core carries a typed PagedDecodePath; the trace records the runtime-resolved chunk/cap as attrs instead of a hardcoded label, which the report reads back, failing loud if missing or divergent.
The report routes projection GEMM measures through the production --policy and launch_gemm: Pin/PerToken are measured faithfully; Tuned is flagged unfaithful_gemmex and excluded from the totals (pointing at --policy pin). It records measured_split_kv and a test asserts it equals the recorded attrs, so the measure cannot silently re-derive chunk/cap from kv_len. The default output path is policy-keyed.
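A minimal sketch of the single-writer/single-reader attr channel described above, assuming the trace attrs are a plain string map. Only `PagedDecodePath::SplitKv { chunk_size, cap }` comes from this PR; the `Single` variant, the attr key names, and the `write_attrs`/`read_attrs` helpers are hypothetical.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the trace attribute bag.
pub type Attrs = BTreeMap<String, String>;

/// Typed decode path. The `Single` variant is an assumption for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PagedDecodePath {
    Single,
    SplitKv { chunk_size: usize, cap: usize },
}

/// Look up a required attr, failing loudly when it is absent.
fn required<'a>(attrs: &'a Attrs, key: &str) -> Result<&'a str, String> {
    attrs
        .get(key)
        .map(String::as_str)
        .ok_or_else(|| format!("missing trace attr `{key}`"))
}

/// Parse a required attr as usize, failing loudly when it is malformed.
fn required_usize(attrs: &Attrs, key: &str) -> Result<usize, String> {
    required(attrs, key)?
        .parse::<usize>()
        .map_err(|e| format!("bad trace attr `{key}`: {e}"))
}

impl PagedDecodePath {
    /// Single writer: record the runtime-resolved scalars verbatim.
    pub fn write_attrs(&self, attrs: &mut Attrs) {
        match *self {
            PagedDecodePath::Single => {
                attrs.insert("decode_path".into(), "single".into());
            }
            PagedDecodePath::SplitKv { chunk_size, cap } => {
                attrs.insert("decode_path".into(), "split_kv".into());
                attrs.insert("split_kv_chunk_size".into(), chunk_size.to_string());
                attrs.insert("split_kv_cap".into(), cap.to_string());
            }
        }
    }

    /// Single reader: replay what was recorded, never re-derive from kv_len.
    pub fn read_attrs(attrs: &Attrs) -> Result<Self, String> {
        match required(attrs, "decode_path")? {
            "single" => Ok(PagedDecodePath::Single),
            "split_kv" => Ok(PagedDecodePath::SplitKv {
                chunk_size: required_usize(attrs, "split_kv_chunk_size")?,
                cap: required_usize(attrs, "split_kv_cap")?,
            }),
            other => Err(format!("unknown decode_path `{other}`")),
        }
    }
}
```

Because the reader returns an error instead of falling back to a default, a trace that lost its attrs cannot quietly produce a report for a different chunk/cap.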
Core move is right: routing the report's projection-GEMM measure through numeric_policy() and replaying split-KV scalars verbatim from synced_split_kv → PagedDecodePath attrs is a clean single-writer/single-reader channel. The problem is that the two tests re-verify invariants the types already own.
pin_trace_chunk_size — delete
The invariant ("trace records the resolved chunk, not a kv_len re-derivation") is already in PagedDecodePath::SplitKv { chunk_size, cap } + the one-write/one-read synced_split_kv. The test is a full GPU + safetensors run verifying a field copy. The large = 8192 pick also hides assumptions about max_position_embeddings that break silently when max_pos / 256 == 8192 / 64.
report_gemm_faithful — push most of it into types
The Tuned→unfaithful_gemmex, Pin/PerToken→totals, total_is_partial chain is a runtime if over an ad-hoc string check — a contract the type system should own. Dispatch measure_catalog on policy so a Tuned GEMM measure returns Excluded, not LatencyStats; total_is_partial then falls out of by_op, not a post-hoc coverage_rows scan.
What's left of the test then is pin_served > 0 — a genuine runtime observation the compiler cannot reach, worth its CI cost. Note the current PerToken branch is a false-positive gate: it asserts pin_served == 0 + measured but never observes that PerToken served anything PerToken-specific, so it would pass for a --policy per-token silently falling back to GemmEx.
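To make the suggestion concrete, here is a rough sketch of dispatching the measure on policy so that an excluded measure is a distinct variant rather than a flagged `LatencyStats`. All names here (`NumericPolicy`, `GemmMeasure`, `measure_gemm`, `totals`, the `mean_us` field) are hypothetical stand-ins, not the crate's actual types.

```rust
use std::collections::BTreeMap;

/// Hypothetical policy enum mirroring the `--policy` values named in the PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NumericPolicy {
    Tuned,
    Pin,
    PerToken,
}

/// Hypothetical latency summary; the real struct likely carries more fields.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct LatencyStats {
    pub mean_us: f64,
}

/// A measure is either a faithful timing or an explicit exclusion.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum GemmMeasure {
    Measured(LatencyStats),
    Excluded { reason: &'static str },
}

/// Dispatch on policy: Tuned never yields stats, so callers cannot
/// accidentally add an unfaithful number into the totals.
pub fn measure_gemm(
    policy: NumericPolicy,
    time_it: impl FnOnce() -> LatencyStats,
) -> GemmMeasure {
    match policy {
        NumericPolicy::Tuned => GemmMeasure::Excluded { reason: "unfaithful_gemmex" },
        NumericPolicy::Pin | NumericPolicy::PerToken => GemmMeasure::Measured(time_it()),
    }
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Totals {
    pub total_us: f64,
    pub total_is_partial: bool,
}

/// `total_is_partial` falls out of the per-op map; no post-hoc scan needed.
pub fn totals(by_op: &BTreeMap<String, GemmMeasure>) -> Totals {
    let mut out = Totals { total_us: 0.0, total_is_partial: false };
    for measure in by_op.values() {
        match measure {
            GemmMeasure::Measured(stats) => out.total_us += stats.mean_us,
            GemmMeasure::Excluded { .. } => out.total_is_partial = true,
        }
    }
    out
}
```

With this shape the Tuned exclusion is enforced by the `match`, and the remaining test only has to observe runtime facts such as pin_served > 0.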
Minor
schema: 5 skips 3 versions; either note "local-only trials, shipping at 5" or bump by one.
AttentionDecodeCase::new(batch, kv_len) is dead and inherits the now-drifted DEFAULT_SPLIT_KV_CONFIG. Delete it.
SplitKvConfig::new is a const fn with no guard against zero; actual_chunk_size panics on new(64, 0). Reject 0 (drop const) or return usize::MAX on zero cap.
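A sketch of the "reject 0" option, assuming the second argument to `new` is the cap (as `new(64, 0)` with "zero cap" suggests). The field names, accessors, and the `Option` return are hypothetical; the real struct and its `actual_chunk_size` formula are not reproduced here.

```rust
/// Hypothetical shape of the config; only the type name and the
/// two-argument constructor come from the PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SplitKvConfig {
    chunk: usize,
    cap: usize,
}

impl SplitKvConfig {
    /// Reject a zero cap at construction so later arithmetic on the cap
    /// can never see zero. Returns `None` instead of panicking.
    pub fn new(chunk: usize, cap: usize) -> Option<Self> {
        if cap == 0 {
            None
        } else {
            Some(Self { chunk, cap })
        }
    }

    pub fn chunk(&self) -> usize {
        self.chunk
    }

    pub fn cap(&self) -> usize {
        self.cap
    }
}
```

Callers that build the config from parsed input then have to handle the `None` case explicitly, which moves the failure from the decode path to configuration time.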
Request changes. Runtime plumbing is good; tests are carrying invariants the types already own.
Description
Refs #414, #435. Follows #462.
Test Env
test suite passed on sm_89, x86_64.
Type of Change