Add speculative decoding telemetry and memory gating#314

Open
aleroot wants to merge 1 commit into ml-explore:main from aleroot:speculative-telemetry-memory-gating

@aleroot aleroot commented May 27, 2026

Proposed changes

Speculative decoding is workload- and hardware-sensitive. A draft model can look good in theory but fail to pay off if the acceptance rate is low or if the memory pressure from the extra model is too high.

The change set prepares the project for edge-aware speculative decoding by adding the two missing foundations:

  1. observe speculative decoding quality/performance
  2. avoid applying auxiliary-model speculation when memory pressure makes it likely to hurt

Telemetry

Telemetry exposes whether speculative decoding is actually helping at runtime. Callers can now inspect acceptance rate, draft tokens, target verification calls, emitted tokens, and average emitted tokens per target call through GenerateCompletionInfo.speculativeDecodingTelemetry.
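To make the relationship between these counters concrete, here is a minimal, self-contained sketch of how the two derived metrics could be computed. `SpeculativeTelemetrySketch`, its field names, and `recordRound` are hypothetical stand-ins for illustration, not the type added by this PR, and the sketch assumes each target verification call also contributes one token of its own.

```swift
// Hypothetical stand-in for the telemetry described above; the PR's real
// type and field names may differ.
struct SpeculativeTelemetrySketch {
    var draftTokens = 0              // tokens proposed by the draft model
    var acceptedDraftTokens = 0      // proposed tokens the target model accepted
    var targetVerificationCalls = 0  // verification passes of the target model
    var emittedTokens = 0            // tokens actually produced for the caller

    /// Fraction of draft tokens accepted; 0 when nothing was drafted.
    var acceptanceRate: Double {
        guard draftTokens > 0 else { return 0 }
        return Double(acceptedDraftTokens) / Double(draftTokens)
    }

    /// Average tokens emitted per target call; values near 1 mean little gain.
    var averageEmittedTokensPerTargetCall: Double {
        guard targetVerificationCalls > 0 else { return 0 }
        return Double(emittedTokens) / Double(targetVerificationCalls)
    }

    /// Records one verification round: `drafted` proposals, `accepted` of them kept.
    /// Assumption: the target call also emits one token of its own.
    mutating func recordRound(drafted: Int, accepted: Int) {
        precondition(accepted >= 0 && accepted <= drafted)
        draftTokens += drafted
        acceptedDraftTokens += accepted
        targetVerificationCalls += 1
        emittedTokens += accepted + 1
    }
}

var telemetry = SpeculativeTelemetrySketch()
telemetry.recordRound(drafted: 4, accepted: 3)
telemetry.recordRound(drafted: 4, accepted: 1)
print(telemetry.acceptanceRate, telemetry.averageEmittedTokensPerTargetCall)
```

A low acceptance rate combined with an average close to one emitted token per target call would suggest the draft model is adding cost without reducing target work.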

Memory-Aware Gating

ChatSession now defaults speculative decoding to a memory policy based on GPU.maxRecommendedWorkingSetBytes(). Before this, enabling speculative decoding meant loading and running the draft model alongside the main model whenever requested. On memory-constrained systems, that can make generation slower or less stable.
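The kind of check such a policy might perform can be sketched as follows. This is an illustrative assumption, not the PR's implementation: `MemoryGateSketch`, `maxUtilization`, and `allowsDraftModel` are hypothetical names, the 0.8 threshold is arbitrary, and the working-set limit is passed in as a plain parameter so the example runs without the GPU API.

```swift
// Hypothetical policy sketch; not the PR's actual implementation.
struct MemoryGateSketch {
    /// Fraction of the recommended working set both models may occupy.
    /// Assumed value, chosen to leave room for KV caches and activations.
    var maxUtilization: Double = 0.8

    /// Returns true when main + draft model are estimated to fit the budget.
    /// `workingSetBytes` is nil when no limit is known; this sketch then
    /// allows speculation (an assumption, a stricter policy could refuse).
    func allowsDraftModel(
        mainModelBytes: Int, draftModelBytes: Int, workingSetBytes: Int?
    ) -> Bool {
        guard let limit = workingSetBytes, limit > 0 else { return true }
        let budget = Double(limit) * maxUtilization
        return Double(mainModelBytes + draftModelBytes) <= budget
    }
}

let gate = MemoryGateSketch()
let gib = 1 << 30

// 5 GiB of weights against a 16 GiB working set: speculation is allowed.
let roomy = gate.allowsDraftModel(
    mainModelBytes: 4 * gib, draftModelBytes: 1 * gib, workingSetBytes: 16 * gib)

// 7 GiB of weights against an 8 GiB working set: the draft model is skipped.
let tight = gate.allowsDraftModel(
    mainModelBytes: 6 * gib, draftModelBytes: 1 * gib, workingSetBytes: 8 * gib)

print(roomy, tight)
```

When the gate refuses, generation would simply fall back to the main model alone, so a request for speculative decoding degrades gracefully instead of increasing memory pressure.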

Checklist

Put an x in the boxes that apply.

  • I have read the CONTRIBUTING document
  • I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the necessary documentation (if needed)

@aleroot aleroot force-pushed the speculative-telemetry-memory-gating branch 2 times, most recently from 19cd054 to eb33b2d Compare May 30, 2026 03:44
@aleroot aleroot force-pushed the speculative-telemetry-memory-gating branch from eb33b2d to a4a75f8 Compare June 12, 2026 04:42