Add speculative decoding telemetry and memory gating by aleroot · Pull Request #314 · ml-explore/mlx-swift-lm

aleroot · 2026-05-27T13:15:40Z

Proposed changes

Speculative decoding is sensitive to both workload and hardware. A draft model can look good in theory but fail to pay off if the acceptance rate is low or if the memory pressure added by the extra model is too high.

The change set prepares the project for edge-aware speculative decoding by adding the two missing foundations:

- observe speculative decoding quality/performance
- avoid applying auxiliary-model speculation when memory pressure makes it likely to hurt

Telemetry

Telemetry exposes whether speculative decoding is actually helping at runtime. Callers can now inspect the acceptance rate, draft tokens, target verification calls, emitted tokens, and average emitted tokens per target call through GenerateCompletionInfo.speculativeDecodingTelemetry.
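To make the listed metrics concrete, here is a minimal sketch of how such counters could be accumulated and turned into an acceptance rate and an average of emitted tokens per target call. The type name `SpeculativeTelemetrySketch`, its fields, and `recordRound` are hypothetical and chosen for illustration; only `GenerateCompletionInfo.speculativeDecodingTelemetry` is named in this PR, and its actual shape may differ.

```swift
// Hypothetical sketch: this type and its member names are assumptions,
// not the API added by this PR.
struct SpeculativeTelemetrySketch {
    var draftTokensProposed = 0       // tokens proposed by the draft model
    var draftTokensAccepted = 0       // proposed tokens the target model accepted
    var targetVerificationCalls = 0   // forward passes of the target model
    var emittedTokens = 0             // tokens actually returned to the caller

    /// Fraction of drafted tokens that were accepted; 0 when nothing was drafted.
    var acceptanceRate: Double {
        draftTokensProposed == 0
            ? 0 : Double(draftTokensAccepted) / Double(draftTokensProposed)
    }

    /// Average number of tokens emitted per target verification call.
    var averageEmittedTokensPerTargetCall: Double {
        targetVerificationCalls == 0
            ? 0 : Double(emittedTokens) / Double(targetVerificationCalls)
    }

    /// Records one draft-then-verify round.
    mutating func recordRound(proposed: Int, accepted: Int, emitted: Int) {
        precondition(accepted >= 0 && accepted <= proposed, "accepted must be within 0...proposed")
        draftTokensProposed += proposed
        draftTokensAccepted += accepted
        targetVerificationCalls += 1
        emittedTokens += emitted
    }
}
```

An average of emitted tokens per target call close to 1 would suggest the draft model is adding work without saving target passes, which is the situation this telemetry is meant to surface.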

Memory-Aware Gating

ChatSession now defaults speculative decoding to a memory policy based on GPU.maxRecommendedWorkingSetBytes(). Previously, enabling speculative decoding meant loading and running the main model plus the draft model whenever requested. On memory-constrained systems, that can make generation slower or less stable.
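The gating idea can be sketched as a simple budget check. This is an illustration under stated assumptions, not the policy implemented in this PR: the type `MemoryGateSketch`, its parameters, and the 0.8 budget fraction are all hypothetical. In real use the budget would come from GPU.maxRecommendedWorkingSetBytes(); here it is passed in as a plain integer so the sketch runs without MLX.

```swift
// Hypothetical sketch: the type, parameter names, and default fraction are
// assumptions for illustration, not the policy shipped in this PR.
struct MemoryGateSketch {
    /// Fraction of the working-set budget that both models together may occupy.
    var maxBudgetFraction: Double = 0.8

    /// Returns true when the main and draft models fit within the allowed share
    /// of the working-set budget, so speculation is unlikely to add memory pressure.
    func allowsDraftModel(
        mainModelBytes: Int,
        draftModelBytes: Int,
        workingSetBudgetBytes: Int
    ) -> Bool {
        // An unknown or zero budget is treated conservatively: no speculation.
        guard workingSetBudgetBytes > 0 else { return false }
        let required = Double(mainModelBytes) + Double(draftModelBytes)
        return required <= Double(workingSetBudgetBytes) * maxBudgetFraction
    }
}
```

With a gate like this, a caller would fall back to plain decoding with the main model when the check fails, rather than loading the draft model anyway.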

Checklist

Put an x in the boxes that apply.

I have read the CONTRIBUTING document
I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes
I have added tests that prove my fix is effective or that my feature works
I have updated the necessary documentation (if needed)

aleroot force-pushed the speculative-telemetry-memory-gating branch 2 times, most recently from 19cd054 to eb33b2d Compare May 30, 2026 03:44

Add speculative decoding telemetry and memory gating

a4a75f8

aleroot force-pushed the speculative-telemetry-memory-gating branch from eb33b2d to a4a75f8 Compare June 12, 2026 04:42

aleroot wants to merge 1 commit into ml-explore:main from aleroot:speculative-telemetry-memory-gating
