
Conversation

@blightbow (Contributor) commented Dec 13, 2025

Description

This PR fixes #6747 by implementing a thread-safe LRU prompt cache. The implementation is ported from mlx_lm's server.py, which embeds the cache code inline and does not expose a portable interface. mlx_lm's policy is that caching implementations should not be part of the core package, so the best we can do is copy the reference implementation. It's not ideal, but it is what it is.

The third commit undoes a mistaken assumption that more than one MLX backend would use the caching implementation: mlx-vlm's generate/stream_generate functions don't accept the prompt_cache parameter, so the cache now lives under backend/python/mlx/ instead of backend/python/common/.

MLX Backend Enhancements

Commits (4)

  1. feat(mlx): add thread-safe LRU prompt cache
  2. feat(mlx): add min_p and top_k sampler support
  3. refactor(mlx): move mlx_cache.py from common to mlx backend
  4. test(mlx): add comprehensive cache tests and document upstream behavior

Features

Thread-Safe LRU Prompt Cache (mlx_cache.py)

Ported from https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/server.py (MIT License, Copyright 2023-2024 Apple Inc.) with thread-safety additions for LocalAI's gRPC backend.

  • Trie-based prefix matching for KV cache reuse (exact, shorter prefix, longer prefix with trim)
  • Thread-safe via threading.Lock for gRPC concurrency
  • LRU eviction when max entries exceeded
  • Configurable via max_cache_entries and max_kv_size options
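
For orientation, the following is a minimal sketch of the shape such a cache takes, not the ported code: it swaps the trie for a flat dict with a linear prefix scan, and the class and method names (TinyLRUPromptCache, fetch, insert) are illustrative only. It shows the three ingredients the bullets above describe: a lock around every operation, longest-common-prefix lookup, and LRU eviction once max_cache_entries is exceeded.

```python
# Illustrative sketch only: a much-simplified stand-in for the ported
# ThreadSafeLRUPromptCache (flat dict + linear scan instead of a trie).
import threading
from collections import OrderedDict


def _common_prefix_len(a: tuple, b: tuple) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class TinyLRUPromptCache:
    def __init__(self, max_entries: int = 10):
        self._lock = threading.Lock()            # gRPC handlers run concurrently
        self._entries = OrderedDict()            # (model_key, tokens) -> kv_state
        self._max_entries = max_entries

    def fetch(self, model_key, tokens):
        """Return (kv_state, matched_len) for the longest cached common prefix."""
        with self._lock:
            best_key, best_len = None, 0
            for key in self._entries:
                mk, cached_tokens = key
                if mk != model_key:              # namespace entries per model
                    continue
                n = _common_prefix_len(cached_tokens, tuple(tokens))
                if n > best_len:
                    best_key, best_len = key, n
            if best_key is None:
                return None, 0
            self._entries.move_to_end(best_key)  # mark as most recently used
            # The real port deep-copies / reference-counts here; omitted for brevity.
            return self._entries[best_key], best_len

    def insert(self, model_key, tokens, kv_state):
        with self._lock:
            key = (model_key, tuple(tokens))
            self._entries[key] = kv_state
            self._entries.move_to_end(key)
            while len(self._entries) > self._max_entries:
                self._entries.popitem(last=False)  # evict least recently used
```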

Sampler Support (backend.py)

  • Adds min_p and top_k sampling parameters
  • Adds XTC sampling support (xtc_threshold, xtc_probability)
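
As a rough sketch of how these parameters can be forwarded to mlx_lm: the snippet below is an assumption, not the PR's backend.py. It assumes a recent mlx_lm whose make_sampler accepts min_p, top_k, and the XTC keyword arguments, and the field names (Temperature, TopP, MinP, TopK, XTCThreshold, XTCProbability) are taken from the bullets above rather than verified against this PR's proto.

```python
# Hypothetical wiring of PredictOptions fields into an mlx_lm sampler.
# getattr defaults mean the code still works if a field is absent from the proto.
from mlx_lm.sample_utils import make_sampler


def build_sampler(request):
    kwargs = {"temp": getattr(request, "Temperature", 0.0) or 0.0}
    if getattr(request, "TopP", 0.0):
        kwargs["top_p"] = request.TopP
    if getattr(request, "MinP", 0.0):
        kwargs["min_p"] = request.MinP
    if getattr(request, "TopK", 0):
        kwargs["top_k"] = int(request.TopK)
    # XTC sampling: only forward if requested (and only supported by newer mlx_lm).
    if getattr(request, "XTCProbability", 0.0):
        kwargs["xtc_probability"] = request.XTCProbability
        kwargs["xtc_threshold"] = getattr(request, "XTCThreshold", 0.0)
    return make_sampler(**kwargs)
```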

Test Coverage

Comprehensive unit tests (test_mlx_cache.py - 23 tests):

  • All cache operation modes: exact match, shorter prefix, longer prefix trim, no match
  • LRU eviction and access order updates
  • Reference counting and deep copy behavior
  • Multi-model namespacing
  • Thread safety with data integrity verification

Integration tests (test.py):

  • Cache reuse, prefix cache reuse, concurrent requests
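
The sketch below gives the flavor of these tests using the simplified TinyLRUPromptCache from the earlier sketch; it is a pytest-style illustration, not the actual test_mlx_cache.py or test.py code.

```python
# Illustrative pytest-style checks against the simplified cache sketch above.
import threading


def test_shorter_prefix_is_reused():
    cache = TinyLRUPromptCache(max_entries=2)
    cache.insert("model-a", [1, 2, 3], kv_state="kv-123")
    state, matched = cache.fetch("model-a", [1, 2, 3, 4, 5])
    assert state == "kv-123" and matched == 3      # reuse 3 tokens, extend by 2


def test_concurrent_fetch_insert_keeps_lru_bound():
    cache = TinyLRUPromptCache(max_entries=8)

    def worker(i):
        cache.insert("model-a", [i, i + 1], kv_state=f"kv-{i}")
        cache.fetch("model-a", [i, i + 1, i + 2])

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(16)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Peeks at the sketch's internals: eviction must keep the LRU bound intact.
    assert len(cache._entries) <= 8
```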

Files Changed

backend/python/mlx/backend.py | 147 ++++++++---
backend/python/mlx/mlx_cache.py | 266 +++++++++++++++++++ (new)
backend/python/mlx/test.py | 140 ++++++++--
backend/python/mlx/test_mlx_cache.py | 480 +++++++++++++++++++++++++++++++++++ (new)

Notes for Reviewers

Signed commits

  • Yes, I signed my commits.

netlify bot commented Dec 13, 2025

Deploy Preview for localai ready!

🔨 Latest commit: 60c3b35
🔍 Latest deploy log: https://app.netlify.com/projects/localai/deploys/69409b519e242100075ae0d3
😎 Deploy Preview: https://deploy-preview-7556--localai.netlify.app

blightbow and others added 4 commits December 14, 2025 00:40
Port mlx-lm's LRUPromptCache to fix race condition where concurrent
requests corrupt shared KV cache state. The previous implementation
used a single prompt_cache instance shared across all requests.

Changes:
- Add backend/python/common/mlx_cache.py with ThreadSafeLRUPromptCache
- Modify backend.py to use per-request cache isolation via fetch/insert
- Add prefix matching for cache reuse across similar prompts
- Add LRU eviction (default 10 entries, configurable)
- Add concurrency and cache unit tests

The cache uses a trie-based structure for efficient prefix matching,
allowing prompts that share common prefixes to reuse cached KV states.
Thread safety is provided via threading.Lock.

New configuration options:
- max_cache_entries: Maximum LRU cache entries (default: 10)
- max_kv_size: Maximum KV cache size per entry (default: None)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Signed-off-by: Blightbow <[email protected]>
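
For context, the per-request fetch/insert flow described in this commit looks roughly like the sketch below. It is a hedged illustration, not backend.py: the function and variable names are mine, and the deep-copy/reference-counting and prefix-trimming that the real port performs are omitted. It assumes mlx_lm's stream_generate and make_prompt_cache, which accept a prompt_cache as noted later in this thread.

```python
# Hypothetical per-request flow: fetch the longest cached prefix, generate with
# mlx_lm, then re-insert the updated KV state keyed by the full prompt.
from mlx_lm import stream_generate
from mlx_lm.models.cache import make_prompt_cache


def generate_with_cache(model, tokenizer, model_key, tokens, cache, sampler):
    kv_state, matched = cache.fetch(model_key, tokens)
    if kv_state is None:
        kv_state = make_prompt_cache(model)   # no cached prefix: fresh KV cache
        remaining = tokens
    else:
        # Only feed the unseen suffix (the real port also handles the
        # exact-match case by trimming the cache; omitted here).
        remaining = tokens[matched:]
    text = ""
    for chunk in stream_generate(model, tokenizer, remaining,
                                 prompt_cache=kv_state, sampler=sampler):
        text += chunk.text
    cache.insert(model_key, tokens, kv_state)  # next matching prompt reuses this
    return text
```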
Add MinP field to proto (field 52) following the precedent set by
other non-OpenAI sampling parameters like TopK, TailFreeSamplingZ,
TypicalP, and Mirostat.

Changes:
- backend.proto: Add float MinP field for min-p sampling
- backend.py: Extract and pass min_p and top_k to mlx_lm sampler
  (top_k was in proto but not being passed)
- test.py: Fix test_sampling_params to use valid proto fields and
  switch to MLX-compatible model (mlx-community/Llama-3.2-1B-Instruct)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Signed-off-by: Blightbow <[email protected]>
The ThreadSafeLRUPromptCache is only used by the mlx backend. After
evaluating mlx-vlm, it was determined that the cache cannot be shared
because mlx-vlm's generate/stream_generate functions don't support
the prompt_cache parameter that mlx_lm provides.

- Move mlx_cache.py from backend/python/common/ to backend/python/mlx/
- Remove sys.path manipulation from backend.py and test.py
- Fix test assertion to expect "MLX model loaded successfully"

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Signed-off-by: Blightbow <[email protected]>
Added comprehensive unit tests (test_mlx_cache.py) covering all cache
operation modes:
- Exact match
- Shorter prefix match
- Longer prefix match with trimming
- No match scenarios
- LRU eviction and access order
- Reference counting and deep copy behavior
- Multi-model namespacing
- Thread safety with data integrity verification

Documents upstream mlx_lm/server.py behavior: single-token prefixes are
deliberately not matched (uses > 0, not >= 0) to allow longer cached
sequences to be preferred for trimming. This is acceptable because real
prompts with chat templates are always many tokens.

Removed weak unit tests from test.py that only verified "no exception
thrown" rather than correctness.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Signed-off-by: Blightbow <[email protected]>
@blightbow (Contributor, Author) commented:

Force pushed to address DCO.

@blightbow changed the title from "MLX backend cache rework" to "feat(mlx): add thread-safe LRU prompt cache and min_p/top_k sampling" on Dec 14, 2025
string ToolChoice = 49; // JSON string or object specifying tool choice behavior
int32 Logprobs = 50; // Number of top logprobs to return (maps to OpenAI logprobs parameter)
int32 TopLogprobs = 51; // Number of top logprobs to return per token (maps to OpenAI top_logprobs parameter)
float MinP = 52; // Min-p sampling: minimum probability threshold scaled by top token probability
@mudler (Owner) commented on the backend.proto hunk above:

This doesn't look like it's used anywhere; otherwise the changes look good here.

@blightbow (Contributor, Author) replied:

Fixed, and resynced against master.

blightbow and others added 2 commits December 15, 2025 18:16
The MinP field was added to PredictOptions but is not populated by the
Go frontend/API. The MLX backend uses getattr with a default value,
so it works without the proto field.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Signed-off-by: Blightbow <[email protected]>
@mudler (Owner) left a comment:

Thank you. I will try to test this; it would be easier for everyone to test on development/master and pick it up from there.

@mudler merged commit 67baf66 into mudler:master on Dec 16, 2025; 30 checks passed.

Linked issue: cache corruption in mlx backend
