
Conversation

@blightbow (Contributor) commented Dec 13, 2025

Description

This PR fixes #6747 by implementing a thread-safe LRU prompt cache. The implementation is ported from mlx_lm's server.py, which embeds the cache code inline and does not expose a portable interface. mlx_lm's policy is that caching implementations should not be part of the core package, so the best we can do is copy the reference implementation. It's not ideal, but it is what it is.

The third commit undoes a mistaken assumption that more than one MLX backend would use the caching implementation: mlx-vlm's generate/stream_generate functions don't accept the prompt_cache parameter, so the cache now lives under backend/python/mlx/ instead of backend/python/common/.

MLX Backend Enhancements

Commits (4)

  1. feat(mlx): add thread-safe LRU prompt cache
  2. feat(mlx): add min_p and top_k sampler support
  3. refactor(mlx): move mlx_cache.py from common to mlx backend
  4. test(mlx): add comprehensive cache tests and document upstream behavior

Features

Thread-Safe LRU Prompt Cache (mlx_cache.py)

Ported from https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/server.py (MIT License, Copyright 2023-2024 Apple Inc.) with thread-safety additions for LocalAI's gRPC backend.

  • Trie-based prefix matching for KV cache reuse (exact, shorter prefix, longer prefix with trim)
  • Thread-safe via threading.Lock for gRPC concurrency
  • LRU eviction when max entries exceeded
  • Configurable via max_cache_entries and max_kv_size options
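
For orientation, the following is a minimal sketch of the shape such a cache takes, not the ported code: it swaps the trie for a flat dict with a linear prefix scan, and the class and method names (TinyLRUPromptCache, fetch, insert) are illustrative only. It shows the three ingredients the bullets above describe: a lock around every operation, longest-common-prefix lookup, and LRU eviction once max_cache_entries is exceeded.

```python
# Illustrative sketch only: a much-simplified stand-in for the ported
# ThreadSafeLRUPromptCache (flat dict + linear scan instead of a trie).
import threading
from collections import OrderedDict


def _common_prefix_len(a: tuple, b: tuple) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class TinyLRUPromptCache:
    def __init__(self, max_entries: int = 10):
        self._lock = threading.Lock()            # gRPC handlers run concurrently
        self._entries = OrderedDict()            # (model_key, tokens) -> kv_state
        self._max_entries = max_entries

    def fetch(self, model_key, tokens):
        """Return (kv_state, matched_len) for the longest cached common prefix."""
        with self._lock:
            best_key, best_len = None, 0
            for key in self._entries:
                mk, cached_tokens = key
                if mk != model_key:              # namespace entries per model
                    continue
                n = _common_prefix_len(cached_tokens, tuple(tokens))
                if n > best_len:
                    best_key, best_len = key, n
            if best_key is None:
                return None, 0
            self._entries.move_to_end(best_key)  # mark as most recently used
            # The real port deep-copies / reference-counts here; omitted for brevity.
            return self._entries[best_key], best_len

    def insert(self, model_key, tokens, kv_state):
        with self._lock:
            key = (model_key, tuple(tokens))
            self._entries[key] = kv_state
            self._entries.move_to_end(key)
            while len(self._entries) > self._max_entries:
                self._entries.popitem(last=False)  # evict least recently used
```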

Sampler Support (backend.py)

  • Adds min_p and top_k sampling parameters
  • Adds XTC sampling support (xtc_threshold, xtc_probability)
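
As a rough sketch of how these parameters can be forwarded to mlx_lm: the snippet below is an assumption, not the PR's backend.py. It assumes a recent mlx_lm whose make_sampler accepts min_p, top_k, and the XTC keyword arguments, and the field names (Temperature, TopP, MinP, TopK, XTCThreshold, XTCProbability) are taken from the bullets above rather than verified against this PR's proto.

```python
# Hypothetical wiring of PredictOptions fields into an mlx_lm sampler.
# getattr defaults mean the code still works if a field is absent from the proto.
from mlx_lm.sample_utils import make_sampler


def build_sampler(request):
    kwargs = {"temp": getattr(request, "Temperature", 0.0) or 0.0}
    if getattr(request, "TopP", 0.0):
        kwargs["top_p"] = request.TopP
    if getattr(request, "MinP", 0.0):
        kwargs["min_p"] = request.MinP
    if getattr(request, "TopK", 0):
        kwargs["top_k"] = int(request.TopK)
    # XTC sampling: only forward if requested (and only supported by newer mlx_lm).
    if getattr(request, "XTCProbability", 0.0):
        kwargs["xtc_probability"] = request.XTCProbability
        kwargs["xtc_threshold"] = getattr(request, "XTCThreshold", 0.0)
    return make_sampler(**kwargs)
```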

Test Coverage

Comprehensive unit tests (test_mlx_cache.py - 23 tests):

  • All cache operation modes: exact match, shorter prefix, longer prefix trim, no match
  • LRU eviction and access order updates
  • Reference counting and deep copy behavior
  • Multi-model namespacing
  • Thread safety with data integrity verification

Integration tests (test.py):

  • Cache reuse, prefix cache reuse, concurrent requests
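
The sketch below gives the flavor of these tests using the simplified TinyLRUPromptCache from the earlier sketch; it is a pytest-style illustration, not the actual test_mlx_cache.py or test.py code.

```python
# Illustrative pytest-style checks against the simplified cache sketch above.
import threading


def test_shorter_prefix_is_reused():
    cache = TinyLRUPromptCache(max_entries=2)
    cache.insert("model-a", [1, 2, 3], kv_state="kv-123")
    state, matched = cache.fetch("model-a", [1, 2, 3, 4, 5])
    assert state == "kv-123" and matched == 3      # reuse 3 tokens, extend by 2


def test_concurrent_fetch_insert_keeps_lru_bound():
    cache = TinyLRUPromptCache(max_entries=8)

    def worker(i):
        cache.insert("model-a", [i, i + 1], kv_state=f"kv-{i}")
        cache.fetch("model-a", [i, i + 1, i + 2])

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(16)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Peeks at the sketch's internals: eviction must keep the LRU bound intact.
    assert len(cache._entries) <= 8
```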

Files Changed

backend/python/mlx/backend.py | 147 ++++++++---
backend/python/mlx/mlx_cache.py | 266 +++++++++++++++++++ (new)
backend/python/mlx/test.py | 140 ++++++++--
backend/python/mlx/test_mlx_cache.py | 480 +++++++++++++++++++++++++++++++++++ (new)

Notes for Reviewers

Signed commits

  • Yes, I signed my commits.

netlify bot commented Dec 13, 2025

Deploy Preview for localai ready!

🔨 Latest commit: 60c3b35
🔍 Latest deploy log: https://app.netlify.com/projects/localai/deploys/69409b519e242100075ae0d3
😎 Deploy Preview: https://deploy-preview-7556--localai.netlify.app

blightbow and others added 4 commits December 14, 2025 00:40
Port mlx-lm's LRUPromptCache to fix race condition where concurrent
requests corrupt shared KV cache state. The previous implementation
used a single prompt_cache instance shared across all requests.

Changes:
- Add backend/python/common/mlx_cache.py with ThreadSafeLRUPromptCache
- Modify backend.py to use per-request cache isolation via fetch/insert
- Add prefix matching for cache reuse across similar prompts
- Add LRU eviction (default 10 entries, configurable)
- Add concurrency and cache unit tests

The cache uses a trie-based structure for efficient prefix matching,
allowing prompts that share common prefixes to reuse cached KV states.
Thread safety is provided via threading.Lock.

New configuration options:
- max_cache_entries: Maximum LRU cache entries (default: 10)
- max_kv_size: Maximum KV cache size per entry (default: None)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Signed-off-by: Blightbow <[email protected]>
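
For context, the per-request fetch/insert flow described in this commit looks roughly like the sketch below. It is a hedged illustration, not backend.py: the function and variable names are mine, and the deep-copy/reference-counting and prefix-trimming that the real port performs are omitted. It assumes mlx_lm's stream_generate and make_prompt_cache, which accept a prompt_cache as noted later in this thread.

```python
# Hypothetical per-request flow: fetch the longest cached prefix, generate with
# mlx_lm, then re-insert the updated KV state keyed by the full prompt.
from mlx_lm import stream_generate
from mlx_lm.models.cache import make_prompt_cache


def generate_with_cache(model, tokenizer, model_key, tokens, cache, sampler):
    kv_state, matched = cache.fetch(model_key, tokens)
    if kv_state is None:
        kv_state = make_prompt_cache(model)   # no cached prefix: fresh KV cache
        remaining = tokens
    else:
        # Only feed the unseen suffix (the real port also handles the
        # exact-match case by trimming the cache; omitted here).
        remaining = tokens[matched:]
    text = ""
    for chunk in stream_generate(model, tokenizer, remaining,
                                 prompt_cache=kv_state, sampler=sampler):
        text += chunk.text
    cache.insert(model_key, tokens, kv_state)  # next matching prompt reuses this
    return text
```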
Add MinP field to proto (field 52) following the precedent set by
other non-OpenAI sampling parameters like TopK, TailFreeSamplingZ,
TypicalP, and Mirostat.

Changes:
- backend.proto: Add float MinP field for min-p sampling
- backend.py: Extract and pass min_p and top_k to mlx_lm sampler
  (top_k was in proto but not being passed)
- test.py: Fix test_sampling_params to use valid proto fields and
  switch to MLX-compatible model (mlx-community/Llama-3.2-1B-Instruct)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Signed-off-by: Blightbow <[email protected]>
The ThreadSafeLRUPromptCache is only used by the mlx backend. After
evaluating mlx-vlm, it was determined that the cache cannot be shared
because mlx-vlm's generate/stream_generate functions don't support
the prompt_cache parameter that mlx_lm provides.

- Move mlx_cache.py from backend/python/common/ to backend/python/mlx/
- Remove sys.path manipulation from backend.py and test.py
- Fix test assertion to expect "MLX model loaded successfully"

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Signed-off-by: Blightbow <[email protected]>
Added comprehensive unit tests (test_mlx_cache.py) covering all cache
operation modes:
- Exact match
- Shorter prefix match
- Longer prefix match with trimming
- No match scenarios
- LRU eviction and access order
- Reference counting and deep copy behavior
- Multi-model namespacing
- Thread safety with data integrity verification

Documents upstream mlx_lm/server.py behavior: single-token prefixes are
deliberately not matched (uses > 0, not >= 0) to allow longer cached
sequences to be preferred for trimming. This is acceptable because real
prompts with chat templates are always many tokens.

Removed weak unit tests from test.py that only verified "no exception
thrown" rather than correctness.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Signed-off-by: Blightbow <[email protected]>
@blightbow (Contributor, Author) commented:

Force pushed to address DCO.

@blightbow changed the title from "MLX backend cache rework" to "feat(mlx): add thread-safe LRU prompt cache and min_p/top_k sampling" on Dec 14, 2025
string ToolChoice = 49; // JSON string or object specifying tool choice behavior
int32 Logprobs = 50; // Number of top logprobs to return (maps to OpenAI logprobs parameter)
int32 TopLogprobs = 51; // Number of top logprobs to return per token (maps to OpenAI top_logprobs parameter)
float MinP = 52; // Min-p sampling: minimum probability threshold scaled by top token probability
@mudler (Owner) commented on the backend.proto hunk above:

This doesn't look like it's used anywhere; otherwise the changes look good here.

@blightbow (Contributor, Author) replied:

Fixed, and resynced against master.

blightbow and others added 2 commits December 15, 2025 18:16
The MinP field was added to PredictOptions but is not populated by the
Go frontend/API. The MLX backend uses getattr with a default value,
so it works without the proto field.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Signed-off-by: Blightbow <[email protected]>
@mudler (Owner) left a comment:

Thank you. I will try to test this; it would be easier for everyone to test on development/master and pick it up from there.

@mudler merged commit 67baf66 into mudler:master on Dec 16, 2025; 30 checks passed.

Linked issue: cache corruption in mlx backend
