feat(mlx): add thread-safe LRU prompt cache and min_p/top_k sampling #7556
Conversation
Port mlx-lm's LRUPromptCache to fix a race condition where concurrent requests corrupt shared KV cache state. The previous implementation used a single prompt_cache instance shared across all requests.

Changes:
- Add backend/python/common/mlx_cache.py with ThreadSafeLRUPromptCache
- Modify backend.py to use per-request cache isolation via fetch/insert
- Add prefix matching for cache reuse across similar prompts
- Add LRU eviction (default 10 entries, configurable)
- Add concurrency and cache unit tests

The cache uses a trie-based structure for efficient prefix matching, allowing prompts that share common prefixes to reuse cached KV states. Thread safety is provided via threading.Lock.

New configuration options:
- max_cache_entries: Maximum LRU cache entries (default: 10)
- max_kv_size: Maximum KV cache size per entry (default: None)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <[email protected]>
Signed-off-by: Blightbow <[email protected]>
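For orientation, here is a minimal sketch of the fetch/insert pattern described above. It is not the code in mlx_cache.py: the class name SimpleLRUPromptCache, the method signatures, and the linear prefix scan (instead of the trie) are simplifications for illustration.

```python
# Minimal sketch of the per-request fetch/insert pattern. The real
# ThreadSafeLRUPromptCache lives in backend/python/mlx/mlx_cache.py and uses a
# trie for prefix matching; this version uses a linear scan to stay short.
import copy
import threading
from collections import OrderedDict


class SimpleLRUPromptCache:
    def __init__(self, max_entries=10):
        self.max_entries = max_entries
        self._entries = OrderedDict()  # (model_key, tuple(tokens)) -> KV state
        self._lock = threading.Lock()

    def fetch(self, model_key, tokens):
        """Return (kv_state, remaining_tokens) for the longest cached prefix, or (None, tokens)."""
        with self._lock:
            best_key, best_len = None, 0
            for (mk, cached_tokens) in self._entries:
                n = len(cached_tokens)
                if mk == model_key and n <= len(tokens) and n > best_len \
                        and tuple(tokens[:n]) == cached_tokens:
                    best_key, best_len = (mk, cached_tokens), n
            if best_key is None:
                return None, list(tokens)
            self._entries.move_to_end(best_key)                 # mark as most recently used
            kv_state = copy.deepcopy(self._entries[best_key])   # per-request copy, no shared state
            return kv_state, list(tokens[best_len:])

    def insert(self, model_key, tokens, kv_state):
        """Store the KV state for a processed prompt, evicting the LRU entry when full."""
        with self._lock:
            key = (model_key, tuple(tokens))
            self._entries[key] = kv_state
            self._entries.move_to_end(key)
            while len(self._entries) > self.max_entries:
                self._entries.popitem(last=False)
```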
Add MinP field to proto (field 52) following the precedent set by other non-OpenAI sampling parameters like TopK, TailFreeSamplingZ, TypicalP, and Mirostat.

Changes:
- backend.proto: Add float MinP field for min-p sampling
- backend.py: Extract and pass min_p and top_k to mlx_lm sampler (top_k was in proto but not being passed)
- test.py: Fix test_sampling_params to use valid proto fields and switch to MLX-compatible model (mlx-community/Llama-3.2-1B-Instruct)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <[email protected]>
Signed-off-by: Blightbow <[email protected]>
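A rough illustration of the sampler wiring, not the exact backend.py code: it assumes mlx_lm.sample_utils.make_sampler accepts temp, top_p, min_p, and top_k keyword arguments (as in recent mlx-lm releases), and the fallback values shown are assumptions about what leaves each filter disabled.

```python
# Illustrative only: shows the shape of passing min_p/top_k through to an
# mlx_lm sampler. Field names on `request` mirror the proto; the surrounding
# backend.py code is simplified away.
from mlx_lm.sample_utils import make_sampler

def build_sampler(request):
    temp = request.Temperature if request.Temperature > 0 else 0.0
    top_p = request.TopP if request.TopP > 0 else 0.0     # 0.0 leaves top-p filtering off
    min_p = getattr(request, "MinP", 0.0) or 0.0          # getattr fallback: works without the proto field
    top_k = request.TopK if request.TopK > 0 else -1      # <= 0 leaves top-k filtering off
    return make_sampler(temp=temp, top_p=top_p, min_p=min_p, top_k=top_k)
```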
The ThreadSafeLRUPromptCache is only used by the mlx backend. After evaluating mlx-vlm, it was determined that the cache cannot be shared because mlx-vlm's generate/stream_generate functions don't support the prompt_cache parameter that mlx_lm provides.

- Move mlx_cache.py from backend/python/common/ to backend/python/mlx/
- Remove sys.path manipulation from backend.py and test.py
- Fix test assertion to expect "MLX model loaded successfully"

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <[email protected]>
Signed-off-by: Blightbow <[email protected]>
Added comprehensive unit tests (test_mlx_cache.py) covering all cache operation modes:
- Exact match
- Shorter prefix match
- Longer prefix match with trimming
- No match scenarios
- LRU eviction and access order
- Reference counting and deep copy behavior
- Multi-model namespacing
- Thread safety with data integrity verification

Documents upstream mlx_lm/server.py behavior: single-token prefixes are deliberately not matched (uses > 0, not >= 0) to allow longer cached sequences to be preferred for trimming. This is acceptable because real prompts with chat templates are always many tokens.

Removed weak unit tests from test.py that only verified "no exception thrown" rather than correctness.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <[email protected]>
Signed-off-by: Blightbow <[email protected]>
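A hedged sketch of the thread-safety/data-integrity style of test described above. It exercises the simplified SimpleLRUPromptCache from the earlier sketch, not the real ThreadSafeLRUPromptCache API targeted by test_mlx_cache.py.

```python
# Concurrency test sketch: many threads insert and fetch distinct prompts and
# verify each fetch returns exactly the state inserted for that prompt.
# SimpleLRUPromptCache is the illustrative class from the earlier sketch.
import threading

def test_concurrent_fetch_insert_integrity():
    cache = SimpleLRUPromptCache(max_entries=64)
    errors = []

    def worker(worker_id):
        tokens = [worker_id, worker_id + 1, worker_id + 2]
        cache.insert("model-a", tokens, {"owner": worker_id})
        state, remaining = cache.fetch("model-a", tokens)
        # A hit must return this worker's own state and no remaining tokens.
        if state is not None and (state["owner"] != worker_id or remaining != []):
            errors.append(worker_id)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(32)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    assert errors == []
```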
Force pushed to address DCO.
backend/backend.proto (Outdated)

string ToolChoice = 49; // JSON string or object specifying tool choice behavior
int32 Logprobs = 50; // Number of top logprobs to return (maps to OpenAI logprobs parameter)
int32 TopLogprobs = 51; // Number of top logprobs to return per token (maps to OpenAI top_logprobs parameter)
float MinP = 52; // Min-p sampling: minimum probability threshold scaled by top token probability
This doesn't look like it's used anywhere; otherwise the changes look good here.
Fixed, and resynced against master.
The MinP field was added to PredictOptions but is not populated by the Go frontend/API. The MLX backend uses getattr with a default value, so it works without the proto field.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <[email protected]>
Signed-off-by: Blightbow <[email protected]>
Thank you. I will try to test this; it would be easier for everyone to test on development/master and pick it up from there.
Description
This PR fixes #6747 by implementing a thread-safe LRU prompt cache. The implementation is ported from mlx_lm's server.py, which embeds the cache logic inline and does not expose a portable interface. mlx_lm's policy is that caching implementations should not be part of the core package, so the best we can do is copy the reference implementation. It's not ideal, but it is what it is.
The third commit undoes the mistaken assumption that more than one MLX backend would use the caching implementation. mlx-vlm can't use it because its generate/stream_generate functions don't accept the prompt_cache parameter that mlx_lm provides, so mlx_cache.py was moved from backend/python/common/ into backend/python/mlx/.
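To make the integration point concrete, here is a hedged sketch of where the cache plugs into generation. It assumes mlx_lm's stream_generate accepts pre-tokenized prompts and a prompt_cache argument and that make_prompt_cache lives in mlx_lm.models.cache (true for recent mlx-lm releases); the cache object and its fetch/insert methods follow the simplified sketch earlier, not the exact backend.py code.

```python
# Sketch only: mlx_lm's stream_generate takes a prompt_cache argument, which is
# why the LRU cache stays mlx-only (mlx-vlm's equivalents do not accept it).
from mlx_lm import stream_generate
from mlx_lm.models.cache import make_prompt_cache

def generate_with_cache(model, tokenizer, model_key, prompt_tokens, cache, sampler):
    kv_state, remaining = cache.fetch(model_key, prompt_tokens)
    if kv_state is None:
        # Cold start: build a fresh KV cache and process the full prompt.
        kv_state = make_prompt_cache(model)
        remaining = prompt_tokens
    text = ""
    for chunk in stream_generate(
        model,
        tokenizer,
        remaining,
        sampler=sampler,
        prompt_cache=kv_state,  # per-request copy, so concurrent requests never share state
    ):
        text += chunk.text
    # Hand the updated KV state back to the LRU so later prompts sharing this prefix can reuse it.
    cache.insert(model_key, prompt_tokens, kv_state)
    return text
```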
MLX Backend Enhancements
Commits (4)
Features
Thread-Safe LRU Prompt Cache (mlx_cache.py)
Ported from https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/server.py (MIT License, Copyright 2023-2024 Apple Inc.) with thread-safety additions for LocalAI's gRPC backend.
Sampler Support (backend.py)
Test Coverage
Comprehensive unit tests (test_mlx_cache.py, 23 tests): exact match, shorter/longer prefix matches with trimming, no-match scenarios, LRU eviction and access order, reference counting and deep-copy behavior, multi-model namespacing, and thread safety with data integrity verification.
Integration tests (test.py): model loading and sampling-parameter tests against mlx-community/Llama-3.2-1B-Instruct.
Files Changed
backend/python/mlx/backend.py | 147 ++++++++---
backend/python/mlx/mlx_cache.py | 266 +++++++++++++++++++ (new)
backend/python/mlx/test.py | 140 ++++++++--
backend/python/mlx/test_mlx_cache.py | 480 +++++++++++++++++++++++++++++++++++ (new)
Notes for Reviewers
Signed commits