Add MLX support for Apple Silicon inference #220
Closed
chindris-mihai-alexandru wants to merge 9 commits into Alibaba-NLP:main from chindris-mihai-alexandru:feature/mlx-apple-silicon
Conversation
Exa.ai provides AI-native neural search, which offers significant advantages for research agents:
- Semantic understanding: finds relevant results based on meaning, not just keyword matching
- Query optimization: built-in autoprompt improves query quality
- Direct content retrieval: can fetch full page text in a single call
- Better for complex queries: neural embeddings excel at nuanced research questions
This change simplifies the codebase by removing the dual search provider system and standardizing on Exa.ai.
- Add category parameter to filter results (research paper, news, github, etc.)
- Add AI-generated highlights for better content extraction
- Include author information in search results
- Document all available Exa categories in docstrings
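For reference, a minimal sketch of what these Exa.ai calls might look like with the exa_py client; the query, category value, and exact parameters are illustrative assumptions, not the PR's actual code:

```python
# Hedged sketch of an Exa.ai neural search with full-text and highlights.
from exa_py import Exa

exa = Exa(api_key="EXA_API_KEY")  # read from the environment in real code

results = exa.search_and_contents(
    "recent advances in retrieval-augmented generation",  # illustrative query
    type="neural",              # semantic search rather than keyword matching
    use_autoprompt=True,        # built-in query optimization
    category="research paper",  # filter results by category
    num_results=5,
    text=True,                  # fetch full page text in the same call
    highlights=True,            # AI-generated highlights for extraction
)

for r in results.results:
    print(r.title, r.author, r.url)
```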
- Add run_mlx_react.py: React agent runner using MLX-lm server
- Add run_mlx_infer.sh: Shell script to start MLX server and run inference
- Add test_mlx_connection.py: Test script to verify MLX server connectivity
- Update .env.example with MLX configuration options
Enables running DeepResearch on Apple Silicon Macs (M1/M2/M3/M4) using the MLX framework instead of CUDA/vLLM. Uses the 4-bit quantized model (abalogh/Tongyi-DeepResearch-30B-A3B-4bit, ~17GB), which fits in 32GB RAM. Tested on M1 Max with 32GB RAM; the model loads and inference works correctly.
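As a rough illustration of the server-based setup this commit describes, a connectivity check along the lines of test_mlx_connection.py might look like the sketch below; the port and payload are assumptions, not the PR's actual values:

```python
# Hedged sketch: minimal connectivity check against the MLX-lm server's
# OpenAI-compatible chat endpoint (port 8080 is an assumed default).
import requests

MLX_SERVER_URL = "http://localhost:8080/v1/chat/completions"

payload = {
    "messages": [{"role": "user", "content": "ping"}],
    "max_tokens": 8,
}

try:
    resp = requests.post(MLX_SERVER_URL, json=payload, timeout=60)
    resp.raise_for_status()
    print("MLX server reachable:", resp.json()["choices"][0]["message"]["content"])
except requests.RequestException as exc:
    print("MLX server not reachable:", exc)
```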
…ndling
The MLX OpenAI-compatible server was not applying the chat template correctly, causing tool calls to fail. This change:
- Uses native mlx_lm.load() and mlx_lm.generate() Python API directly
- Builds chat prompts with proper Qwen format (<|im_start|>...<|im_end|>)
- Uses make_sampler() for temperature/top_p settings (new mlx-lm API)
- Removes the server dependency: the model is loaded directly in Python
- Adds test_mlx_tool_loop.py for debugging tool call issues
- Simplifies run_mlx_infer.sh (no server startup needed)
Tested successfully: the model now generates proper <tool_call> tags and the agent loop executes the search/visit tools correctly.
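A minimal sketch of the direct mlx-lm path this commit describes, assuming a recent mlx-lm release where make_sampler lives in mlx_lm.sample_utils; the prompt contents and sampling values are illustrative, not the PR's exact settings:

```python
# Hedged sketch: load the quantized model and generate with a manually
# built Qwen-style chat prompt, bypassing the server's chat template.
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("abalogh/Tongyi-DeepResearch-30B-A3B-4bit")

# Build the chat prompt with explicit <|im_start|>/<|im_end|> markers.
messages = [
    {"role": "system", "content": "You are a helpful research agent."},  # illustrative
    {"role": "user", "content": "What is MLX?"},
]
prompt = ""
for m in messages:
    prompt += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
prompt += "<|im_start|>assistant\n"

# Newer mlx-lm versions take a sampler instead of temp/top_p keyword arguments.
sampler = make_sampler(temp=0.6, top_p=0.95)
text = generate(model, tokenizer, prompt=prompt, max_tokens=512, sampler=sampler)
print(text)
```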
…g, and timeout protection
… visit fallbacks
- Add loop detection to break infinite tool call cycles
- Track consecutive errors and force answer after 3 failures
- Inject reminder at round 5 to encourage timely conclusions
- Rewrite visit tool with raw content fallback when summarization unavailable
- Add explicit answer behavior guidelines to system prompt
- Create interactive CLI (interactive.py) for normal usage
- Simplify tool descriptions for clarity
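The guard logic these bullets describe could be sketched roughly as follows; the round-5 reminder and the 3-failure limit come from the commit message, while the loop-detection threshold, function names, and surrounding agent plumbing are assumptions:

```python
# Hedged sketch of the agent-loop guards: loop detection, consecutive-error
# limit, and a round-5 reminder. call_model and call_tool are placeholders.
MAX_ROUNDS = 10

def run_guarded_loop(call_model, call_tool):
    """call_model(history) -> (tool_name, tool_args, final_answer_or_None);
    call_tool(name, args) -> str. Both stand in for the real agent pieces."""
    history = []
    recent_calls = []
    consecutive_errors = 0

    for round_idx in range(1, MAX_ROUNDS + 1):
        if round_idx == 5:
            # Reminder injected at round 5 to encourage a timely conclusion.
            history.append({"role": "user",
                            "content": "Please start wrapping up and answer soon."})

        tool_name, tool_args, answer = call_model(history)
        if answer is not None:
            return answer

        # Loop detection: break cycles of identical repeated tool calls.
        signature = (tool_name, repr(tool_args))
        if recent_calls.count(signature) >= 2:
            history.append({"role": "user",
                            "content": "You are repeating the same tool call; answer now."})
            recent_calls.clear()
            continue
        recent_calls = (recent_calls + [signature])[-6:]

        try:
            result = call_tool(tool_name, tool_args)
            consecutive_errors = 0
        except Exception as exc:  # track consecutive tool failures
            consecutive_errors += 1
            result = f"Tool error: {exc}"
            if consecutive_errors >= 3:
                # Force a final answer after 3 consecutive failures.
                history.append({"role": "user",
                                "content": "Tools keep failing; give your best answer now."})
                _, _, forced = call_model(history)
                return forced or "Unable to complete the task."

        history.append({"role": "tool", "content": str(result)})

    return "No final answer produced within the round limit."
```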
- Copy tool_args before passing to tools to prevent mutation
- Remove unused imports (ThreadPoolExecutor, as_completed)
- Add URL validation to reject invalid URLs early
- Verified mlx-lm generate() API usage is correct
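A small sketch of the defensive fixes listed here; the helper names and exact validation rules are assumptions:

```python
# Hedged sketch: copy tool arguments before dispatch and validate URLs early.
import copy
from urllib.parse import urlparse

def safe_call_tool(tool, tool_args: dict):
    # Deep-copy tool_args so the tool cannot mutate the caller's dictionary.
    args = copy.deepcopy(tool_args)
    return tool(**args)

def is_valid_url(url: str) -> bool:
    # Reject obviously invalid URLs before handing them to the visit tool.
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```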
Author
Closing in favor of PR #222, which uses llama.cpp instead of MLX. llama.cpp is more mature and more widely used than MLX, and doesn't have the chat template issues I encountered with MLX. The new PR provides the same local inference capability with better stability.
Summary
This PR adds support for running DeepResearch on Apple Silicon Macs (M1/M2/M3/M4) using Apple's MLX framework instead of CUDA/vLLM.
Changes
- inference/run_mlx_react.py: New React agent runner that uses the MLX-lm server's OpenAI-compatible API
- inference/run_mlx_infer.sh: Shell script to start the MLX server and run inference
- inference/test_mlx_connection.py: Test script to verify MLX server connectivity
- .env.example: Updated with MLX configuration options
Technical Details
- Model: abalogh/Tongyi-DeepResearch-30B-A3B-4bit (~17GB, fits in 32GB RAM)
- Endpoint: OpenAI-compatible /v1/chat/completions
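For illustration, a client call against the server's OpenAI-compatible endpoint might look like the following sketch; the port and the model value passed to the client are assumptions (the server is started with the model above):

```python
# Hedged sketch: query the local MLX-lm server through the standard OpenAI client.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="abalogh/Tongyi-DeepResearch-30B-A3B-4bit",
    messages=[{"role": "user", "content": "Hello from Apple Silicon"}],
    max_tokens=64,
    temperature=0.6,
)
print(resp.choices[0].message.content)
```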
Requirements
Usage
Testing
Tested on M1 Max with 32GB RAM:
- Model produces <think> reasoning blocks
Related
This complements PR #219 which adds Exa.ai for web search.