
Conversation

@chindris-mihai-alexandru

Summary

This PR adds support for running DeepResearch on Apple Silicon Macs (M1/M2/M3/M4) using Apple's MLX framework instead of CUDA/vLLM.

Changes

  • inference/run_mlx_react.py: New React agent runner that uses MLX-lm server's OpenAI-compatible API
  • inference/run_mlx_infer.sh: Shell script to start MLX server and run inference
  • inference/test_mlx_connection.py: Test script to verify MLX server connectivity
  • .env.example: Updated with MLX configuration options

Technical Details

  • Uses the 4-bit quantized model abalogh/Tongyi-DeepResearch-30B-A3B-4bit (~17GB, fits in 32GB RAM)
  • MLX-lm provides an OpenAI-compatible server at /v1/chat/completions (a request sketch follows this list)
  • Graceful fallback for tools that have Python 3.14 compatibility issues
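
As a rough illustration of how that endpoint is exercised, the snippet below sends a minimal chat-completions request. This is a hedged sketch, not the contents of test_mlx_connection.py; the port and model name are taken from the Usage section further down.

import requests

# Minimal smoke test against the MLX-lm OpenAI-compatible server (assumes port 8080).
payload = {
    "model": "abalogh/Tongyi-DeepResearch-30B-A3B-4bit",
    "messages": [{"role": "user", "content": "Reply with the single word: pong"}],
    "max_tokens": 16,
    "temperature": 0.0,
}
resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])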

Requirements

pip install mlx-lm

Usage

# Start MLX server and run inference
bash inference/run_mlx_infer.sh

# Or manually:
# Terminal 1: Start server
mlx_lm.server --model abalogh/Tongyi-DeepResearch-30B-A3B-4bit --port 8080 --trust-remote-code

# Terminal 2: Run inference
python inference/run_mlx_react.py --dataset eval_data/sample_questions.jsonl --output ./outputs

Testing

Tested on M1 Max with 32GB RAM:

  • Model loads in ~20 seconds
  • Inference works correctly with <think> reasoning blocks (a parsing sketch follows this list)
  • Tools (search, visit, google_scholar) are called appropriately
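
For illustration, a completion containing <think> reasoning and <tool_call> tags can be split roughly as follows. This is a hedged sketch; the actual parsing in run_mlx_react.py may differ.

import re

def split_completion(text):
    # Pull out reasoning and tool-call payloads; whatever remains is the visible answer.
    thoughts = re.findall(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    tool_calls = re.findall(r"<tool_call>(.*?)</tool_call>", text, flags=re.DOTALL)
    answer = re.sub(r"<think>.*?</think>|<tool_call>.*?</tool_call>", "", text, flags=re.DOTALL).strip()
    return thoughts, tool_calls, answer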

Related

This complements PR #219, which adds Exa.ai for web search.

Exa.ai provides AI-native neural search, which offers significant advantages for research agents:

- Semantic understanding: Finds relevant results based on meaning, not just keyword matching
- Query optimization: Built-in autoprompt improves query quality
- Direct content retrieval: Can fetch full page text in a single call
- Better for complex queries: Neural embeddings excel at nuanced research questions

PR #219 also simplifies the codebase by removing the dual search provider system and standardizing on Exa.ai. Other changes in that PR:

- Add category parameter to filter results (research paper, news, github, etc.)
- Add AI-generated highlights for better content extraction
- Include author information in search results
- Document all available Exa categories in docstrings

A sketch of such an Exa query follows.
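
Here is a minimal sketch of what such a query might look like, assuming the exa-py SDK's Exa.search_and_contents interface; the query string and parameter choices are illustrative assumptions, not code from PR #219.

import os
from exa_py import Exa  # exa-py SDK; treat the exact signature below as an assumption

exa = Exa(api_key=os.environ["EXA_API_KEY"])

# Neural search with autoprompt, filtered to research papers, returning page text
# and highlights in a single call.
response = exa.search_and_contents(
    "tool-using LLM research agents",
    use_autoprompt=True,
    category="research paper",
    num_results=5,
    text=True,
    highlights=True,
)
for result in response.results:
    print(result.title, result.url, getattr(result, "author", None))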

Commits

- Add run_mlx_react.py: React agent runner using MLX-lm server
- Add run_mlx_infer.sh: Shell script to start MLX server and run inference
- Add test_mlx_connection.py: Test script to verify MLX server connectivity
- Update .env.example with MLX configuration options

Enables running DeepResearch on Apple Silicon Macs (M1/M2/M3/M4) using
the MLX framework instead of CUDA/vLLM. Uses the 4-bit quantized model
(abalogh/Tongyi-DeepResearch-30B-A3B-4bit, ~17GB) which fits in 32GB RAM.

Tested on M1 Max with 32GB RAM - model loads and inference works correctly.

…ndling

The MLX OpenAI-compatible server was not applying the chat template
correctly, causing tool calls to fail. This change:

- Uses native mlx_lm.load() and mlx_lm.generate() Python API directly
- Builds chat prompts with proper Qwen format (<|im_start|>...<|im_end|>)
- Uses make_sampler() for temperature/top_p settings (new mlx-lm API)
- Removes server dependency - model is loaded directly in Python
- Adds test_mlx_tool_loop.py for debugging tool call issues
- Simplifies run_mlx_infer.sh (no server startup needed)

Tested successfully: the model now generates proper <tool_call> tags and the
agent loop executes search/visit tools correctly. A sketch of this
direct-generation flow follows.
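
To make the change concrete, here is a minimal sketch of that direct-generation path, assuming the current mlx-lm Python API (load, generate, make_sampler) and the Qwen chat markers named above. It is an illustration, not the code in run_mlx_react.py.

from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

# Load the quantized model directly; no server process is involved.
model, tokenizer = load("abalogh/Tongyi-DeepResearch-30B-A3B-4bit")

def build_qwen_prompt(system, user):
    # Qwen-style chat markers, as described in the commit message above.
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

sampler = make_sampler(temp=0.7, top_p=0.95)  # temperature/top_p via make_sampler()
prompt = build_qwen_prompt("You are a research agent.", "What is MLX?")
text = generate(model, tokenizer, prompt=prompt, max_tokens=512, sampler=sampler)
print(text)
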
… visit fallbacks

- Add loop detection to break infinite tool call cycles (see the sketch after this list)
- Track consecutive errors and force answer after 3 failures
- Inject reminder at round 5 to encourage timely conclusions
- Rewrite visit tool with raw content fallback when summarization unavailable
- Add explicit answer behavior guidelines to system prompt
- Create interactive CLI (interactive.py) for normal usage
- Simplify tool descriptions for clarity
- Copy tool_args before passing to tools to prevent mutation
- Remove unused imports (ThreadPoolExecutor, as_completed)
- Add URL validation to reject invalid URLs early
- Verified mlx-lm generate() API usage is correct
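
The guard logic in the first three bullets can be illustrated roughly as follows; the thresholds (3 consecutive errors, reminder at round 5) come from the list above, while the function and variable names are made up for the sketch.

# Illustrative agent-loop guards; names are hypothetical, thresholds match the bullets above.
MAX_CONSECUTIVE_ERRORS = 3
REMINDER_ROUND = 5

def should_force_answer(recent_calls, consecutive_errors):
    # Stop calling tools if the agent repeats the same call or keeps failing.
    looping = len(recent_calls) >= 3 and len(set(recent_calls[-3:])) == 1
    too_many_errors = consecutive_errors >= MAX_CONSECUTIVE_ERRORS
    return looping or too_many_errors

def maybe_inject_reminder(messages, round_idx):
    # At round 5, nudge the model toward wrapping up with a final answer.
    if round_idx == REMINDER_ROUND:
        messages.append({
            "role": "user",
            "content": "Reminder: wrap up soon and give your best final answer.",
        })
    return messages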

chindris-mihai-alexandru commented Nov 28, 2025

Closing in favor of PR #222 which uses llama.cpp instead of MLX.

llama.cpp is more mature and widely used than MLX, and doesn't have the chat template issues I encountered with MLX. The new PR provides the same local inference capability with better stability.
