
Conversation

@chindris-mihai-alexandru

Summary

This PR adds support for running DeepResearch on Apple Silicon Macs (M1/M2/M3/M4) using Apple's MLX framework instead of CUDA/vLLM.

Changes

  • inference/run_mlx_react.py: New React agent runner that uses MLX-lm server's OpenAI-compatible API
  • inference/run_mlx_infer.sh: Shell script to start MLX server and run inference
  • inference/test_mlx_connection.py: Test script to verify MLX server connectivity
  • .env.example: Updated with MLX configuration options

Technical Details

  • Uses the 4-bit quantized model abalogh/Tongyi-DeepResearch-30B-A3B-4bit (~17GB, fits in 32GB RAM)
  • MLX-lm provides an OpenAI-compatible server at /v1/chat/completions (a request sketch follows this list)
  • Graceful fallback for tools that have Python 3.14 compatibility issues
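
As a rough illustration of how that endpoint is exercised, the snippet below sends a minimal chat-completions request. This is a hedged sketch, not the contents of test_mlx_connection.py; the port and model name are taken from the Usage section further down.

import requests

# Minimal smoke test against the MLX-lm OpenAI-compatible server (assumes port 8080).
payload = {
    "model": "abalogh/Tongyi-DeepResearch-30B-A3B-4bit",
    "messages": [{"role": "user", "content": "Reply with the single word: pong"}],
    "max_tokens": 16,
    "temperature": 0.0,
}
resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])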

Requirements

pip install mlx-lm

Usage

# Start MLX server and run inference
bash inference/run_mlx_infer.sh

# Or manually:
# Terminal 1: Start server
mlx_lm.server --model abalogh/Tongyi-DeepResearch-30B-A3B-4bit --port 8080 --trust-remote-code

# Terminal 2: Run inference
python inference/run_mlx_react.py --dataset eval_data/sample_questions.jsonl --output ./outputs

Testing

Tested on M1 Max with 32GB RAM:

  • Model loads in ~20 seconds
  • Inference works correctly with <think> reasoning blocks (a parsing sketch follows this list)
  • Tools (search, visit, google_scholar) are called appropriately
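
For illustration, a completion containing <think> reasoning and <tool_call> tags can be split roughly as follows. This is a hedged sketch; the actual parsing in run_mlx_react.py may differ.

import re

def split_completion(text):
    # Pull out reasoning and tool-call payloads; whatever remains is the visible answer.
    thoughts = re.findall(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    tool_calls = re.findall(r"<tool_call>(.*?)</tool_call>", text, flags=re.DOTALL)
    answer = re.sub(r"<think>.*?</think>|<tool_call>.*?</tool_call>", "", text, flags=re.DOTALL).strip()
    return thoughts, tool_calls, answer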

Related

This complements PR #219, which adds Exa.ai for web search.

Exa.ai provides AI-native neural search, which offers significant advantages for research agents:

- Semantic understanding: Finds relevant results based on meaning, not just keyword matching
- Query optimization: Built-in autoprompt improves query quality
- Direct content retrieval: Can fetch full page text in a single call
- Better for complex queries: Neural embeddings excel at nuanced research questions

PR #219 also simplifies the codebase by removing the dual search provider system and standardizing on Exa.ai. Other changes in that PR:

- Add category parameter to filter results (research paper, news, github, etc.)
- Add AI-generated highlights for better content extraction
- Include author information in search results
- Document all available Exa categories in docstrings

A sketch of such an Exa query follows.
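
Here is a minimal sketch of what such a query might look like, assuming the exa-py SDK's Exa.search_and_contents interface; the query string and parameter choices are illustrative assumptions, not code from PR #219.

import os
from exa_py import Exa  # exa-py SDK; treat the exact signature below as an assumption

exa = Exa(api_key=os.environ["EXA_API_KEY"])

# Neural search with autoprompt, filtered to research papers, returning page text
# and highlights in a single call.
response = exa.search_and_contents(
    "tool-using LLM research agents",
    use_autoprompt=True,
    category="research paper",
    num_results=5,
    text=True,
    highlights=True,
)
for result in response.results:
    print(result.title, result.url, getattr(result, "author", None))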

Commits

- Add run_mlx_react.py: React agent runner using MLX-lm server
- Add run_mlx_infer.sh: Shell script to start MLX server and run inference
- Add test_mlx_connection.py: Test script to verify MLX server connectivity
- Update .env.example with MLX configuration options

Enables running DeepResearch on Apple Silicon Macs (M1/M2/M3/M4) using
the MLX framework instead of CUDA/vLLM. Uses the 4-bit quantized model
(abalogh/Tongyi-DeepResearch-30B-A3B-4bit, ~17GB) which fits in 32GB RAM.

Tested on M1 Max with 32GB RAM - model loads and inference works correctly.

…ndling

The MLX OpenAI-compatible server was not applying the chat template
correctly, causing tool calls to fail. This change:

- Uses native mlx_lm.load() and mlx_lm.generate() Python API directly
- Builds chat prompts with proper Qwen format (<|im_start|>...<|im_end|>)
- Uses make_sampler() for temperature/top_p settings (new mlx-lm API)
- Removes server dependency - model is loaded directly in Python
- Adds test_mlx_tool_loop.py for debugging tool call issues
- Simplifies run_mlx_infer.sh (no server startup needed)

Tested successfully: the model now generates proper <tool_call> tags and the
agent loop executes search/visit tools correctly. A sketch of this
direct-generation flow follows.
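
To make the change concrete, here is a minimal sketch of that direct-generation path, assuming the current mlx-lm Python API (load, generate, make_sampler) and the Qwen chat markers named above. It is an illustration, not the code in run_mlx_react.py.

from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

# Load the quantized model directly; no server process is involved.
model, tokenizer = load("abalogh/Tongyi-DeepResearch-30B-A3B-4bit")

def build_qwen_prompt(system, user):
    # Qwen-style chat markers, as described in the commit message above.
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

sampler = make_sampler(temp=0.7, top_p=0.95)  # temperature/top_p via make_sampler()
prompt = build_qwen_prompt("You are a research agent.", "What is MLX?")
text = generate(model, tokenizer, prompt=prompt, max_tokens=512, sampler=sampler)
print(text)
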
… visit fallbacks

- Add loop detection to break infinite tool call cycles (see the sketch after this list)
- Track consecutive errors and force answer after 3 failures
- Inject reminder at round 5 to encourage timely conclusions
- Rewrite visit tool with raw content fallback when summarization unavailable
- Add explicit answer behavior guidelines to system prompt
- Create interactive CLI (interactive.py) for normal usage
- Simplify tool descriptions for clarity
- Copy tool_args before passing to tools to prevent mutation
- Remove unused imports (ThreadPoolExecutor, as_completed)
- Add URL validation to reject invalid URLs early
- Verified mlx-lm generate() API usage is correct
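
The guard logic in the first three bullets can be illustrated roughly as follows; the thresholds (3 consecutive errors, reminder at round 5) come from the list above, while the function and variable names are made up for the sketch.

# Illustrative agent-loop guards; names are hypothetical, thresholds match the bullets above.
MAX_CONSECUTIVE_ERRORS = 3
REMINDER_ROUND = 5

def should_force_answer(recent_calls, consecutive_errors):
    # Stop calling tools if the agent repeats the same call or keeps failing.
    looping = len(recent_calls) >= 3 and len(set(recent_calls[-3:])) == 1
    too_many_errors = consecutive_errors >= MAX_CONSECUTIVE_ERRORS
    return looping or too_many_errors

def maybe_inject_reminder(messages, round_idx):
    # At round 5, nudge the model toward wrapping up with a final answer.
    if round_idx == REMINDER_ROUND:
        messages.append({
            "role": "user",
            "content": "Reminder: wrap up soon and give your best final answer.",
        })
    return messages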

chindris-mihai-alexandru commented Nov 28, 2025

Closing in favor of PR #222 which uses llama.cpp instead of MLX.

llama.cpp is more mature and widely used than MLX, and doesn't have the chat template issues I encountered with MLX. The new PR provides the same local inference capability with better stability.
