A Model Context Protocol server that exposes Apple's on-device AI stack — Foundation Models, Vision, Natural Language, Speech, and Sound Analysis — as 21 tools any MCP-speaking client can call (Claude Desktop, OpenAI, Gemini, Codex, Hermes, …).
Everything runs 100% on-device. No API keys, no cloud round-trips, no data leaves your Mac.
Cloud LLM tokens are expensive for high-volume deterministic work (translation, summarization, OCR, transcription). Apple Silicon Macs ship a capable on-device AI stack — Foundation Models, Vision, Speech — but only if you write Swift. This server wraps that stack as a single MCP endpoint so any host LLM (Claude, GPT, Gemini) can offload bulk work to your Mac instead of burning tokens.
Concretely it lets a host model say "OCR this image", "transcribe this audio", "polish this Discord reply", "summarize this meeting log" — and the work happens locally in milliseconds, free.
- Discord / chat copilot: `proofread_text`, `rewrite_text(tone="professional")`, and `summarize_text` preserve `@mentions`, `:emoji:`, code fences, and the input language.
- Document workflow: `vision_analyze(mode="ocr")` → `generate_text_structured(schema="extract")` → `generate_text_structured(schema="summarize")` to turn a scanned PDF or photo into structured fields plus a summary.
- Voice-message pipeline: `transcribe_audio` → `summarize_text` → `synthesize_speech` builds a full "spoken-in / spoken-out" loop without leaving the device.
- Image cataloging: `vision_analyze(mode="classify" / "aesthetics" / "document")` plus `image_similarity` for local-photo organization.
- Privacy-sensitive transcription / translation: legal, medical, HR contexts where audio or text must not leave the machine.
- Token-cost optimization for AI clients: push translation / bulk rewrite / sentiment classification to the local model via the recommended host system prompt below, and reserve cloud tokens for reasoning-heavy work.
- Apple Silicon Mac (M1 or later)
- macOS 26 (Tahoe) or later
- Apple Intelligence enabled (System Settings → Apple Intelligence & Siri)
- Full Xcode (Command Line Tools alone don't ship the FoundationModels macros)
- Homebrew + Python 3.10+ (`brew install python3`)

Clone the repository and run the installer:

    git clone https://github.com/falll2000/apple-intelligence-mcp.git
    cd apple-intelligence-mcp
    bash install.sh

The script will:

- Compile the Swift Core Service (release build, `swift build -c release`)
- Create a Python venv and install `mcp` (FastMCP)
- Register the server as a launchd agent (`com.apple-intel-mcp.server`) on port 11435
- Print the exact config snippet for your AI client

For Claude Desktop (stdio), edit `~/Library/Application Support/Claude/claude_desktop_config.json`:

    {
      "mcpServers": {
        "apple-intelligence": {
          "command": "/path/to/apple-intelligence-mcp/mcp-server/venv/bin/python3",
          "args": ["/path/to/apple-intelligence-mcp/mcp-server/server.py", "--stdio"]
        }
      }
    }

`install.sh` prints the absolute paths for your machine. Copy-paste them.
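
As a rough illustration of what that snippet contains, the sketch below builds the same JSON from a checkout path. The helper name `claude_desktop_entry` and the example path are hypothetical; the output of `install.sh` remains the source of truth for the real paths.

```python
import json
from pathlib import Path


def claude_desktop_entry(repo_root: str) -> dict:
    """Build the mcpServers entry shown above for a given checkout path.

    Hypothetical helper for illustration only; install.sh prints the real snippet.
    """
    root = Path(repo_root)
    return {
        "mcpServers": {
            "apple-intelligence": {
                "command": str(root / "mcp-server" / "venv" / "bin" / "python3"),
                "args": [str(root / "mcp-server" / "server.py"), "--stdio"],
            }
        }
    }


if __name__ == "__main__":
    # Example path only; substitute the location of your own checkout.
    print(json.dumps(claude_desktop_entry("/path/to/apple-intelligence-mcp"), indent=2))
```

Both values are absolute paths because the client launches the server without a guaranteed working directory.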

For other clients (HTTP), the HTTP server starts at login via launchd and is reachable at `http://127.0.0.1:11435/mcp`.

    ┌────────────────────────────────────────────┐
    │ AI Client (Claude / GPT / etc.)            │
    └──────────────────┬─────────────────────────┘
                       │  MCP protocol
                       │  (stdio OR streamable-http :11435)
                       ▼
    ┌────────────────────────────────────────────┐
    │ Python FastMCP server                      │
    │ mcp-server/server.py                       │
    │ - 21 @mcp.tool definitions                 │
    │ - SwiftBridge: persistent subprocess +     │
    │   async lock + JSON line protocol          │
    └──────────────────┬─────────────────────────┘
                       │  stdin/stdout JSON lines
                       │  (IPCRequest / IPCResponse)
                       ▼
    ┌────────────────────────────────────────────┐
    │ Swift Core Service (long-lived process)    │
    │ swift-core/AppleIntelCore                  │
    │ - CoreService.swift (request router)       │
    │ - per-domain handlers (see modules)        │
    │ - Apple frameworks loaded once on launch   │
    └──────────────────┬─────────────────────────┘
                       │
                       ▼
            FoundationModels  ←─ on-device LLM (~3B)
            Vision            ←─ 18 image / pose tasks
            NaturalLanguage   ←─ tokenize / NER / POS …
            Speech            ←─ offline STT
            AVFoundation      ←─ offline TTS
            SoundAnalysis     ←─ audio classification

Why two processes? FastMCP is Python-native; Apple AI frameworks are Swift-only. The Swift binary stays resident so frameworks (which take seconds to initialize) load once. The Python layer is thin: it handles the MCP protocol, tool schemas and descriptions, and serialization. Each `await bridge.call(...)` writes one JSON line to the Swift process's stdin and reads one JSON line from its stdout, under an `asyncio.Lock` that keeps the request/response stream serialized.
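
The pattern described above (persistent subprocess, async lock, JSON lines) can be sketched as follows. This is not the project's `SwiftBridge`: the class name `LineBridge`, the request fields (`id`, `tool`, `params`), and the echo child that stands in for the Swift binary are all assumptions made so the example runs anywhere Python does.

```python
import asyncio
import json
import sys

# Stand-in for the Swift Core Service: reads one JSON line, writes one JSON line.
# Assumption: the real binary and its field names differ; this child only echoes.
FAKE_CORE = r"""
import json, sys
for line in sys.stdin:
    req = json.loads(line)
    resp = {"id": req["id"], "ok": True,
            "result": {"tool": req["tool"], "params": req["params"]}}
    sys.stdout.write(json.dumps(resp) + "\n")
    sys.stdout.flush()
"""


class LineBridge:
    """Persistent child process + asyncio.Lock + JSON-lines framing."""

    def __init__(self, argv):
        self._argv = argv
        self._proc = None
        self._lock = asyncio.Lock()
        self._next_id = 0

    async def start(self):
        self._proc = await asyncio.create_subprocess_exec(
            *self._argv,
            stdin=asyncio.subprocess.PIPE,
            stdout=asyncio.subprocess.PIPE,
        )

    async def call(self, tool, params):
        # The lock keeps exactly one request/response pair on the pipe at a time.
        async with self._lock:
            self._next_id += 1
            request = {"id": self._next_id, "tool": tool, "params": params}
            self._proc.stdin.write((json.dumps(request) + "\n").encode())
            await self._proc.stdin.drain()
            line = await self._proc.stdout.readline()
        if not line:
            raise RuntimeError("core process closed its stdout")
        return json.loads(line)

    async def close(self):
        # Closing stdin sends EOF, which ends the child's read loop.
        self._proc.stdin.close()
        await self._proc.wait()


async def demo():
    bridge = LineBridge([sys.executable, "-c", FAKE_CORE])
    await bridge.start()
    try:
        # Concurrent callers are serialized by the lock, so replies cannot interleave.
        return await asyncio.gather(
            bridge.call("tokenize_text", {"text": "hello world"}),
            bridge.call("lemmatize_text", {"text": "running"}),
        )
    finally:
        await bridge.close()


if __name__ == "__main__":
    for reply in asyncio.run(demo()):
        print(reply)
```

Because the child answers strictly in request order, a single lock is enough here; a design that allowed several requests in flight would instead need to match replies to requests by `id`.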

`swift-core/Sources/AppleIntelCore/` is split into one handler per Apple-framework concern. Adding a new tool follows a predictable pattern:

    main.swift                 ← entry point (await CoreService.run())
    Models.swift               ← IPCRequest / IPCResponse / JSONValue
    HandlerError.swift         ← typed errors (invalidInput / unavailable / …)
    CoreService.swift          ← request router; adds a `case "<tool>":` per tool
                                 and forwards to the right handler
    GenerateHandler.swift      ← FoundationModels:
                                   - generate_text (free-form)
                                   - generate_text_structured (@Generable schemas)
    TranslateHandler.swift     ← FM-prompt translation w/ per-target-language
                                 instructions (avoids the "model thinks input is
                                 already English" trap on zh→en)
    WritingToolsHandler.swift  ← FM-prompt proofread / rewrite / summarize:
                                   - NLLanguageRecognizer + CJK ratio routing
                                   - per-language instructions (zh-Hant/zh-Hans/en/ja)
                                   - Discord-aware (preserves @/:emoji:/```fences)
    OCRHandler.swift           ← Vision text recognition (zh/en/ja/ko)
    VisionExtHandler.swift     ← Vision: faces, barcodes, contours, text regions,
                                 face landmarks, human bodies, horizon,
                                 segment_foreground, aesthetics, optical_flow,
                                 custom Core ML object detection, image similarity
    VisionPoseHandler.swift    ← Vision: 2D body pose, hand pose, animals,
                                 rectangles, saliency, document, person segment,
                                 3D body pose (guarded; see Known limits)
    AnalyzeHandler.swift       ← NL: sentiment, language detection, NER, keywords
    NLAdvancedHandler.swift    ← NL: tokenize, lemmatize, POS tagging
    NLEmbeddingHandler.swift   ← NL: word / sentence semantic similarity
    TranscribeHandler.swift    ← Speech: offline STT (SFSpeechRecognizer)
    SpeechSynthHandler.swift   ← AVFoundation TTS → .wav file + voice list
    SoundHandler.swift         ← SoundAnalysis: ambient sound classification

Checklist for adding a tool:

- Pick the matching handler (or create a new one if the framework is new).
- Implement the Swift function: return a value, throw `HandlerError` on bad input.
- In `CoreService.swift`, add a `case "<tool_name>":` that decodes params and calls the handler.
- In `mcp-server/server.py`, add an `@mcp.tool()` function with a WHEN/NOT-FOR docstring and an `await bridge.call("<tool_name>", {...})`.
- Rebuild Swift (`swift build -c release`), then restart MCP (`launchctl kickstart -k gui/$UID/com.apple-intel-mcp.server`).
- Document the tool in this README and in `README.zh-Hant.md`.

The 18 single-image Vision capabilities are routed through one tool (`vision_analyze`) with a `mode` parameter, instead of 18 individual tools; this measurably improves host-LLM tool-selection accuracy.
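
A minimal sketch of the single-tool routing idea: one entry point validates `mode` against a fixed set and builds one request. The mode names are copied from the tool table in this README; the function name `build_vision_request` and the payload keys (`tool`, `params`, `path`) are assumptions for illustration, not the server's actual wire format.

```python
# Modes advertised by vision_analyze, copied from the tool table in this README.
VISION_MODES = frozenset({
    "ocr", "classify", "faces", "face_landmarks", "barcodes", "text_regions",
    "contours", "human_bodies", "rectangles", "horizon", "saliency", "document",
    "segment_person", "segment_foreground", "aesthetics", "body_pose",
    "hand_pose", "animals",
})


def build_vision_request(image_path: str, mode: str) -> dict:
    """Validate the mode and build one request payload for the routed tool.

    The payload keys are hypothetical; only the mode names come from the README.
    """
    if mode not in VISION_MODES:
        raise ValueError(
            f"unknown mode {mode!r}; expected one of {sorted(VISION_MODES)}"
        )
    return {"tool": "vision_analyze", "params": {"path": image_path, "mode": mode}}


if __name__ == "__main__":
    print(build_vision_request("/tmp/receipt.png", "ocr"))
```

Rejecting unknown modes up front gives the host model a readable error listing the valid choices instead of a failure deeper in the pipeline.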

| Tool | Description |
|---|---|
| `generate_text` | General text generation / rewriting |
| `generate_text_structured` | Guided generation with guaranteed JSON. Schemas: `list` / `classify` / `summarize` / `extract` / `qa` (each has its own prompt-quality guidance in the tool description) |
| `translate_text` | Translation between zh-Hant / zh-Hans / en / ja / ko / fr / de / es. Uses per-target-language instructions |
| `proofread_text` | Fix typos / grammar / punctuation in user-supplied text. Preserves tone, language, and Discord syntax (`@mentions`, `:emoji:`, code blocks) |
| `rewrite_text` | Rewrite in a different tone (formal / casual / concise / friendly / professional) while preserving meaning, language, and Discord syntax |
| `summarize_text` | Condense text to short / medium / long prose. Same-language in/out (zh→zh, en→en) |


| Tool | Description |
|---|---|
| `vision_analyze` | 18-task router. `mode` ∈ {ocr, classify, faces, face_landmarks, barcodes, text_regions, contours, human_bodies, rectangles, horizon, saliency, document, segment_person, segment_foreground, aesthetics, body_pose, hand_pose, animals} |
| `image_similarity` | Visual similarity score between two image files (Vision feature print L2 distance, thresholds tuned 0.1 / 0.4 / 0.8) |
| `detect_optical_flow` | Per-pixel motion vectors between two frames |
| `detect_trajectories` | Parabolic trajectory detection on a local video file |
| `detect_objects` | Object detection with a user-supplied Core ML model (`.mlmodel` / `.mlmodelc`) |

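
To make the `image_similarity` thresholds concrete, here is a small sketch that computes an L2 distance between two feature vectors and buckets it at 0.1 / 0.4 / 0.8. Only the three threshold values come from the table; the bucket labels, the boundary handling, and the function names are hypothetical, and real Vision feature prints are much longer vectors than the toy inputs used here.

```python
import math
from bisect import bisect_right

# Threshold values quoted in the table above; the labels are hypothetical.
THRESHOLDS = (0.1, 0.4, 0.8)
LABELS = ("near-identical", "similar", "loosely related", "different")


def l2_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def bucket(distance):
    """Map a distance to a label; a smaller distance means more similar.

    Assumption: a distance exactly on a threshold falls into the looser bucket.
    """
    return LABELS[bisect_right(THRESHOLDS, distance)]


if __name__ == "__main__":
    d = l2_distance([0.10, 0.20, 0.30], [0.12, 0.18, 0.33])
    print(round(d, 4), bucket(d))
```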

| Tool | Description |
|---|---|
| `analyze_text` | Sentiment + language detection + NER + keywords |
| `tokenize_text` | Split into words / sentences / paragraphs (multilingual; correctly segments Chinese) |
| `tag_parts_of_speech` | POS tagging |
| `lemmatize_text` | Reduce words to base form (running → run) |
| `word_similarity` | Semantic similarity between two words (0–1) |
| `sentence_similarity` | Semantic similarity between two sentences (0–1) |


| Tool | Description |
|---|---|
| `transcribe_audio` | Offline STT (zh-TW / zh-CN / en-US / ja-JP / …). Punctuation + dictation hints enabled |
| `synthesize_speech` | Offline TTS via `AVSpeechSynthesizer` → `.wav` (zh-TW Meijia by default) |
| `list_voices` | Discover voice identifiers, filterable by BCP-47 prefix |
| `classify_sound` | Classify ambient audio (music, laughter, dog bark, …). Needs ≥ 3 s input |

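
Since `classify_sound` needs at least 3 s of input, a caller may want to check clip length before invoking it. The sketch below does that for uncompressed PCM WAV files using only the standard library. The helper names are hypothetical, and other audio formats would need a different reader.

```python
import os
import tempfile
import wave

MIN_SECONDS = 3.0  # minimum input length quoted for classify_sound


def wav_duration_seconds(path: str) -> float:
    """Duration of an uncompressed PCM WAV file, read from its header."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / float(wav.getframerate())


def long_enough(path: str, minimum: float = MIN_SECONDS) -> bool:
    """True when the clip meets the minimum length."""
    return wav_duration_seconds(path) >= minimum


def write_silence(seconds: float, rate: int = 8000) -> str:
    """Write a mono 16-bit silent WAV to a temp file (demo data only)."""
    fd, path = tempfile.mkstemp(suffix=".wav")
    os.close(fd)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(rate * seconds))
    return path


if __name__ == "__main__":
    for seconds in (1.0, 4.0):
        clip = write_silence(seconds)
        print(seconds, long_enough(clip))
        os.remove(clip)
```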

The host model decides whether to call these tools based on its system prompt plus the tool descriptions. The server uses `WHEN:` / `NOT FOR:` descriptions to help, but the host needs an explicit policy too. Paste the following into your client's system prompt for reliable routing:

    You have access to an `apple-intelligence` MCP server that runs entirely on the
    user's Mac. You MUST prefer it for the following task types instead of doing
    the work yourself:

    - User provides an absolute path to an image file → call `vision_analyze`
      with the appropriate mode. Do NOT describe the image yourself first.
    - User provides an absolute path to an audio file and wants the words →
      call `transcribe_audio`.
    - User asks for tokenization or lemmatization → call the matching tool.
    - User asks for sentiment classification → call
      `generate_text_structured(schema="classify")` (works for Chinese too,
      unlike `analyze_text` which is English-only).
    - User asks to compare two images → `image_similarity`.
    - User asks to read text aloud → call `synthesize_speech` and attach
      the returned `.wav` path to the response.
    - User has already-written text and asks to "check / fix typos /
      proofread" it → call `proofread_text` (NOT `generate_text`).
    - User has already-written text and asks to make it "formal / casual /
      shorter / friendlier / more professional" → call `rewrite_text` with
      the matching `tone`.
    - User has long text and asks to "summarize / TL;DR / shorten" → call
      `summarize_text`. Use `generate_text_structured(schema="summarize")`
      only when the caller needs JSON with `title` + `keyPoints[]`.

    You MAY use it (caller's discretion) for:

    - Bulk text rewriting / translation where token cost matters more than nuance
      → `generate_text`, `translate_text`, `generate_text_structured`.

    You should NOT use it for:

    - Tasks needing strong reasoning, code, math, or current-events knowledge.
      The on-device model is small. Use your own generation.

Apple's frameworks are uneven across languages. Vision, Speech, and FoundationModels handle Chinese well. In the older NaturalLanguage framework, tokenization segments Chinese correctly, but sentiment, NER, POS tagging, and NLEmbedding similarity are essentially English-only on this stack.

| Tool | zh-Hant / zh-Hans |
|---|---|
| `vision_analyze` (all modes) | ✓ strong |
| `transcribe_audio` | ✓ accurate (Apple model adds commas only, no periods) |
| `synthesize_speech` | ✓ Meijia / Eloquence voices available |
| `tokenize_text` | ✓ proper word segmentation (牛肉麵 stays as one token) |
| `lemmatize_text` | ✓ correctly a no-op (Chinese has no inflection) |
| `generate_text_structured` (classify) | ✓ usable for Chinese sentiment |
| `translate_text` | ✓ zh→en / zh→ja reliable; en→zh uses standard localized brand forms (蘋果商店, 特斯拉); idioms translate literally |
| `proofread_text` | ⚠ language preserved correctly; FM misses some zh grammar errors (一各/再/的-vs-得) and some en subject-verb agreement |
| `rewrite_text` | ✓ language preserved; professional / concise / formal stable; casual / friendly occasionally paraphrases beyond meaning |
| `summarize_text` | ✓ language preserved (zh→zh, en→en); short length sometimes loose |
| `generate_text` | ⚠ short prompts OK; knowledge cutoff ~2023 |
| `classify_sound` | ⚠ language-agnostic but ranking can be off |
| `analyze_text` | ✗ Chinese sentiment always 0/中性, NER misses Chinese entities |
| `tag_parts_of_speech` | ✗ Chinese tags all return as 「其他」 |
| `word_similarity` / `sentence_similarity` | ✗ no Chinese embedding model |

For Chinese-heavy deployments, exclude the four ✗ tools at the host's MCP config layer (e.g. hermes' `mcp_servers.<name>.tools.exclude`) so the host LLM never tries to route Chinese requests to them.
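
Where a host cannot exclude tools outright, a per-request check is another option. The sketch below estimates how much of a text is Chinese and picks a sentiment tool accordingly, in the spirit of the CJK-ratio routing mentioned for `WritingToolsHandler.swift`. The 0.3 threshold, the function names, and the parameter keys (`prompt`, `text`) are assumptions, not values taken from the server.

```python
def cjk_ratio(text: str) -> float:
    """Fraction of non-space characters in the CJK Unified Ideographs block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    cjk = sum(1 for c in chars if "\u4e00" <= c <= "\u9fff")
    return cjk / len(chars)


def sentiment_tool(text: str, threshold: float = 0.3) -> dict:
    """Pick a tool call for sentiment analysis.

    The threshold and the parameter keys are hypothetical; only the tool and
    schema names come from this README.
    """
    if cjk_ratio(text) >= threshold:
        # analyze_text reports neutral for Chinese, so use the FM-backed classifier.
        return {
            "tool": "generate_text_structured",
            "params": {"schema": "classify", "prompt": text},
        }
    return {"tool": "analyze_text", "params": {"text": text}}


if __name__ == "__main__":
    for sample in ("這家店的牛肉麵很好吃", "The noodles were great"):
        print(sample, "->", sentiment_tool(sample)["tool"])
```

A character-range check is cheap but coarse: it does not distinguish Chinese from Japanese kanji, so treat it as a first-pass filter.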

- Foundation Models safety filter: `generate_text` and related tools may error on certain content. The filter is enforced inside the on-device model, not by this server. Even innocuous body-related characters (e.g. 「胖」 in a brand name) can trip it. Use `generate_text_structured` for content that might trigger it.
- `detect_objects` requires a user-supplied Core ML model (`.mlmodel` or `.mlmodelc`). All other tools work out of the box.
- `detect_trajectories` requires a video file (mp4/mov). It works best with footage of objects following a parabolic path (sports, balls).
- `body_pose_3d` is removed from the public mode list. `VNDetectHumanBodyPose3DRequest` terminates the Swift Core process with an uncaught Objective-C exception during `perform`, before Swift can catch it. The Swift case still exists as a safety net (it returns `unavailable` if a stale client tries) but it's no longer advertised. Use `mode="body_pose"` for stable 2D pose detection.

Apple Intelligence ceilings: the following macOS 26 APIs look callable in the SDK but are not actually usable from a daemon:

| API | Why blocked |
|---|---|
| Writing Tools (`NSWritingToolsCoordinator`) | UI-bound (requires `NSView`); we provide `proofread_text` / `rewrite_text` / `summarize_text` via Foundation Models instead |
| Image Playground (`ImageCreator`) | Returns `backgroundCreationForbidden` even from Terminal; Apple-only entitlement |
| Genmoji | Same path as `ImageCreator(style="emoji")`, same entitlement block |
| Visual Intelligence | Only `AppIntents.AssistantSchemas.VisualIntelligenceIntent`; schema-only, no callable API |
| Smart Reply | `CSSmartReply` is an internal symbol (only in `.tbd`, no public header) |

Vision runtime tests should run from an Xcode-built binary, Terminal, or another unsandboxed local process. Sandboxed runners can produce spurious `CVPixelBuffer`, `ANECF`, or `request cancelled` errors.

`install.sh` registers a launchd agent that starts at login and auto-restarts on crash. Manual control:

    bash start.sh                        # bootstrap launchd agent
    bash stop.sh                         # bootout launchd agent
    tail -f /tmp/apple-intel-mcp.log     # logs
    launchctl kickstart -k gui/$UID/com.apple-intel-mcp.server   # force restart

If you use hermes and want `hermes gateway start/stop/restart` to drive the MCP server too:

    bash install-hermes-integration.sh     # install watchdog
    bash uninstall-hermes-integration.sh   # remove watchdog (keeps mcp running)

This installs a second launchd agent (`com.apple-intel-mcp.hermes-watchdog`) that polls every 3 s and mirrors `ai.hermes.gateway` onto the MCP server:

| Hermes action | MCP reaction (≤ 3 s lag) |
|---|---|
| `hermes gateway stop` | `bootout` MCP |
| `hermes gateway start` | `bootstrap` MCP |
| `hermes gateway restart` | `kickstart -k` MCP (PID change detection) |

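
The mirroring rule in the table can be expressed as a comparison between two consecutive polls. The real watchdog is a shell script (`bin/hermes-watchdog.sh`); the Python function below only sketches the decision step, and it assumes the gateway's state is tracked as a PID, with `None` meaning "not running". The function name `decide` is hypothetical.

```python
from typing import Optional


def decide(prev_pid: Optional[int], cur_pid: Optional[int]) -> Optional[str]:
    """Return the launchctl verb to apply to the MCP agent, or None for no change.

    A PID of None means the gateway was not running at that poll.
    """
    if prev_pid is not None and cur_pid is None:
        return "bootout"        # gateway stopped
    if prev_pid is None and cur_pid is not None:
        return "bootstrap"      # gateway started
    if prev_pid is not None and prev_pid != cur_pid:
        return "kickstart -k"   # gateway restarted: still running, new PID
    return None                 # same state as the previous poll


if __name__ == "__main__":
    polls = [None, 100, 100, 101, None]
    for prev, cur in zip(polls, polls[1:]):
        print(prev, cur, decide(prev, cur))
```

Comparing PIDs rather than a simple running flag is what lets a restart be noticed even when the gateway is up at both polls.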
The integration is purely additive — MCP runs fine on its own. install.sh
prints a hint if it detects hermes installed.
Implementation note: the watchdog script is copied into `~/Library/Application Support/apple-intel-mcp/` at install time, because launchd refuses to execute shell scripts directly from `/Volumes/` on macOS 26 (TCC blocks it as "Operation not permitted"). The Python venv binary doesn't hit this restriction.

    bash uninstall.sh    # removes mcp + watchdog (if installed)

Project layout:

    apple-intelligence-mcp/
    ├── install.sh / uninstall.sh
    ├── install-hermes-integration.sh / uninstall-hermes-integration.sh
    ├── start.sh / stop.sh
    ├── bin/
    │   └── hermes-watchdog.sh            # polls ai.hermes.gateway, syncs mcp state
    ├── mcp-server/
    │   ├── server.py                     # FastMCP server + SwiftBridge (~650 LOC)
    │   └── requirements.txt              # mcp>=1.0.0
    ├── swift-core/
    │   ├── Package.swift                 # macOS 26, Swift 6
    │   └── Sources/AppleIntelCore/       # ~2,500 LOC, one handler per framework
    │       ├── main.swift                # entry point
    │       ├── CoreService.swift         # request router
    │       ├── Models.swift              # IPC types
    │       ├── HandlerError.swift        # typed errors
    │       ├── GenerateHandler.swift     # Foundation Models
    │       ├── TranslateHandler.swift    # FM translation
    │       ├── WritingToolsHandler.swift # proofread/rewrite/summarize
    │       ├── OCRHandler.swift          # Vision OCR
    │       ├── VisionExtHandler.swift    # Vision detect tools
    │       ├── VisionPoseHandler.swift   # Vision pose / motion
    │       ├── AnalyzeHandler.swift      # NL sentiment/NER/keywords
    │       ├── NLAdvancedHandler.swift   # NL tokenize/POS/lemma
    │       ├── NLEmbeddingHandler.swift  # NL similarity
    │       ├── TranscribeHandler.swift   # Speech STT
    │       ├── SpeechSynthHandler.swift  # AVFoundation TTS
    │       └── SoundHandler.swift        # SoundAnalysis
    └── test-assets/                      # sample images for testing

MIT