A native, on-device Swift port of LiquidAI's ColBERT Tool Selection Space — the same demo, running entirely on Apple Silicon with no server.
An agent with 151 tools can't fit them all in one prompt. Type a request and an LFM2.5 retriever pre-selects the 5 most relevant tools, so you hand a small candidate set to the LLM instead of dumping every schema into the context window.
The app indexes 151 tool definitions spanning 7 domains (e-commerce, devops, travel, support, and more). For each request it scores every tool and shows the top 5 — the candidate set you would route to an LLM instead of all 151 schemas.
Two retrievers, switchable in the UI:
- Embedding —
LFM2.5-Embedding-350M: one CLS-pooled vector per tool, ranked by cosine similarity. - ColBERT —
LFM2.5-ColBERT-350M: one vector per token, ranked by MaxSim late interaction. This is the default, matching the original Space.
Everything runs in-process on the device:
- Tokenize the request (via swift-transformers).
- Encode the query and each tool's routing text with LFM2.5 — the encoder forward runs on the Apple Silicon GPU through MLX-Swift, using mlx-swift-lm's
MLXEmbedders(which gained native LFM2.5 bidirectional-encoder support for this app). - Score on the CPU with Accelerate — a single BLAS call per query over flat, L2-normalized vectors:
cblas_sgemvfor embedding cosine,cblas_sgemm+vDSP_maxvfor ColBERT MaxSim. - Rank and show the top 5.
The tool index is built once when a model loads; each keystroke only re-encodes the query, so a search takes tens of milliseconds (the per-query latency is shown in the UI).
Models are downloaded from the Hugging Face Hub on first run and cached locally; a Precision toggle picks the on-device weights. The bf16 checkpoints are ~680–690 MB each. We ran a quant-aware retrieval eval to choose the quantized ship precisions:
| Retriever | Quantized | Size | NDCG@10 retention vs bf16 |
|---|---|---|---|
| Embedding-350M | int4 | 195 MB (3.5× smaller) | 100.0% |
| ColBERT-350M | int8 | 363 MB (1.9× smaller) | 100.0% |
| ColBERT-350M | int4 | ~195 MB (3.5× smaller) | 98.7% |
Retention is NDCG@10 against the bf16 baseline, measured over NanoBEIR (English) plus MIRACL dev judged pools (Spanish / German / Japanese / Arabic).
- Embedding tolerates int4 losslessly — CLS pooling averages out quantization noise, so int4 is the recommended precision (3.5× smaller, no measurable retrieval loss, even on Japanese and Arabic).
- ColBERT's per-token MaxSim is more quant-sensitive — int8 is lossless and the safe default; int4 retains 98.7% (loss concentrated in English NQ, ~95%) and is the aggressive option when size-bound.
A useful gotcha from the eval: raw embedding cosine drifts more under int4 (~0.96 vs bf16), but rank metrics ignore those small magnitude shifts — which is why retrieval stays ~lossless. Always measure quantization impact on the task metric (retrieval), not on vector cosine.
- Apple Silicon Mac (macOS 14+) or an iOS 17+ device
- Xcode 16+
- Open
SmartToolSelection.xcodeproj. - Select the SmartToolSelection scheme and run.
- On first launch the app downloads the selected model from Hugging Face into
~/Library/Application Support/SmartToolSelection/models/(the full path is printed to the console). Later launches load from that cache.
Pick a backend (Embedding / ColBERT) and precision (bf16 / int8 / int4) at the top of the window; switching either downloads the matching model as needed and rebuilds the index.
- Models — LiquidAI LFM2.5-Embedding-350M and LFM2.5-ColBERT-350M, converted to MLX (model repositories).
- Original demo — LiquidAI's ColBERT Tool Selection Space.
- On-device inference — MLX-Swift and mlx-swift-lm.

