The Local-First Mobile LLM Infrastructure. Zero friction. Pure Rust. Native Edge AI.
Warning
Active Development: Not Production Ready
The Rust core (hardware profiler, resumable downloader, GGUF inference engine, TalaDB RAG embedder) is implemented. The React Native JSI bridge and TypeScript API surface are still being wired up. APIs are unstable and subject to change. Contributions and early feedback are very welcome; star the repo to follow progress.
Running Small Language Models (SLMs) like Gemma directly on mobile devices is the future of privacy-first, zero-latency applications. But today, the Developer Experience (DX) is fundamentally broken:
- The App Store Trap: Bundling a 1.5 GB+ `.gguf` quantized model directly into an app binary destroys user acquisition and violates App Store cellular download limits.
- The C++ Boilerplate: React Native and Flutter developers are forced to wrestle with complex C++ wrappers, asynchronous bridging overhead, and memory leaks just to stream tokens.
- The RAG Fragmentation: Building offline Retrieval-Augmented Generation (RAG) requires developers to manually stitch together text chunkers, separate embedding models, and local vector databases.
| Layer | Description |
|---|---|
| 🦀 Pure Rust Core | Memory-safe GGUF inference via llama-cpp-2: Metal GPU on iOS/macOS, NEON on Android ARM64, CPU fallback everywhere else |
| ⚡ Zero-Overhead Bridging | Direct JSI for React Native, flutter_rust_bridge FFI for Flutter; per-token streaming via C-ABI callbacks, no async bottlenecks |
| 📦 Shell & Fetch Delivery | Resumable background downloader with HTTP Range support and SHA-256 integrity verification. Device hardware is profiled before any download to prevent OOM crashes. Initial app binary stays < 50 MB |
| 🧠 Plug-and-Play Local RAG | Optional TalaDB plugin: auto-chunks text, runs all-MiniLM-L6-v2 locally for 384-dim embeddings, persists vectors in TalaDB's HNSW index; Rust-to-Rust, zero serialisation overhead |
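
To make the auto-chunking step of the RAG layer more concrete, here is a minimal sketch of a fixed-size chunker with overlap. It is an illustration only: the function name `chunkText`, the character-based window, and the default sizes are assumptions, not the TalaDB plugin's actual chunking strategy.

```typescript
// Hypothetical sketch: split text into overlapping, fixed-size character windows.
// The plugin's real chunking rules are not documented here; all sizes are assumptions.
function chunkText(text: string, chunkSize = 200, overlap = 40): string[] {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new Error('chunkSize must be > 0 and overlap must be in [0, chunkSize)');
  }
  const chunks: string[] = [];
  const step = chunkSize - overlap; // how far the window advances each iteration
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // this window already reached the end
  }
  return chunks;
}
```

For example, `chunkText('abcdefghij', 4, 2)` yields the windows `abcd`, `cdef`, `efgh`, `ghij`. Each chunk would then be embedded into a 384-dim vector and stored; the overlap keeps sentences that straddle a boundary retrievable from either side.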

    lokal-ml/
    ├── packages/
    │   ├── lokal-ml-core/            # 🦀 Rust: hardware profiler, resumable downloader, GGUF engine
    │   ├── lokal-ml-taladb/          # 🦀 Rust: text chunker, MiniLM embedder, TalaDB vector injector
    │   ├── lokal-ml-react-native/    # 📱 React Native JSI bridge + TypeScript API
    │   │   └── rust/                 # C-ABI FFI layer (cbindgen → lokal-ml.h)
    │   └── lokal-ml-taladb-plugin/   # 🔌 @lokal-ml/taladb-plugin TypeScript wrapper
    └── registry/
        └── models.json               # Model manifest (URLs, SHA-256, min RAM requirements)

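
The tree describes `registry/models.json` as holding URLs, SHA-256 digests, and minimum RAM requirements. A manifest entry could be typed and sanity-checked roughly as follows. The field names (`id`, `url`, `sha256`, `minRamGb`) and the validation rules are assumptions for illustration; the actual schema is not shown in this README.

```typescript
// Hypothetical shape of one entry in registry/models.json.
// Only the kinds of data (URL, SHA-256, min RAM) come from the README; names are assumed.
interface ModelManifestEntry {
  id: string;       // e.g. 'qwen3-0.6b'
  url: string;      // download URL for the GGUF file
  sha256: string;   // expected hex digest of the downloaded file
  minRamGb: number; // minimum device RAM needed to load the model
}

// Returns a list of problems; an empty list means the entry looks well-formed.
function validateEntry(entry: ModelManifestEntry): string[] {
  const problems: string[] = [];
  if (!/^[0-9a-f]{64}$/i.test(entry.sha256)) problems.push('sha256 must be 64 hex characters');
  if (!/^https:\/\//.test(entry.url)) problems.push('url must use https');
  if (!(entry.minRamGb > 0)) problems.push('minRamGb must be positive');
  return problems;
}
```

A check like this is cheap to run in CI against the manifest, so a malformed digest is caught before any device attempts a multi-gigabyte download.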
The registry ships 9 models across three device tiers. Gemma 4 E-series is our top recommendation: these are Google's purpose-built edge models with 128K context and multimodal support, tuned specifically for on-device deployment.
Note: Gemma and MedGemma models require accepting Google's terms on HuggingFace before the download URL resolves. Qwen3 models are fully open.
| Model ID | Active / Total Params | File Size | Min RAM | Notes |
|---|---|---|---|---|
| `gemma4-e2b` | 2.3B / 5B | 3.46 GB Q4_K_M | 5 GB | Best overall. Any-to-any, 128K ctx |
| `gemma4-e4b` | 4.5B / 8B | 5.41 GB Q4_K_M | 7 GB | Best quality. Any-to-any, 128K ctx |
"Any-to-any" = text + image + audio input. Both require accepting Google's Gemma terms on HuggingFace.
| Model ID | Params | File Size | Min RAM | Notes |
|---|---|---|---|---|
| `gemma3-4b` | 4B | 3.16 GB QAT Q4_0 | 4.5 GB | Official Google quant. Multimodal* |
| `medgemma-4b` | 4B | 2.49 GB Q4_K_M | 3.5 GB | Medical text + image. 128K ctx |
| `qwen3-4b` | 4B | 2.50 GB Q4_K_M | 3.5 GB | 256K ctx. No HF token needed |

\* Gemma 3 4B vision requires a separate mmproj GGUF (851 MB); the registry URL is text-only.
| Model ID | Params | File Size | Min RAM | Notes |
|---|---|---|---|---|
| `gemma3-1b` | 1B | 1.00 GB QAT Q4_0 | 1.5 GB | Official Google quant. Ultra-fast |
| `qwen3-1.7b` | 1.7B | 1.83 GB Q8_0 | 2.5 GB | 256K ctx. No HF token needed |
| `qwen3-0.6b` | 0.6B | 639 MB Q8_0 | 1 GB | Smallest chat model. No HF token |
| Model ID | Size | Notes |
|---|---|---|
| `all-minilm-l6-v2` | ~23 MB | 384-dim vectors for TalaDB RAG |
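
The Min RAM columns above suggest a simple gating rule: offer only the models whose minimum fits the device. The sketch below copies the Min RAM figures from the tables; the function name `modelsThatFit` and the "most demanding first" ordering are assumptions, and how the real hardware profiler measures memory is not specified here.

```typescript
// Min RAM values are taken from the registry tables; the selection rule is an assumption.
const CHAT_MODELS: { id: string; minRamGb: number }[] = [
  { id: 'gemma4-e4b', minRamGb: 7 },
  { id: 'gemma4-e2b', minRamGb: 5 },
  { id: 'gemma3-4b', minRamGb: 4.5 },
  { id: 'medgemma-4b', minRamGb: 3.5 },
  { id: 'qwen3-4b', minRamGb: 3.5 },
  { id: 'qwen3-1.7b', minRamGb: 2.5 },
  { id: 'gemma3-1b', minRamGb: 1.5 },
  { id: 'qwen3-0.6b', minRamGb: 1 },
];

// Returns the IDs of every chat model whose minimum RAM fits, most demanding first.
function modelsThatFit(deviceRamGb: number): string[] {
  return CHAT_MODELS
    .filter((m) => m.minRamGb <= deviceRamGb) // filter() copies, so sort() does not mutate the table
    .sort((a, b) => b.minRamGb - a.minRamGb)
    .map((m) => m.id);
}
```

On a device reporting 4 GB, this returns `medgemma-4b` and `qwen3-4b` first, followed by the three smaller models; `gemma3-4b` (4.5 GB) and both Gemma 4 models are excluded.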
Pass an absolute path instead of a model ID to load any GGUF you manage yourself:

    const ai = await Lokal.init({ model: '/path/to/your/custom.gguf' });

Useful for private fine-tunes, enterprise models, or any quantisation not in the registry. The hardware profiler still runs, but SHA-256 verification is your responsibility.
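
Since integrity checking of a custom GGUF is left to you, one option is to compute the digest in a Node script when you produce or publish the file and compare it against a value you recorded. The sketch below uses Node's built-in `crypto` and `fs` modules; the helper names `sha256OfChunks` and `verifyModelFile` are hypothetical and not part of the Lokal API, and on-device verification would need a different hashing library.

```typescript
import { createHash } from 'node:crypto';
import { createReadStream } from 'node:fs';

// Hashes data chunk by chunk, so a multi-gigabyte GGUF never has to sit fully in memory.
async function sha256OfChunks(
  chunks: AsyncIterable<Uint8Array> | Iterable<Uint8Array>,
): Promise<string> {
  const hash = createHash('sha256');
  for await (const chunk of chunks) hash.update(chunk);
  return hash.digest('hex'); // lowercase hex string
}

// Compares the file's digest with one you recorded earlier (case-insensitive).
async function verifyModelFile(path: string, expectedHex: string): Promise<boolean> {
  const actual = await sha256OfChunks(createReadStream(path));
  return actual === expectedHex.toLowerCase();
}
```

Run `verifyModelFile('/path/to/your/custom.gguf', expectedDigest)` before shipping the file, and treat a `false` result as a corrupted or tampered artifact.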
| Package | Status | Description |
|---|---|---|
| `@lokal-ml/react-native` | 🚧 In Development | Core engine: hardware check, model download, GGUF inference |
| `@lokal-ml/taladb-plugin` | 🚧 In Development | Optional RAG layer: offline vector memory via TalaDB |

    import { Lokal, ModelManager } from '@lokal-ml/react-native';
    import { TalaPlugin } from '@lokal-ml/taladb-plugin';
    import { openDB } from '@taladb/react-native';

    // 1. Profiler prevents OOM crashes on older devices
    const canRun = await ModelManager.checkRequirements('gemma4-e2b');
    if (!canRun) {
      console.log('Device cannot run local AI; falling back to cloud.');
      return;
    }

    // 2. Resumable background download (Wi-Fi enforced, fires only if not cached)
    await ModelManager.downloadModel('gemma4-e2b', {
      requireWifi: true,
      onProgress: (p) => setProgress(p),
    });

    // 3. Connect to local-first storage & initialise engine
    const db = await openDB('local_data.db');
    const ai = await Lokal.init({
      model: 'gemma4-e2b',
      plugins: [new TalaPlugin({ db, collection: 'knowledge_base' })],
    });

    // 4. Ingest your documents (auto-chunked + auto-embedded locally)
    await ai.plugins.TalaRAG.ingest({
      data: [{ id: 'policy_1', text: 'Enterprise SLAs require a 2-hour response time...' }],
    });

    // 5. Stream instantly with embedded RAG context
    await ai.chat({
      prompt: 'What is the enterprise SLA response time?',
      useRAG: true,
      // process.stdout is not available in React Native; log or append to UI state instead
      onToken: (token) => console.log(token),
    });

- ✅ Initial app binary < 50 MB, no model weights bundled
- ✅ Weights fetched post-install via HTTP Range (resumable, survives backgrounding)
- ✅ `requireWifi: true` enforced by default
- ✅ Files stored in OS-designated app data directory (excluded from iCloud backup)
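
Resuming via HTTP Range comes down to asking the server only for the bytes that are not yet on disk. The sketch below shows that decision in isolation. `resumeHeaders` and `shouldAppend` are hypothetical helper names, and the Rust downloader's actual logic may differ; only the `Range: bytes=N-` header and the 206/200 status semantics are standard HTTP.

```typescript
// Hypothetical helper: choose request headers when resuming a partial download.
// bytesOnDisk is the size of the partial file; totalBytes is the expected size, if known.
// Returns null when the file is already complete and no request is needed.
function resumeHeaders(bytesOnDisk: number, totalBytes?: number): Record<string, string> | null {
  if (totalBytes !== undefined && bytesOnDisk >= totalBytes) return null; // nothing left to fetch
  if (bytesOnDisk <= 0) return {}; // fresh download, no Range header needed
  return { Range: `bytes=${bytesOnDisk}-` }; // request only the remaining bytes
}

// 206 Partial Content: the server honoured the range, so append the body to the partial file.
// 200 OK: the server ignored the Range header and is sending everything, so restart from byte 0.
function shouldAppend(status: number): boolean {
  return status === 206;
}
```

After the final byte arrives, the SHA-256 check against the manifest digest is what confirms that the stitched-together file is intact.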
Prerequisites: Rust stable, cmake, Node ≥ 18, pnpm ≥ 9

    # Clone
    git clone https://github.com/thinkgrid-labs/lokal-ml
    cd lokal-ml

    # Rust workspace (requires cmake for llama.cpp)
    cargo fmt --all -- --check
    cargo clippy --workspace -- -D warnings
    cargo check --workspace
    cargo test --workspace

    # JS packages
    pnpm install
    pnpm typecheck

CI runs fmt, clippy, check, and test on every push via GitHub Actions, plus cross-compilation checks for aarch64-apple-ios and the three primary Android ABIs (aarch64, armv7, x86_64).
MIT © 2026 thinkgrid-labs
