thinkgrid-labs/lokal-ml

Lokal ML (@lokal-ml)

The Local-First Mobile LLM Infrastructure. Zero friction. Pure Rust. Native Edge AI.

Warning

Active Development: Not Production Ready
The Rust core (hardware profiler, resumable downloader, GGUF inference engine, TalaDB RAG embedder) is implemented. The React Native JSI bridge and TypeScript API surface are still being wired up. APIs are unstable and subject to change. Contributions and early feedback are very welcome; star the repo to follow progress.


The Problem

Running Small Language Models (SLMs) like Gemma directly on mobile devices is the future of privacy-first, zero-latency applications. But today, the Developer Experience (DX) is fundamentally broken:

  • The App Store Trap: Bundling a 1.5 GB+ .gguf quantized model directly into an app binary destroys user acquisition and violates App Store cellular download limits.
  • The C++ Boilerplate: React Native and Flutter developers are forced to wrestle with complex C++ wrappers, asynchronous bridging overhead, and memory leaks just to stream tokens.
  • The RAG Fragmentation: Building offline Retrieval-Augmented Generation (RAG) requires developers to manually stitch together text chunkers, separate embedding models, and local vector databases.

Architecture

| Layer | Description |
| --- | --- |
| 🦀 Pure Rust Core | Memory-safe GGUF inference via llama-cpp-2: Metal GPU on iOS/macOS, NEON on Android ARM64, CPU fallback everywhere else. |
| ⚡ Zero-Overhead Bridging | Direct JSI for React Native, flutter_rust_bridge FFI for Flutter: per-token streaming via C-ABI callbacks, no async bottlenecks. |
| 📦 Shell & Fetch Delivery | Resumable background downloader with HTTP Range support and SHA-256 integrity verification. Device hardware is profiled before any download to prevent OOM crashes. Initial app binary stays < 50 MB. |
| 🧠 Plug-and-Play Local RAG | Optional TalaDB plugin: auto-chunks text, runs all-MiniLM-L6-v2 locally for 384-dim embeddings, persists vectors in TalaDB's HNSW index. Rust-to-Rust, zero serialisation overhead. |

Repository Structure

lokal-ml/
├── packages/
│   ├── lokal-ml-core/                # 🦀 Rust: hardware profiler, resumable downloader, GGUF engine
│   ├── lokal-ml-taladb/              # 🦀 Rust: text chunker, MiniLM embedder, TalaDB vector injector
│   ├── lokal-ml-react-native/        # 📱 React Native JSI bridge + TypeScript API
│   │   └── rust/                     #    C-ABI FFI layer (cbindgen → lokal-ml.h)
│   └── lokal-ml-taladb-plugin/       # 🔌 @lokal-ml/taladb-plugin TypeScript wrapper
└── registry/
    └── models.json                   # Model manifest (URLs, SHA-256, min RAM requirements)

Model Registry

The registry ships 9 models across three device tiers. Gemma 4 E-series is our top recommendation: these are Google's purpose-built edge models with 128K context and multimodal support, tuned specifically for on-device deployment.

Note: Gemma and MedGemma models require accepting Google's terms on HuggingFace before the download URL resolves. Qwen3 models are fully open.

⭐ Recommended: Gemma 4 Edge (flagship phones, 5–8 GB RAM)

| Model ID | Active / Total Params | File Size | Min RAM | Notes |
| --- | --- | --- | --- | --- |
| gemma4-e2b | 2.3B / 5B | 3.46 GB Q4_K_M | 5 GB | Best overall. Any-to-any, 128K ctx |
| gemma4-e4b | 4.5B / 8B | 5.41 GB Q4_K_M | 7 GB | Best quality. Any-to-any, 128K ctx |

"Any-to-any" = text + image + audio input. Both require accepting Google's Gemma terms on HuggingFace.

Compact (mid-range phones, iPad: 3.5–5 GB RAM)

| Model ID | Params | File Size | Min RAM | Notes |
| --- | --- | --- | --- | --- |
| gemma3-4b | 4B | 3.16 GB QAT Q4_0 | 4.5 GB | Official Google quant. Multimodal* |
| medgemma-4b | 4B | 2.49 GB Q4_K_M | 3.5 GB | Medical text + image. 128K ctx |
| qwen3-4b | 4B | 2.50 GB Q4_K_M | 3.5 GB | 256K ctx. No HF token needed |

* Gemma 3 4B vision requires a separate mmproj GGUF (851 MB); the registry URL is text-only.

Nano (any modern phone: < 2.5 GB RAM)

| Model ID | Params | File Size | Min RAM | Notes |
| --- | --- | --- | --- | --- |
| gemma3-1b | 1B | 1.00 GB QAT Q4_0 | 1.5 GB | Official Google quant. Ultra-fast |
| qwen3-1.7b | 1.7B | 1.83 GB Q8_0 | 2.5 GB | 256K ctx. No HF token needed |
| qwen3-0.6b | 0.6B | 639 MB Q8_0 | 1 GB | Smallest chat model. No HF token |
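The Min RAM column in the three tier tables can drive a simple compatibility filter. The sketch below lists which chat models fit a given amount of device RAM. The `ModelEntry` shape and the `modelsThatFit` helper are hypothetical and only mirror the tables; the real schema of registry/models.json and the logic of the Rust hardware profiler may differ, and treating Min RAM as total device RAM is an assumption.

```typescript
// Hypothetical shape; the real registry/models.json schema may differ.
interface ModelEntry {
  id: string;
  minRamGb: number; // "Min RAM" column from the tier tables
}

// Chat model IDs and Min RAM values copied from the tier tables.
const CHAT_MODELS: ModelEntry[] = [
  { id: "gemma4-e4b", minRamGb: 7 },
  { id: "gemma4-e2b", minRamGb: 5 },
  { id: "gemma3-4b", minRamGb: 4.5 },
  { id: "medgemma-4b", minRamGb: 3.5 },
  { id: "qwen3-4b", minRamGb: 3.5 },
  { id: "qwen3-1.7b", minRamGb: 2.5 },
  { id: "gemma3-1b", minRamGb: 1.5 },
  { id: "qwen3-0.6b", minRamGb: 1 },
];

// Return every model whose Min RAM fits the device, most demanding first.
function modelsThatFit(
  deviceRamGb: number,
  models: ModelEntry[] = CHAT_MODELS,
): ModelEntry[] {
  return models
    .filter((m) => m.minRamGb <= deviceRamGb) // filter() copies, so sort() below does not mutate the input
    .sort((a, b) => b.minRamGb - a.minRamGb);
}

// A 4 GB device fits: medgemma-4b, qwen3-4b, qwen3-1.7b, gemma3-1b, qwen3-0.6b
console.log(modelsThatFit(4).map((m) => m.id));
```

The helper returns a candidate list rather than a single winner, because the right choice also depends on the use case (for example, medgemma-4b is specialised for medical content).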

Embedding (RAG only)

| Model ID | Size | Notes |
| --- | --- | --- |
| all-minilm-l6-v2 | ~23 MB | 384-dim vectors for TalaDB RAG |
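To make the role of those embedding vectors concrete, here is a minimal sketch of similarity search over stored chunks. It uses exact brute-force cosine similarity, which is the kind of ranking an HNSW index approximates at scale. The `StoredChunk` type and the `cosine` and `topK` helpers are hypothetical, the vectors are 3-dimensional toys rather than 384-dimensional embeddings, and whether TalaDB ranks by cosine similarity is an assumption.

```typescript
// Hypothetical types and helpers for illustration; not the TalaDB API.
interface StoredChunk {
  id: string;
  vector: number[];
}

// Cosine similarity between two equal-length vectors (0 if either is all zeros).
function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new RangeError("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

// Exact top-k search: score every chunk, sort descending, keep the best k.
function topK(query: number[], chunks: StoredChunk[], k: number) {
  return chunks
    .map((c) => ({ id: c.id, score: cosine(query, c.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

const toyStore: StoredChunk[] = [
  { id: "a", vector: [1, 0, 0] },
  { id: "b", vector: [0, 1, 0] },
  { id: "c", vector: [1, 1, 0] },
];

// "a" is identical to the query, "c" is partially aligned, "b" is orthogonal.
console.log(topK([1, 0, 0], toyStore, 2).map((r) => r.id)); // ["a", "c"]
```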

Bring your own GGUF

Pass an absolute path instead of a model ID to load any GGUF you manage yourself:

const ai = await Lokal.init({ model: '/path/to/your/custom.gguf' });

Useful for private fine-tunes, enterprise models, or any quantisation not in the registry. The hardware profiler still runs, but SHA-256 verification is your responsibility.
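When managing your own file, a cheap sanity check before loading is to confirm that it really is a GGUF: the format starts with the ASCII magic "GGUF" followed by a little-endian 32-bit version number. The sketch below checks those first 8 bytes. The helper name `readGgufVersion` is hypothetical, and how you read the header bytes depends on your file API. This complements SHA-256 verification; it does not replace it.

```typescript
// Hypothetical helper: returns the GGUF version, or null if the header is not GGUF.
function readGgufVersion(header: Uint8Array): number | null {
  if (header.length < 8) return null; // need 4 magic bytes + 4 version bytes
  const magicOk =
    header[0] === 0x47 && // 'G'
    header[1] === 0x47 && // 'G'
    header[2] === 0x55 && // 'U'
    header[3] === 0x46; //   'F'
  if (!magicOk) return null;
  // Respect byteOffset in case `header` is a view into a larger buffer.
  const view = new DataView(header.buffer, header.byteOffset, header.byteLength);
  return view.getUint32(4, true); // little-endian uint32
}

// Toy header: magic "GGUF" followed by version 3.
const sampleHeader = new Uint8Array([0x47, 0x47, 0x55, 0x46, 3, 0, 0, 0]);
console.log(readGgufVersion(sampleHeader)); // 3
```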


Packages

| Package | Status | Description |
| --- | --- | --- |
| @lokal-ml/react-native | 🚧 In Development | Core engine: hardware check, model download, GGUF inference |
| @lokal-ml/taladb-plugin | 🚧 In Development | Optional RAG layer: offline vector memory via TalaDB |

Developer Experience

import { Lokal, ModelManager } from '@lokal-ml/react-native';
import { TalaPlugin } from '@lokal-ml/taladb-plugin';
import { openDB } from '@taladb/react-native';

// Wrapped in an async function so `await` and the early `return` are valid.
// `setProgress` and `appendToken` are callbacks supplied by your own UI code.
async function runLocalAI(setProgress, appendToken) {
  // 1. Profiler prevents OOM crashes on older devices
  const canRun = await ModelManager.checkRequirements('gemma4-e2b');
  if (!canRun) {
    console.log('Device cannot run local AI, falling back to cloud.');
    return;
  }

  // 2. Resumable background download (Wi-Fi enforced, fires only if not cached)
  await ModelManager.downloadModel('gemma4-e2b', {
    requireWifi: true,
    onProgress: (p) => setProgress(p),
  });

  // 3. Connect to local-first storage & initialise engine
  const db = await openDB('local_data.db');
  const ai = await Lokal.init({
    model: 'gemma4-e2b',
    plugins: [new TalaPlugin({ db, collection: 'knowledge_base' })],
  });

  // 4. Ingest your documents (auto-chunked + auto-embedded locally)
  await ai.plugins.TalaRAG.ingest({
    data: [{ id: 'policy_1', text: 'Enterprise SLAs require a 2-hour response time...' }],
  });

  // 5. Stream instantly with embedded RAG context
  await ai.chat({
    prompt: 'What is the enterprise SLA response time?',
    useRAG: true,
    onToken: (token) => appendToken(token),
  });
}

App Store Compliance

  • ✅ Initial app binary < 50 MB: no model weights bundled
  • ✅ Weights fetched post-install via HTTP Range (resumable, survives backgrounding)
  • ✅ requireWifi: true enforced by default
  • ✅ Files stored in OS-designated app data directory (excluded from iCloud backup)
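Resuming via HTTP Range, as listed above, comes down to two decisions: which header to send, and how to interpret the server's status code. The sketch below shows both as pure functions. The names `rangeHeader`, `planFromStatus`, and `ResumePlan` are hypothetical; the actual downloader lives in the Rust core and may handle more cases (redirects, retries, ETag checks).

```typescript
// Hypothetical helpers illustrating HTTP Range resume logic; not the lokal-ml API.

// Ask the server for everything from `bytesOnDisk` onward (no header on a fresh download).
function rangeHeader(bytesOnDisk: number): Record<string, string> {
  return bytesOnDisk > 0 ? { Range: `bytes=${bytesOnDisk}-` } : {};
}

type ResumePlan = "append" | "restart" | "verify" | "error";

// Decide what to do with the response to a (possibly ranged) request.
function planFromStatus(status: number, bytesOnDisk: number): ResumePlan {
  if (status === 206) return "append"; // Partial Content: server honoured the range
  if (status === 200) return "restart"; // server sent the full body: overwrite from byte 0
  if (status === 416 && bytesOnDisk > 0) {
    // Range Not Satisfiable: the offset is at or past the end of the resource,
    // so the local file may already be complete. Confirm with the SHA-256 check.
    return "verify";
  }
  return "error";
}

console.log(rangeHeader(1024)); // { Range: 'bytes=1024-' }
console.log(planFromStatus(206, 1024)); // append
```

Whatever the plan, the SHA-256 check on the finished file remains the final arbiter, since a resumed download can silently stitch together bytes from two different versions of a file.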

Development

Prerequisites: Rust stable, cmake, Node ≥ 18, pnpm ≥ 9

# Clone
git clone https://github.com/thinkgrid-labs/lokal-ml
cd lokal-ml

# Rust workspace (requires cmake for llama.cpp)
cargo fmt --all -- --check
cargo clippy --workspace -- -D warnings
cargo check --workspace
cargo test --workspace

# JS packages
pnpm install
pnpm typecheck

CI runs fmt, clippy, check, and test on every push via GitHub Actions, plus cross-compilation checks for aarch64-apple-ios and the three primary Android ABIs (aarch64, armv7, x86_64).


License

MIT, © 2026 thinkgrid-labs
