GitHub - Geekgineer/needle-rs: 258 KB WASM runtime for Needle, a 26M-parameter tool-calling transformer. Runs in the browser, Cloudflare Workers, and Node.js. No backend required.

Quick start · How it works · Parity · Upstream model

A pure-Rust + WebAssembly runtime for Needle by Cactus Compute — a 26M-parameter transformer that maps (query, tool list) → JSON call in one forward pass. Deploys to browsers, edge workers, CLIs, Python, and no_std embedded targets. No server, no API key.

Tool calling usually means either a paid API call or hundreds of megabytes on disk. needle-rs ships the whole agent in 23 MB.

Stack	Deploy size	Latency	Cost	Privacy
OpenAI function calling	SDK + hosted API	300–800 ms	$ per token	leaves device
llama.cpp + 1B local model	700 MB+	varies	free	local
ONNX Runtime Web + a model	8 MB + your model	varies	free	local
`needle-rs` + Needle	258 KB + 22 MB	~280 ms	free	local

The same routing accuracy — at a fraction of the footprint.

needle-rs is a community Rust + WASM runtime for Needle by Cactus Compute. The model architecture, training procedure, dataset, and weights are entirely their work, released under MIT. This project provides a deployment layer for contexts the official Python implementation cannot reach: browsers, edge workers, embedded systems, and binary-distribution use cases. If you build with this, you are building on Cactus's Needle — please credit them.

Browser / Node.js — npm install needle-rs

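import init, { NeedleWasm } from "needle-rs";

// `weights`, `vocab`, and `toolsJson` are supplied by the caller:
// the model weights, the vocabulary, and the tool list as JSON.
await init();
const engine = NeedleWasm.load(weights, vocab);
engine.run("Book a flight from London to JFK tomorrow", toolsJson);
// → {"name":"book_flight","arguments":{...}}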

Rust — cargo add needle-infer

use needle_infer::NeedleEngine;

let engine = NeedleEngine::load("needle.safetensors", "vocab.txt")?;
let result = engine.run(query, tools_json);
println!("{}", result.text);

Python — pip install needle-rs

from needle_rs import NeedleEngine

engine = NeedleEngine.load("needle.safetensors", "vocab.txt")
result = engine.run(query, tools_json)
# → [{"name":"book_flight","arguments":{...}}]

Get the weights

huggingface-cli download Abdalrahman/needle-rs-safetensors \
  needle.safetensors vocab.txt --local-dir weights/

Or load directly from a URL in the browser — no install step.

Target	Status	Binary
Browser (WASM)	✓	`258 KB`
Node.js (WASM)	✓	`258 KB`
Cloudflare Workers	✓	`258 KB`
Linux / macOS / Windows CLI	✓	`533 KB`
Python (native wheel)	✓	`pip install needle-rs`
C / C++ / Go / Swift (via FFI)	✓	`557 KB`
`no_std` embedded (Rust)	✓	size varies
iOS / Android (use Cactus)	-	-
Apple NPU / Snapdragon NPU (use Cactus)	-	-

Cactus's official engine targets mobile and NPUs with hand-tuned ARM SIMD. needle-rs targets everywhere else. The two stacks are complementary.

Needle is a 26M-parameter encoder-decoder transformer with a small twist: it's trained to do exactly one thing — emit a function-call JSON object from a query and a tool list. That focus is why a model this small works at all.

1. Encoder–decoder SAN. The encoder reads the query and tool definitions once, in a single forward pass per call. The decoder then generates the output JSON token by token, attending to the encoder's cached KV.
2. INT4 quantization. All attention and FFN weights are packed as 4-bit nibbles with per-32-row scales. Matvec dequantizes on the fly, so the full f32 weight matrix is never materialized. AVX2 on x86_64, NEON on aarch64, scalar fallback for WASM.
3. Constrained decoding. A character-level trie over valid tool names and argument keys, plus a three-state JSON machine, masks logits at every step. Output is always syntactically valid JSON pointing at a real tool: never a hallucinated function name, never broken syntax.
4. Two schema formats. Accepts both the flat `{"location": {"type": "string"}}` style and OpenAI's `{"type":"object","properties":{...}}` style. The Python reference handles only the flat form.
5. Greedy by design. Tool calling is a routing task, not a generation task, so temperature would only introduce errors. `needle-rs` is argmax-only and intentionally does not support stochastic sampling.
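
To make the INT4 step above concrete, here is a minimal scalar sketch of a matvec that decodes nibbles as they are consumed. The layout is an assumption for illustration only (low nibble first, values stored with an offset of 8, one f32 scale per group of 32 consecutive weights in a row), and `Int4Matrix`, `quantize`, and `matvec` are hypothetical names, not the needle-rs API or on-disk format.

```rust
/// Hypothetical layout, chosen for illustration (not needle-rs's real format):
/// - weights are row-major, two 4-bit values per byte, low nibble first;
/// - each nibble stores q in 0..=15, which decodes to the signed value q - 8;
/// - every group of 32 consecutive weights in a row shares one f32 scale.
const GROUP: usize = 32;

pub struct Int4Matrix {
    pub rows: usize,
    pub cols: usize,      // must be a multiple of GROUP
    pub packed: Vec<u8>,  // rows * cols / 2 bytes
    pub scales: Vec<f32>, // rows * cols / GROUP scales
}

impl Int4Matrix {
    /// Quantize a row-major f32 matrix into the packed layout above.
    pub fn quantize(rows: usize, cols: usize, w: &[f32]) -> Self {
        assert_eq!(w.len(), rows * cols);
        assert_eq!(cols % GROUP, 0);
        let mut packed = vec![0u8; rows * cols / 2];
        let mut scales = Vec::with_capacity(rows * cols / GROUP);
        for (g, group) in w.chunks(GROUP).enumerate() {
            // Symmetric scale: the largest magnitude in the group maps to 7.
            let max_abs = group.iter().fold(0.0f32, |m, v| m.max(v.abs()));
            let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 7.0 };
            scales.push(scale);
            for (i, v) in group.iter().enumerate() {
                let q = ((v / scale).round().clamp(-8.0, 7.0) as i8 + 8) as u8;
                let idx = g * GROUP + i;
                if idx % 2 == 0 {
                    packed[idx / 2] |= q;
                } else {
                    packed[idx / 2] |= q << 4;
                }
            }
        }
        Self { rows, cols, packed, scales }
    }

    /// Compute y = W x without ever building the full f32 weight matrix.
    pub fn matvec(&self, x: &[f32]) -> Vec<f32> {
        assert_eq!(x.len(), self.cols);
        let groups_per_row = self.cols / GROUP;
        let mut y = vec![0.0f32; self.rows];
        for r in 0..self.rows {
            let mut acc = 0.0f32;
            for g in 0..groups_per_row {
                let scale = self.scales[r * groups_per_row + g];
                let mut partial = 0.0f32;
                for i in 0..GROUP {
                    let c = g * GROUP + i;
                    let idx = r * self.cols + c;
                    let byte = self.packed[idx / 2];
                    let nib = if idx % 2 == 0 { byte & 0x0F } else { byte >> 4 };
                    partial += (nib as i32 - 8) as f32 * x[c];
                }
                // Apply the scale once per group instead of once per weight.
                acc += scale * partial;
            }
            y[r] = acc;
        }
        y
    }
}
```

A SIMD kernel would presumably do the same work on many nibbles at once; the scalar form is only meant to show why no dequantized copy of the matrix is needed.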

Architecture deep-dive: ARCHITECTURE.md.
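
The constrained, greedy decoding described above can be sketched as a trie lookup followed by an argmax over the tokens that remain allowed. This is a simplified illustration under stated assumptions: it covers only the tool-name trie (not argument keys or the JSON state machine), it treats each vocabulary entry as a plain string piece, and `TrieNode` and `constrained_argmax` are hypothetical names rather than needle-rs internals.

```rust
use std::collections::HashMap;

/// One node of a character-level trie over valid tool names.
#[derive(Default)]
pub struct TrieNode {
    children: HashMap<char, TrieNode>,
    terminal: bool, // true if a complete tool name ends at this node
}

impl TrieNode {
    /// Build a trie from the list of tool names the caller passed in.
    pub fn from_names(names: &[&str]) -> Self {
        let mut root = TrieNode::default();
        for name in names {
            let mut node = &mut root;
            for ch in name.chars() {
                node = node.children.entry(ch).or_default();
            }
            node.terminal = true;
        }
        root
    }

    /// Follow `text` character by character; None if it leaves the trie.
    pub fn walk(&self, text: &str) -> Option<&TrieNode> {
        let mut node = self;
        for ch in text.chars() {
            node = node.children.get(&ch)?;
        }
        Some(node)
    }

    pub fn is_terminal(&self) -> bool {
        self.terminal
    }
}

/// Greedy step: among tokens whose text keeps `prefix` on a valid path,
/// return the id with the highest logit. Disallowed tokens are skipped,
/// which has the same effect as masking their logits to negative infinity.
pub fn constrained_argmax(
    root: &TrieNode,
    prefix: &str,
    vocab: &[&str],
    logits: &[f32],
) -> Option<usize> {
    let node = root.walk(prefix)?;
    let mut best: Option<(usize, f32)> = None;
    for (id, piece) in vocab.iter().enumerate() {
        if piece.is_empty() || node.walk(piece).is_none() {
            continue;
        }
        let score = logits[id];
        if best.map_or(true, |(_, b)| score > b) {
            best = Some((id, score));
        }
    }
    best.map(|(id, _)| id)
}
```

With tool names `get_weather` and `get_time`, a token such as `set_` can never be selected at the start of a name, however high its logit, because no path in the trie begins with it.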

A common failure mode for from-scratch model reimplementations is silent drift — outputs that look right but diverge at the third decimal place, with rare and untraceable downstream bugs. needle-rs rejects that. The Rust engine is required to produce the exact same token ID sequence as the Python/JAX reference on every input, at every decode step.

The test suite generates 560 inference examples by running the Python model on a diverse input set: five tool-name conventions (snake_case, camelCase, PascalCase, UPPER_SNAKE_CASE, kebab-case), parameter counts from 0 to 8, tool arrays from 1 to 20 entries, and a spread of natural-language query phrasings. For each example we capture the Python model's complete output token sequence. The Rust engine is then run on every example and required to produce the identical sequence.

560 / 560 pass. Not approximately — same argmax decision at every step.

Token-exact parity is checked on every CI run. Any change that drifts gets caught before merge. The reference vectors are committed to the repo, so the parity contract is version-pinned and reproducible without re-running Python:

Reference generator: tools/gen_e2e_vectors.py
Reference data: tests/e2e_vectors.json
Rust parity test: crates/needle-infer/tests/e2e_parity.rs
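
As a rough illustration of what a token-exact check involves, the sketch below compares an engine's token ID sequence against a reference and reports the first step where they differ. It is not the contents of `e2e_parity.rs`; the `Parity` type, the function names, and the use of `u32` for token IDs are assumptions made for the example.

```rust
/// Outcome of comparing one engine output against its reference sequence.
#[derive(Debug, PartialEq)]
pub enum Parity {
    Exact,
    /// First decode step where the sequences differ. `None` means that
    /// side had already ended, so a length mismatch also counts as drift.
    Diverged {
        step: usize,
        expected: Option<u32>,
        got: Option<u32>,
    },
}

/// Compare two token ID sequences step by step.
pub fn check_parity(expected: &[u32], got: &[u32]) -> Parity {
    let n = expected.len().max(got.len());
    for step in 0..n {
        let e = expected.get(step).copied();
        let g = got.get(step).copied();
        if e != g {
            return Parity::Diverged { step, expected: e, got: g };
        }
    }
    Parity::Exact
}

/// Count how many (reference, engine) pairs match exactly.
pub fn count_exact(cases: &[(Vec<u32>, Vec<u32>)]) -> usize {
    cases
        .iter()
        .filter(|(expected, got)| check_parity(expected, got) == Parity::Exact)
        .count()
}
```

Reporting the first diverging step, rather than a plain pass or fail, makes it easier to tell whether a regression comes from the encoder (step 0) or accumulates during decoding.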

Plus 55 unit tests on the constrained decoder covering edge cases the parity suite doesn't reach (empty tool arrays, parameter-less tools, name-collision under snake_case normalization, max-length queries).

JavaScript / TypeScript

engine.run(query, tools)                              // → string
engine.run_stream(query, tools, (id, piece) => {})    // per-token callback → final string
engine.run_batch([{ query, tools }, ...])             // → string[]
engine.encode_contrastive(text)                       // → Float32Array | null
engine.retrieve_tools(query, descriptionsJson, topK)  // semantic tool routing

Rust

engine.run(query, tools_json);
engine.run_stream(query, tools_json, |_id, piece| print!("{piece}"));
engine.run_batch(&[(q1, t1), (q2, t2)]);
engine.encode_contrastive(text);            // → Option<Vec<f32>>
engine.retrieve_tools(query, descs, k);     // → Vec<(usize, f32)>
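
For intuition about what semantic tool routing over embeddings can look like, here is a small sketch that ranks tool-description embeddings against a query embedding and returns `(index, score)` pairs, mirroring the `Vec<(usize, f32)>` shape above. Cosine similarity is an assumption; the source does not say which metric `retrieve_tools` uses, and `cosine` and `top_k` are hypothetical helpers, not part of the needle-rs API.

```rust
/// Cosine similarity between two vectors; 0.0 if either has zero norm.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 {
        0.0
    } else {
        dot / (na * nb)
    }
}

/// Rank tool embeddings by similarity to the query and keep the best `k`.
/// Returns (tool index, score) pairs, highest score first.
pub fn top_k(query: &[f32], tools: &[Vec<f32>], k: usize) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = tools
        .iter()
        .enumerate()
        .map(|(i, t)| (i, cosine(query, t)))
        .collect();
    // total_cmp gives a total order on f32, so NaN cannot panic the sort.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(k);
    scored
}
```

In practice the vectors would come from something like `encode_contrastive`, and the top-ranked tools would then be passed to `run` as a shorter tool list.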

C (and anything with FFI)

#include "needle.h"

NeedleHandle h  = needle_load("needle.safetensors", "vocab.txt");
const char *out = needle_run(h, query, tools_json);
printf("%s\n", out);
needle_free_str((char *)out);
needle_free(h);

Full header: crates/needle-c/include/needle.h. Null-safe throughout; errors are reported through the thread-local needle_last_error().

Intel i7-1185G7 (Tiger Lake, 4-core), Linux, release build, median of 5 runs.

End-to-end (load + infer)	`283 ms`
Warm inference only	`~80 ms`
INT4 matvec 512×512 (AVX2)	`83 µs · 3.2 Gelem/s`
INT4 matvec 2048×512 (AVX2)	`311 µs · 3.1 Gelem/s`

Apple Silicon NEON path is implemented but unbenchmarked — M-series numbers welcome via PR.

Footprint, stripped release:

WASM module	`258 KB`
CLI binary	`533 KB`
C shared library	`557 KB`
Weights (INT4 SafeTensors)	`22 MB`
Runtime dependencies	`1` (libm; WASM adds wasm-bindgen)

Full methodology and raw numbers: BENCHMARKS.md.

✓ Browser-side intent routing: decide which API to call before making the network request. Sub-second, zero servers.

✓ Edge function dispatch: tool calling inside Cloudflare Workers, Vercel Edge, Deno Deploy, or anywhere else with a WASM runtime.

✓ On-device privacy: user queries never leave the browser tab. Useful for healthcare, legal, and any context where sending text to OpenAI is a non-starter.

✓ Embedded agents: the `no_std` core means the kernels run on microcontrollers with enough RAM for the weights.

What it's not good for: open-ended chat, long-context reasoning, anything where you'd reach for a >1B-parameter model. Needle is a router, not a generalist.

Needle is designed and trained by Henry Ndubuaku and the Cactus Compute team. The model architecture, training code, dataset, and weights are entirely their work, released under MIT. needle-rs is an independent Rust runtime — no upstream code is copied, only the published architecture is implemented.

If you find this useful, please star the upstream Needle repo as well.

@software{needle2026,
  author  = {Ndubuaku, Henry and {Cactus Compute}},
  title   = {Needle: A 26M-Parameter Tool-Calling Transformer},
  year    = {2026},
  url     = {https://github.com/cactus-compute/needle},
  license = {MIT}
}

@software{needlers2026,
  author  = {Ibrahim, Abdalrahman},
  title   = {needle-rs: Pure-Rust WASM Runtime for Needle},
  year    = {2026},
  url     = {https://github.com/geekgineer/needle-rs},
  license = {MIT}
}

MIT. See LICENSE. Model and weights by Cactus Compute, also MIT.
