Geekgineer/needle-rs



A pure-Rust + WebAssembly runtime for Needle by Cactus Compute — a 26M-parameter transformer that maps (query, tool list) → JSON call in one forward pass. Deploys to browsers, edge workers, CLIs, Python, and no_std embedded targets. No server, no API key.


Why this matters

Tool calling usually means either a paid API call or hundreds of megabytes on disk. needle-rs ships the whole agent in under 23 MB: a 258 KB runtime plus 22 MB of weights.

| Stack | Deploy size | Latency | Cost | Privacy |
|---|---|---|---|---|
| OpenAI function calling | SDK + hosted API | 300–800 ms | $ per token | leaves device |
| llama.cpp + 1B local model | 700 MB+ | varies | free | local |
| ONNX Runtime Web + a model | 8 MB + your model | varies | free | local |
| needle-rs + Needle | 258 KB + 22 MB | ~280 ms | free | local |

The same routing accuracy — at a fraction of the footprint.


Quick start

needle-rs is a community Rust + WASM runtime for Needle by Cactus Compute. The model architecture, training procedure, dataset, and weights are entirely their work, released under MIT. This project provides a deployment layer for contexts the official Python implementation cannot reach: browsers, edge workers, embedded systems, and binary-distribution use cases. If you build with this, you are building on Cactus's Needle — please credit them.

Browser / Node.js: npm install needle-rs

import init, { NeedleWasm } from "needle-rs";

await init();
const engine = NeedleWasm.load(weights, vocab);
engine.run("Book a flight from London to JFK tomorrow", toolsJson);
// → {"name":"book_flight","arguments":{...}}

Rust: cargo add needle-infer

use needle_infer::NeedleEngine;

let engine = NeedleEngine::load("needle.safetensors", "vocab.txt")?;
let result = engine.run(query, tools_json);
println!("{}", result.text);

Python: pip install needle-rs

from needle_rs import NeedleEngine

engine = NeedleEngine.load("needle.safetensors", "vocab.txt")
result = engine.run(query, tools_json)
# → [{"name":"book_flight","arguments":{...}}]

Get the weights

huggingface-cli download Abdalrahman/needle-rs-safetensors \
  needle.safetensors vocab.txt --local-dir weights/

Or load directly from a URL in the browser — no install step.


Where it runs

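| Target | Status | Binary |
|---|---|---|
| Browser (WASM) | supported | 258 KB |
| Node.js (WASM) | supported | 258 KB |
| Cloudflare Workers | supported | 258 KB |
| Linux / macOS / Windows CLI | supported | 533 KB |
| Python (native wheel) | supported | pip install needle-rs |
| C / C++ / Go / Swift (via FFI) | supported | 557 KB |
| no_std embedded (Rust) | supported | size varies |
| iOS / Android | not targeted (use Cactus) | |
| Apple NPU / Snapdragon NPU | not targeted (use Cactus) | |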

Cactus's official engine targets mobile and NPUs with hand-tuned ARM SIMD. needle-rs targets everywhere else. The two stacks are complementary.


How it works

Needle is a 26M-parameter encoder-decoder transformer with a small twist: it's trained to do exactly one thing — emit a function-call JSON object from a query and a tool list. That focus is why a model this small works at all.

1 Encoder–decoder SAN. The encoder reads the query and tool definitions once. The decoder then generates the output JSON token by token, attending to the encoder's cached KV, so each call needs only a single encoder pass.
2 INT4 quantization. All attention and FFN weights are packed 4-bit nibbles with per-32-row scales. Matvec dequantizes on the fly — the full f32 weight matrix is never materialized. AVX2 on x86_64, NEON on aarch64, scalar fallback for WASM.
3 Constrained decoding. A character-level trie over valid tool names and argument keys, plus a three-state JSON machine, masks logits at every step. Output is always syntactically valid JSON pointing at a real tool — never a hallucinated function name, never broken syntax.
4 Two schema formats. Accepts both the flat {"location": {"type": "string"}} style and OpenAI's {"type":"object","properties":{...}} style. The Python reference handles only the flat form.
5 Greedy by design. Tool calling is a routing task, not a generation task — temperature would only introduce errors. needle-rs is argmax-only and intentionally does not support stochastic sampling.
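
To make step 2 concrete, here is a minimal scalar sketch of an INT4 matrix-vector product that decodes nibbles as it goes and never builds an f32 copy of the weights. The layout is an assumption chosen for illustration (row-major, low nibble first, values stored with a +8 offset, one scale per 32 consecutive weights); the actual needle-rs packing and its SIMD kernels may differ. `Int4Matrix`, `quantize`, and `matvec` are hypothetical names.

```rust
/// Number of weights that share one scale (assumed grouping for this sketch).
const GROUP: usize = 32;

/// 4-bit weights, two per byte, with one f32 scale per group of 32 weights.
struct Int4Matrix {
    rows: usize,
    cols: usize,       // must be a multiple of GROUP
    packed: Vec<u8>,   // rows * cols / 2 bytes; even index in the low nibble
    scales: Vec<f32>,  // rows * cols / GROUP entries
}

impl Int4Matrix {
    /// Symmetric quantization of a row-major f32 matrix to the range -8..=7.
    fn quantize(w: &[f32], rows: usize, cols: usize) -> Self {
        assert_eq!(w.len(), rows * cols);
        assert_eq!(cols % GROUP, 0);
        let mut packed = vec![0u8; rows * cols / 2];
        let mut scales = Vec::with_capacity(rows * cols / GROUP);
        for (g, group) in w.chunks(GROUP).enumerate() {
            let max_abs = group.iter().fold(0.0f32, |m, v| m.max(v.abs()));
            let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 7.0 };
            scales.push(scale);
            for (i, v) in group.iter().enumerate() {
                let q = (v / scale).round().clamp(-8.0, 7.0) as i8;
                let nibble = (q + 8) as u8; // store as 0..=15
                let idx = g * GROUP + i;
                packed[idx / 2] |= nibble << ((idx % 2) * 4);
            }
        }
        Self { rows, cols, packed, scales }
    }

    /// y = W x, dequantizing each weight at the moment it is multiplied.
    fn matvec(&self, x: &[f32]) -> Vec<f32> {
        assert_eq!(x.len(), self.cols);
        let mut y = vec![0.0f32; self.rows];
        for r in 0..self.rows {
            let mut acc = 0.0f32;
            for g in 0..self.cols / GROUP {
                let base = r * self.cols + g * GROUP;
                let mut partial = 0.0f32;
                for i in 0..GROUP {
                    let idx = base + i;
                    let nibble = (self.packed[idx / 2] >> ((idx % 2) * 4)) & 0x0F;
                    partial += (nibble as i32 - 8) as f32 * x[g * GROUP + i];
                }
                // Apply the group scale once per 32 products.
                acc += partial * self.scales[base / GROUP];
            }
            y[r] = acc;
        }
        y
    }
}
```

Applying the scale once per group rather than once per weight keeps the inner loop to an integer decode and a multiply-add, which is the part a SIMD path would vectorize.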

Architecture deep-dive: ARCHITECTURE.md.
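
Steps 3 and 5 above combine into constrained greedy decoding: take the argmax, but only over tokens that keep the output on a valid path. The sketch below shows the trie half of that idea for tool names only. It is an illustration under assumptions, not the needle-rs decoder: the JSON state machine is omitted, the vocabulary is a toy list of string pieces, and `Trie`, `constrained_argmax`, `decode_name`, and `demo` are hypothetical names.

```rust
use std::collections::HashMap;

#[derive(Default)]
struct TrieNode {
    children: HashMap<char, TrieNode>,
    terminal: bool, // true if a complete tool name ends here
}

#[derive(Default)]
struct Trie {
    root: TrieNode,
}

impl Trie {
    fn insert(&mut self, word: &str) {
        let mut node = &mut self.root;
        for ch in word.chars() {
            node = node.children.entry(ch).or_default();
        }
        node.terminal = true;
    }

    /// Follow `text` character by character; None if it leaves the trie.
    fn walk(&self, text: &str) -> Option<&TrieNode> {
        let mut node = &self.root;
        for ch in text.chars() {
            node = node.children.get(&ch)?;
        }
        Some(node)
    }
}

/// Argmax over the tokens that keep `prefix` inside the trie.
/// Skipping a token here has the same effect as masking its logit to -inf.
fn constrained_argmax(trie: &Trie, prefix: &str, vocab: &[&str], logits: &[f32]) -> Option<usize> {
    let mut best: Option<usize> = None;
    for (id, piece) in vocab.iter().enumerate() {
        if piece.is_empty() || trie.walk(&format!("{prefix}{piece}")).is_none() {
            continue;
        }
        if best.map_or(true, |b| logits[id] > logits[b]) {
            best = Some(id);
        }
    }
    best
}

/// Greedy decode of one tool name from per-step logits.
fn decode_name(trie: &Trie, vocab: &[&str], steps: &[Vec<f32>]) -> String {
    let mut out = String::new();
    for logits in steps {
        match constrained_argmax(trie, &out, vocab, logits) {
            Some(id) => out.push_str(vocab[id]),
            None => break, // no token can extend the prefix
        }
    }
    out
}

/// Toy run: the raw argmax would pick "cancel" and "weather", neither valid here.
fn demo() -> String {
    let mut trie = Trie::default();
    for name in ["book_flight", "book_hotel", "get_weather"] {
        trie.insert(name);
    }
    let vocab = ["book", "_", "flight", "hotel", "get", "weather", "cancel"];
    let steps = vec![
        vec![2.0, 0.0, 0.0, 0.0, 1.0, 0.0, 9.0],
        vec![0.0, 0.5, 0.0, 0.0, 0.0, 0.0, 4.0],
        vec![0.0, 0.0, 1.0, 3.0, 0.0, 5.0, 0.0],
    ];
    decode_name(&trie, &vocab, &steps)
}
```

In this toy run `demo()` returns `book_hotel`, even though the highest raw logit at the first step belongs to `cancel`, which is not a prefix of any registered tool.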


Parity: 560/560 token-exact

A common failure mode for from-scratch model reimplementations is silent drift — outputs that look right but diverge at the third decimal place, with rare and untraceable downstream bugs. needle-rs rejects that. The Rust engine is required to produce the exact same token ID sequence as the Python/JAX reference on every input, at every decode step.

The test suite generates 560 inference examples by running the Python model on a diverse input set: five tool-name conventions (snake_case, camelCase, PascalCase, UPPER_SNAKE_CASE, kebab-case), parameter counts from 0 to 8, tool arrays from 1 to 20 entries, and a spread of natural-language query phrasings. For each example we capture the Python model's complete output token sequence. The Rust engine is then run on every example and required to produce the identical sequence.

560 / 560 pass. Not approximately — same argmax decision at every step.

Token-exact parity is checked on every CI run. Any change that drifts gets caught before merge. The reference vectors are committed to the repo, so the parity contract is version-pinned and reproducible without re-running Python.

Plus 55 unit tests on the constrained decoder covering edge cases the parity suite doesn't reach (empty tool arrays, parameter-less tools, name-collision under snake_case normalization, max-length queries).
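
The parity contract described above comes down to comparing two token ID sequences and reporting the first step where they differ. A minimal sketch of that comparison follows. The `u32` token type and the names `Parity`, `check_parity`, and `parity_report` are assumptions for illustration; this is not the project's actual test harness.

```rust
/// Outcome of comparing one example against its reference sequence.
#[derive(Debug, PartialEq)]
enum Parity {
    Exact,
    /// First decode step where the sequences differ, or where one ends early.
    DivergesAt(usize),
}

/// Compare the engine's token IDs against the committed reference IDs.
fn check_parity(reference: &[u32], actual: &[u32]) -> Parity {
    let common = reference.len().min(actual.len());
    for step in 0..common {
        if reference[step] != actual[step] {
            return Parity::DivergesAt(step);
        }
    }
    if reference.len() == actual.len() {
        Parity::Exact
    } else {
        Parity::DivergesAt(common)
    }
}

/// Returns (passed, total) over a set of (reference, actual) pairs.
fn parity_report(cases: &[(Vec<u32>, Vec<u32>)]) -> (usize, usize) {
    let passed = cases
        .iter()
        .filter(|(r, a)| check_parity(r, a) == Parity::Exact)
        .count();
    (passed, cases.len())
}
```

Reporting the diverging step rather than a plain pass/fail makes a drift easier to trace back to the decode step that produced it.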


API

JavaScript / TypeScript

engine.run(query, tools)                              // → string
engine.run_stream(query, tools, (id, piece) => {})    // per-token callback → final string
engine.run_batch([{ query, tools }, ...])             // → string[]
engine.encode_contrastive(text)                       // → Float32Array | null
engine.retrieve_tools(query, descriptionsJson, topK)  // semantic tool routing

Rust

engine.run(query, tools_json);
engine.run_stream(query, tools_json, |_id, piece| print!("{piece}"));
engine.run_batch(&[(q1, t1), (q2, t2)]);
engine.encode_contrastive(text);            // → Option<Vec<f32>>
engine.retrieve_tools(query, descs, k);     // → Vec<(usize, f32)>
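
For intuition about the `Vec<(usize, f32)>` returned by `retrieve_tools`, the sketch below ranks tool embeddings against a query embedding and keeps the top k as (index, score) pairs. Cosine similarity is an assumption here, as are the helper names `cosine` and `top_k`; the README does not state which similarity measure needle-rs uses internally.

```rust
/// Cosine similarity of two equal-length vectors; 0.0 if either is all zeros.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 {
        0.0
    } else {
        dot / (na * nb)
    }
}

/// Score every tool embedding against the query and keep the k best,
/// as (tool index, similarity) pairs sorted from best to worst.
fn top_k(query: &[f32], tools: &[Vec<f32>], k: usize) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = tools
        .iter()
        .enumerate()
        .map(|(i, t)| (i, cosine(query, t)))
        .collect();
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(k);
    scored
}
```

A caller could embed each tool description once, cache the vectors, and pass only the top-ranked tools to `run` when the full tool list is large.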

C (and anything with FFI)

#include <stdio.h>
#include "needle.h"

NeedleHandle h  = needle_load("needle.safetensors", "vocab.txt");
const char *out = needle_run(h, query, tools_json);
printf("%s\n", out);
needle_free_str((char *)out);
needle_free(h);

Full header: crates/needle-c/include/needle.h. Null-safe throughout; errors via thread-local needle_last_error().
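
A thread-local last-error slot is a common way to report failures across a C ABI without changing return types. The sketch below shows the general pattern on the Rust side. It is not taken from the needle-c crate: `demo_parse_len` and `demo_last_error` are hypothetical stand-ins, and a real exported symbol would also carry a no-mangle attribute.

```rust
use std::cell::RefCell;
use std::ffi::{c_char, CStr, CString};
use std::ptr;

thread_local! {
    // One error slot per thread, so concurrent callers do not overwrite each other.
    static LAST_ERROR: RefCell<Option<CString>> = RefCell::new(None);
}

fn set_last_error(msg: &str) {
    // Interior NUL bytes are replaced so the CString conversion cannot fail.
    let c = CString::new(msg.replace('\0', " ")).expect("NUL bytes were replaced");
    LAST_ERROR.with(|slot| *slot.borrow_mut() = Some(c));
}

/// Hypothetical entry point: returns the byte length of a UTF-8 C string,
/// or -1 after recording an error. A NULL input is handled, not dereferenced.
pub extern "C" fn demo_parse_len(input: *const c_char) -> i64 {
    if input.is_null() {
        set_last_error("input was NULL");
        return -1;
    }
    // Safety: the caller promises a valid NUL-terminated string.
    let s = unsafe { CStr::from_ptr(input) };
    match s.to_str() {
        Ok(text) => text.len() as i64,
        Err(_) => {
            set_last_error("input was not valid UTF-8");
            -1
        }
    }
}

/// Returns the calling thread's last error, or NULL if none was recorded.
/// The pointer stays valid until the next error is recorded on this thread.
pub extern "C" fn demo_last_error() -> *const c_char {
    LAST_ERROR.with(|slot| {
        let guard = slot.borrow();
        match guard.as_ref() {
            Some(msg) => msg.as_ptr(),
            None => ptr::null(),
        }
    })
}
```

Keeping the message in thread-local storage means the library owns the string, so the caller reads it without having to free it.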


Benchmarks

Intel i7-1185G7 (Tiger Lake, 4-core), Linux, release build, median of 5 runs.

| Measurement | Result |
|---|---|
| End-to-end (load + infer) | 283 ms |
| Warm inference only | ~80 ms |
| INT4 matvec 512×512 (AVX2) | 83 µs · 3.2 Gelem/s |
| INT4 matvec 2048×512 (AVX2) | 311 µs · 3.1 Gelem/s |

Apple Silicon NEON path is implemented but unbenchmarked — M-series numbers welcome via PR.
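
For anyone contributing numbers, a median-of-N timing loop can be sketched with the standard library alone. This is a simplified illustration, not the harness behind the figures above (see BENCHMARKS.md for the actual methodology); `median` and `bench` are hypothetical helper names.

```rust
use std::time::{Duration, Instant};

/// Median of the samples; for an even count this takes the upper middle value.
fn median(mut samples: Vec<Duration>) -> Option<Duration> {
    if samples.is_empty() {
        return None;
    }
    samples.sort();
    Some(samples[samples.len() / 2])
}

/// Run `work` `runs` times, timing each run, and return the median duration.
fn bench<F: FnMut()>(runs: usize, mut work: F) -> Option<Duration> {
    let samples: Vec<Duration> = (0..runs)
        .map(|_| {
            let start = Instant::now();
            work();
            start.elapsed()
        })
        .collect();
    median(samples)
}
```

Calling `bench(5, || { /* one inference */ })` would mirror the median-of-5 setup described above; the median is less sensitive than the mean to a single slow run caused by a cold cache or a scheduler hiccup.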

Footprint, stripped release:

| Item | Value |
|---|---|
| WASM module | 258 KB |
| CLI binary | 533 KB |
| C shared library | 557 KB |
| Weights (INT4 SafeTensors) | 22 MB |
| Runtime dependencies | 1 (libm; WASM adds wasm-bindgen) |

Full methodology and raw numbers: BENCHMARKS.md.


What it's good for

✓ Browser-side intent routing. Decide which API to call before making the network request. Sub-second, zero servers.

✓ Edge function dispatch. Tool calling inside Cloudflare Workers, Vercel Edge, Deno Deploy, or anywhere else with a WASM runtime.

✓ On-device privacy. User queries never leave the browser tab. Useful for healthcare, legal, and any context where sending text to OpenAI is a non-starter.

✓ Embedded agents. The no_std core means the kernels run on microcontrollers with enough RAM for the weights.

What it's not good for: open-ended chat, long-context reasoning, anything where you'd reach for a >1B-parameter model. Needle is a router, not a generalist.


Acknowledgements

Needle is designed and trained by Henry Ndubuaku and the Cactus Compute team. The model architecture, training code, dataset, and weights are entirely their work, released under MIT. needle-rs is an independent Rust runtime — no upstream code is copied, only the published architecture is implemented.

If you find this useful, please star the upstream Needle repo as well.


Citation

@software{needle2026,
  author  = {Ndubuaku, Henry and {Cactus Compute}},
  title   = {Needle: A 26M-Parameter Tool-Calling Transformer},
  year    = {2026},
  url     = {https://github.com/cactus-compute/needle},
  license = {MIT}
}

@software{needlers2026,
  author  = {Ibrahim, Abdalrahman},
  title   = {needle-rs: Pure-Rust WASM Runtime for Needle},
  year    = {2026},
  url     = {https://github.com/geekgineer/needle-rs},
  license = {MIT}
}

MIT — see LICENSE. Model and weights by Cactus Compute, also MIT.