inferis-ml

Run AI models in the browser. No server, no per-request cost, no data leaving the device.

Live Demo — try it in your browser.

import { createPool } from 'inferis-ml';
import { transformersAdapter } from 'inferis-ml/adapters/transformers';

const pool = await createPool({ adapter: transformersAdapter() });
const model = await pool.load<number[][]>('feature-extraction', {
  model: 'mixedbread-ai/mxbai-embed-xsmall-v1',
});

const embeddings = await model.run(['Hello world', 'Another sentence']);
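
The result above is typed as `number[][]`, one vector per input sentence, so the two sentences can be compared with ordinary array math. A minimal sketch of that comparison follows; `cosineSimilarity` is a hypothetical helper written for illustration and is not part of inferis-ml.

```typescript
// Hypothetical helper, not part of inferis-ml: cosine similarity of two vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction; treat its similarity as 0.
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}
```

With the snippet above, `cosineSimilarity(embeddings[0], embeddings[1])` would score how close the two sentences are, with values near 1 meaning very similar.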

Why

Existing browser runtimes (transformers.js, web-llm, onnxruntime-web) give you inference but leave everything else to you — worker management, postMessage boilerplate, model lifecycle, memory budgets, cross-tab dedup, WebGPU fallback, streaming.

inferis-ml handles all of it. You get a clean async API and focus on the product.

Problem	Without inferis-ml	With inferis-ml
UI freezes during inference	Main thread blocked	Runs in Web Workers
5 tabs = 5 model copies	10 GB RAM, browser crashes	`crossTab: true` — one shared copy
WebGPU not everywhere	Manual detection + swap	`defaultDevice: 'auto'`

Install

npm install inferis-ml

# Pick your adapter (peer deps):
npm install @huggingface/transformers   # transformersAdapter
npm install @mlc-ai/web-llm             # webLlmAdapter
npm install onnxruntime-web             # onnxAdapter

Quick Start

LLM Streaming

import { createPool } from 'inferis-ml';
import { webLlmAdapter } from 'inferis-ml/adapters/web-llm';

const pool = await createPool({
  adapter: webLlmAdapter(),
  defaultDevice: 'webgpu',
  maxWorkers: 1,
});

const llm = await pool.load<string>('text-generation', {
  model: 'Llama-3.2-3B-Instruct-q4f32_1-MLC',
  onProgress: ({ phase }) => console.log(phase),
});

const stream = llm.stream({
  messages: [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'Explain WebGPU in 3 sentences.' },
  ],
});

for await (const token of stream) {
  output.textContent += token;
}
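
The loop above appends each token directly to the DOM. If the full text is also needed afterwards (for chat history, say), the accumulation can be pulled into a small helper. This is a sketch only: `collectTokens` is a hypothetical name, and it assumes nothing about the stream beyond it being async-iterable over strings, as the `for await` loop above implies.

```typescript
// Hypothetical helper: drains any async iterable of string tokens.
async function collectTokens(
  tokens: AsyncIterable<string>,
  onToken?: (token: string, textSoFar: string) => void,
): Promise<string> {
  let text = '';
  for await (const token of tokens) {
    text += token;
    // Let the caller update the UI as each token arrives.
    onToken?.(token, text);
  }
  return text;
}
```

A call such as `await collectTokens(stream, (_t, text) => { output.textContent = text; })` would then replace the manual loop.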

Speech Transcription

const transcriber = await pool.load<{ text: string }>('automatic-speech-recognition', {
  model: 'openai/whisper-base',
  estimatedMemoryMB: 80,
});

const result = await transcriber.run(audioData);
console.log(result.text);

Abort Inference

const ctrl = new AbortController();
stopButton.onclick = () => ctrl.abort();

try {
  for await (const token of llm.stream(input, { signal: ctrl.signal })) {
    output.textContent += token;
  }
} catch (e) {
  // Handle cancellation only; let unrelated errors propagate.
  if (e instanceof Error && e.name === 'AbortError') output.textContent += ' [stopped]';
  else throw e;
}
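
The consumer-side half of this pattern can be sketched without the library. The wrapper below stops pulling from any async iterable once a signal is aborted and throws an error named `AbortError`, the name checked in the catch block above. It is an illustration, not how inferis-ml implements cancellation (the library presumably also has to stop work inside the worker); `untilAborted` and `SignalLike` are hypothetical names.

```typescript
// Minimal shape of an AbortSignal; a real AbortSignal satisfies it.
interface SignalLike {
  readonly aborted: boolean;
}

// Hypothetical wrapper: stops consuming `source` once the signal is aborted.
async function* untilAborted<T>(
  source: AsyncIterable<T>,
  signal: SignalLike,
): AsyncGenerator<T> {
  for await (const item of source) {
    if (signal.aborted) {
      const err = new Error('The operation was aborted');
      err.name = 'AbortError';
      // Throwing here also closes the underlying iterator.
      throw err;
    }
    yield item;
  }
}
```

Note that the check runs when the next item arrives, so an abort takes effect at the next token boundary rather than immediately.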

Cross-Tab Deduplication

const pool = await createPool({
  adapter: transformersAdapter(),
  crossTab: true, // SharedWorker > leader election > per-tab fallback
});
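
The comment above describes a three-tier fallback. The selection logic could look roughly like the sketch below. It assumes leader election relies on the Web Locks API (`navigator.locks`), which the source does not state; `pickCrossTabTier`, `CrossTabTier` and `ScopeLike` are hypothetical names.

```typescript
type CrossTabTier = 'shared-worker' | 'leader-election' | 'per-tab';

// Subset of the global scope that the check reads.
interface ScopeLike {
  SharedWorker?: unknown;
  navigator?: { locks?: unknown };
}

function pickCrossTabTier(scope: ScopeLike): CrossTabTier {
  // Tier 1: one SharedWorker serves every tab.
  if (typeof scope.SharedWorker === 'function') return 'shared-worker';
  // Tier 2 (assumption): leader election is built on the Web Locks API.
  if (scope.navigator?.locks !== undefined) return 'leader-election';
  // Tier 3: each tab loads its own copy.
  return 'per-tab';
}
```

Taking the scope as a parameter keeps the function testable outside a browser; in a page it would be called with the global object.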

Model State Changes

model.onStateChange((state) => {
  if (state === 'loading')  showSpinner();
  if (state === 'ready')    hideSpinner();
  if (state === 'error')    showError('Failed to load model');
  if (state === 'disposed') disableUI();
});
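
The callback above handles four states, while the `state` field in the API reference lists seven. A sketch of an exhaustive mapping is shown below, so that the compiler flags any state left unhandled. The state names are copied from this README; `describeState` and its label strings are hypothetical.

```typescript
// State names copied from the ModelHandle `state` description in this README.
type ModelState =
  | 'idle' | 'loading' | 'ready' | 'inferring'
  | 'unloading' | 'error' | 'disposed';

// Hypothetical UI labels; the wording is illustrative only.
function describeState(state: ModelState): string {
  switch (state) {
    case 'idle': return 'Not loaded';
    case 'loading': return 'Loading model...';
    case 'ready': return 'Ready';
    case 'inferring': return 'Running...';
    case 'unloading': return 'Unloading...';
    case 'error': return 'Failed to load model';
    case 'disposed': return 'Model disposed';
    default: {
      // Compile-time check: this fails to type-check if a state is unhandled.
      const unreachable: never = state;
      return unreachable;
    }
  }
}
```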

Features

Runtime-agnostic — adapters for @huggingface/transformers, @mlc-ai/web-llm, onnxruntime-web, or your own
Zero framework deps — works with React, Vue, Svelte, or vanilla JS
WebGPU -> WASM fallback — auto-detected or configured explicitly
Streaming — ReadableStream + for await for token-by-token output
Memory budget — LRU eviction when models exceed the configured cap
Cross-tab dedup — SharedWorker (tier 1), leader election (tier 2), per-tab (tier 3)
AbortController — cancel any in-flight inference
TypeScript — full type safety, generic output types
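
To make the memory-budget item concrete, here is a minimal sketch of LRU eviction under a megabyte cap. It illustrates the general technique and is not inferis-ml's internal code; `MemoryBudget` and its methods are hypothetical names.

```typescript
// Hypothetical budget tracker: illustrative only, not inferis-ml internals.
class MemoryBudget {
  // Map preserves insertion order, so the first key is the least recently used.
  private readonly models = new Map<string, number>();
  private readonly maxMemoryMB: number;
  private usedMB = 0;

  constructor(maxMemoryMB: number) {
    this.maxMemoryMB = maxMemoryMB;
  }

  // Mark a model as recently used by moving it to the end of the map.
  touch(id: string): void {
    const size = this.models.get(id);
    if (size === undefined) return;
    this.models.delete(id);
    this.models.set(id, size);
  }

  // Reserve space for a model; returns the ids evicted to make room.
  add(id: string, estimatedMemoryMB: number): string[] {
    if (estimatedMemoryMB > this.maxMemoryMB) {
      throw new Error(`${id} needs ${estimatedMemoryMB} MB, cap is ${this.maxMemoryMB} MB`);
    }
    const evicted: string[] = [];
    if (this.models.has(id)) {
      this.touch(id);
      return evicted;
    }
    for (const [oldId, size] of this.models) {
      if (this.usedMB + estimatedMemoryMB <= this.maxMemoryMB) break;
      this.models.delete(oldId);
      this.usedMB -= size;
      evicted.push(oldId);
    }
    this.models.set(id, estimatedMemoryMB);
    this.usedMB += estimatedMemoryMB;
    return evicted;
  }

  get used(): number {
    return this.usedMB;
  }
}
```

The sketch also shows why an accurate `estimatedMemoryMB` matters: the tracker only knows the sizes it is told, so an underestimate lets real usage exceed the cap.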

API Reference

`createPool(config)`

const pool = await createPool({
  adapter: transformersAdapter(),   // required
  workerUrl: new URL('inferis-ml/worker', import.meta.url),
  maxWorkers: Math.max(1, navigator.hardwareConcurrency - 1), // at least one worker, even on single-core devices
  maxMemoryMB: 2048,
  defaultDevice: 'auto',           // 'webgpu' | 'wasm' | 'auto'
  crossTab: false,
  taskTimeout: 120_000,
});
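
The `taskTimeout` option suggests that each task is bounded in time; the unit is presumably milliseconds, though the source does not say. The general technique can be sketched with `Promise.race`. `withTimeout` is a hypothetical helper, not a library export.

```typescript
// Hypothetical helper: rejects if `task` has not settled within `ms` milliseconds.
function withTimeout<T>(task: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => reject(new Error(`Task timed out after ${ms} ms`)), ms);
  });
  // Whichever settles first wins; the timer is cleared either way.
  return Promise.race([task, timeout]).finally(() => clearTimeout(timer));
}
```

A timeout on its own only stops waiting. It does not stop the underlying work, which is why it pairs naturally with an `AbortSignal`.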

`pool.load<TOutput>(task, config)`

Loads a model and returns a ModelHandle. If already loaded, returns the existing handle.

const model = await pool.load<number[][]>('feature-extraction', {
  model: 'mixedbread-ai/mxbai-embed-xsmall-v1',
  estimatedMemoryMB: 30,
  onProgress: (p) => { ... },
});
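
Returning the existing handle for a repeated load can be sketched as a promise cache keyed by `task:model`, the ID format given for `ModelHandle.id`. This shows the idea only and is not the library's implementation; `LoadRegistry` is a hypothetical name.

```typescript
// Hypothetical registry; the `task:model` key follows the `id` format in this README.
class LoadRegistry<T> {
  private readonly pending = new Map<string, Promise<T>>();
  private readonly loader: (task: string, model: string) => Promise<T>;

  constructor(loader: (task: string, model: string) => Promise<T>) {
    this.loader = loader;
  }

  load(task: string, model: string): Promise<T> {
    const id = `${task}:${model}`;
    const existing = this.pending.get(id);
    // Concurrent and repeated calls share one promise, so the model loads once.
    if (existing) return existing;
    const started = this.loader(task, model).catch((err) => {
      // Forget failed loads so a later call can retry.
      this.pending.delete(id);
      throw err;
    });
    this.pending.set(id, started);
    return started;
  }
}
```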

`ModelHandle<TOutput>`

Method	Description
`run(input, options?)`	Non-streaming inference. Returns `Promise<TOutput>`.
`stream(input, options?)`	Streaming inference. Returns `ReadableStream<TOutput>`.
`dispose()`	Unload model and free memory.
`onStateChange(cb)`	Subscribe to state changes. Returns unsubscribe function.
`id`	Unique model ID (`task:model`).
`state`	Current state: `idle | loading | ready | inferring | unloading | error | disposed`.
`memoryMB`	Approximate memory usage.
`device`	Resolved device: `webgpu` or `wasm`.

`InferenceOptions`

interface InferenceOptions {
  signal?: AbortSignal;
  priority?: 'high' | 'normal' | 'low';
}
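
One plausible reading of `priority` is that queued tasks are served high first, then normal, then low, in arrival order within each level. The sketch below implements that reading. The ordering, the `'normal'` default and the `PriorityTaskQueue` name are all assumptions, not documented behavior.

```typescript
type Priority = 'high' | 'normal' | 'low';
const PRIORITY_ORDER: readonly Priority[] = ['high', 'normal', 'low'];

// Hypothetical queue: higher priority first, FIFO within the same priority.
class PriorityTaskQueue<T> {
  private readonly lanes: Record<Priority, T[]> = { high: [], normal: [], low: [] };

  // Assumption: tasks without an explicit priority count as 'normal'.
  enqueue(task: T, priority: Priority = 'normal'): void {
    this.lanes[priority].push(task);
  }

  dequeue(): T | undefined {
    for (const p of PRIORITY_ORDER) {
      if (this.lanes[p].length > 0) return this.lanes[p].shift();
    }
    return undefined;
  }

  get size(): number {
    return PRIORITY_ORDER.reduce((n, p) => n + this.lanes[p].length, 0);
  }
}
```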

`detectCapabilities()`

import { detectCapabilities } from 'inferis-ml';

const caps = await detectCapabilities();
if (caps.webgpu.supported) {
  console.log('GPU vendor:', caps.webgpu.adapter?.vendor);
} else {
  console.log('WASM SIMD:', caps.wasm.simd);
}
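
The same capability object could drive the `'auto'` device choice. The sketch below reads only `caps.webgpu.supported`, the one field used above. Whether an explicit `'webgpu'` request fails or silently falls back in inferis-ml is not stated in the source, so the throw is an assumption; `resolveDevice` and `CapsLike` are hypothetical names.

```typescript
type Device = 'webgpu' | 'wasm';
type DevicePreference = Device | 'auto';

// Only the field read here; it mirrors `caps.webgpu.supported` above.
interface CapsLike {
  webgpu: { supported: boolean };
}

function resolveDevice(preference: DevicePreference, caps: CapsLike): Device {
  if (preference === 'wasm') return 'wasm';
  if (caps.webgpu.supported) return 'webgpu';
  // Assumption: an explicit WebGPU request fails loudly instead of degrading.
  if (preference === 'webgpu') {
    throw new Error('WebGPU was requested but is not available');
  }
  // 'auto' falls back to WASM.
  return 'wasm';
}
```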

Custom Adapter

import type { ModelAdapter, ModelAdapterFactory } from 'inferis-ml';

export function myCustomAdapter(): ModelAdapterFactory {
  return {
    name: 'my-adapter',

    async create(): Promise<ModelAdapter> {
      const { MyRuntime } = await import('my-runtime');

      return {
        name: 'my-adapter',

        estimateMemoryMB(_task, config) {
          // Fall back to 50 MB when the caller gives no estimate.
          return (config.estimatedMemoryMB as number | undefined) ?? 50;
        },

        async load(task, config, device, onProgress) {
          onProgress({ phase: 'loading', loaded: 0, total: 1 });
          const instance = await MyRuntime.load(config.model as string, { device });
          onProgress({ phase: 'done', loaded: 1, total: 1 });
          return { instance, memoryMB: 50 };
        },

        async run(model, input) {
          return (model.instance as MyRuntime).infer(input);
        },

        async stream(model, input, onChunk) {
          for await (const chunk of (model.instance as MyRuntime).stream(input)) {
            onChunk(chunk);
          }
        },

        async unload(model) {
          await (model.instance as MyRuntime).dispose();
        },
      };
    },
  };
}

Framework Integrations

Official bindings with idiomatic APIs for popular frameworks:

Package	Install	Docs
inferis-react	`npm i inferis-react`	README
inferis-vue	`npm i inferis-vue`	README
inferis-svelte	`npm i inferis-svelte`	README

Each package provides context/provider setup, model lifecycle management, streaming, capability detection, and memory monitoring -- all wired into the framework's reactivity system.

// React
const { text, start } = useStream(model);

// Vue
const { text, start } = useStream(model);

// Svelte
const { text, start } = useStream(model);  // $text in template

Bundler & Framework Setup

inferis-ml is browser-only. In SSR frameworks, ensure initialization runs only on the client.

Vite

// vite.config.ts
export default {
  worker: { format: 'es' },
};

webpack 5

// webpack.config.js
module.exports = {
  experiments: { asyncWebAssembly: true },
};

Next.js

'use client';

import { useEffect, useState } from 'react';
import type { WorkerPoolInterface } from 'inferis-ml';

export default function AI() {
  const [pool, setPool] = useState<WorkerPoolInterface | null>(null);

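  useEffect(() => {
    // Load both modules on the client only, then build the pool.
    Promise.all([
      import('inferis-ml'),
      import('inferis-ml/adapters/transformers'),
    ])
      .then(([{ createPool }, { transformersAdapter }]) =>
        createPool({ adapter: transformersAdapter() }),
      )
      .then(setPool);
  }, []);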

  if (!pool) return <p>Loading...</p>;
  // use pool
}

Nuxt

<template>
  <ClientOnly>
    <InferenceComponent />
  </ClientOnly>
</template>

// composables/useInferis.ts
export async function useInferis() {
  const { createPool } = await import('inferis-ml');
  const { transformersAdapter } = await import('inferis-ml/adapters/transformers');
  return createPool({ adapter: transformersAdapter() });
}

SvelteKit

import { browser } from '$app/environment';

let pool;
if (browser) {
  const { createPool } = await import('inferis-ml');
  const { transformersAdapter } = await import('inferis-ml/adapters/transformers');
  pool = await createPool({ adapter: transformersAdapter() });
}

Popular Models

Models download from Hugging Face Hub on first use and are cached in the browser's Cache API. Subsequent loads read from the cache, skip the network, and work offline.

Embeddings / Semantic Search

Model	Size	Notes
`mixedbread-ai/mxbai-embed-xsmall-v1`	23 MB	Best quality/size for English
`Xenova/all-MiniLM-L6-v2`	23 MB	Popular general-purpose English model
`Xenova/multilingual-e5-small`	118 MB	100+ languages

Text Generation (LLM)

Requires @mlc-ai/web-llm + defaultDevice: 'webgpu'.

Model	Size	Notes
`Llama-3.2-1B-Instruct-q4f32_1-MLC`	0.8 GB	Fastest
`Llama-3.2-3B-Instruct-q4f32_1-MLC`	2 GB	Good balance
`Phi-3.5-mini-instruct-q4f16_1-MLC`	2.2 GB	Strong reasoning
`gemma-2-2b-it-q4f16_1-MLC`	1.5 GB	Fast on mobile GPU

Speech Recognition

Model	Size	Notes
`openai/whisper-tiny`	39 MB	Fastest
`openai/whisper-base`	74 MB	Good balance
`openai/whisper-small`	244 MB	Better accuracy

Text Classification

Model	Size	Notes
`Xenova/distilbert-base-uncased-finetuned-sst-2-english`	67 MB	Sentiment
`Xenova/toxic-bert`	438 MB	Toxicity detection

Translation

Model	Size	Notes
`Xenova/opus-mt-en-ru`	74 MB	EN -> RU
`Xenova/opus-mt-ru-en`	74 MB	RU -> EN
`Xenova/nllb-200-distilled-600M`	600 MB	200 languages

Image Classification

Model	Size	Notes
`Xenova/efficientnet-lite4`	13 MB	Fastest, 1000 classes
`Xenova/mobilevit-small`	22 MB	Mobile-friendly

Model Sources

Models are not locked to Hugging Face. Each adapter has its own sources:

transformers.js — HF Hub ID or any direct URL
web-llm — MLC registry, or register custom models
onnxruntime-web — direct URL to .onnx file
Custom adapter — load from anywhere (fetch, IndexedDB, bundled)

Caching

First visit:  download -> Cache API -> run  (5-60s)
Next visits:  Cache API -> run              (1-3s, no network)
Offline:      Cache API -> run              (works without internet)
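
The three rows above amount to a cache-first lookup. A minimal sketch of that flow follows, written against an injected store so it can run outside a browser. `ModelStore` and `loadModelBytes` are hypothetical names; in a page, the store could be a thin wrapper over the Cache API.

```typescript
// Minimal async key-value store; hypothetical, standing in for the Cache API.
interface ModelStore {
  get(url: string): Promise<Uint8Array | undefined>;
  set(url: string, bytes: Uint8Array): Promise<void>;
}

interface LoadResult {
  bytes: Uint8Array;
  source: 'cache' | 'network';
}

async function loadModelBytes(
  url: string,
  store: ModelStore,
  download: (url: string) => Promise<Uint8Array>,
): Promise<LoadResult> {
  // Next visits and offline use: serve from the store without touching the network.
  const cached = await store.get(url);
  if (cached) return { bytes: cached, source: 'cache' };
  // First visit: download, then persist for later loads.
  const bytes = await download(url);
  await store.set(url, bytes);
  return { bytes, source: 'network' };
}
```

Because the store is checked first, a failing `download` only matters on the very first visit, which is what makes offline use possible.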

Browser Support

Feature	Chrome	Firefox	Safari	Edge
Core (Worker + WASM)	57+	52+	11+	16+
WebGPU	113+	141+	26+	113+
WASM SIMD	91+	89+	16.4+	91+
SharedWorker	4+	29+	16+	79+
Leader Election	69+	96+	15.4+	79+

Minimum: Web Workers + WebAssembly (97%+ of browsers). All advanced features are progressive enhancements.

Performance Tips

maxWorkers: 1 for GPU-bound workloads (LLMs)
defaultDevice: 'webgpu' when targeting modern hardware
estimatedMemoryMB for accurate LRU eviction
crossTab: true for multi-tab apps (chat, editors)
Reuse ModelHandle — re-loading a ready model is a no-op

When To Use

Use case	Fit?
Semantic search, chatbot, speech, classification, translation	Yes
Private data (never leaves device)	Yes
Offline after first load	Yes
Server-side batch processing	No
Models > 4 GB	No

License

MIT
