Offload tasks from Claude to your local LLM with Houtini LM. Uses the OpenAI-compatible API for LM Studio and Ollama compatibility. Save tokens by offloading grunt work; the tool descriptions help Claude decide what work to assign and why.


@houtini/lm


MCP server that connects Claude to any OpenAI-compatible LLM endpoint — LM Studio, Ollama, vLLM, llama.cpp, or any remote API.

Offload routine work to a local model. Keep your Claude context window for the hard stuff.

Why

Claude is great at orchestration and reasoning. Local models are great at bulk analysis, classification, extraction, and summarisation. This server lets Claude delegate to a local model on the fly — no API keys, no cloud round-trips, no context wasted.

Common use cases:

  • Classify or tag hundreds of items without burning Claude tokens
  • Extract structured data from long documents
  • Run a second opinion on generated code
  • Summarise research before Claude synthesises it
  • Delegate code review to a local model while Claude handles other work

What's new in v2.1.0

  • Smarter tool descriptions — tool descriptions now encode prompting best practices for local LLMs, so Claude automatically sends well-structured prompts (complete code, capped output tokens, explicit format instructions)
  • New code_task tool — purpose-built for code analysis with an optimised system prompt and sensible defaults (temp 0.2, 500 token cap)
  • Delegation guidance — each tool description tells Claude when to use it, what output to expect, and what to avoid (e.g. never send truncated code to a local model)

Install

Claude Code (recommended)

```shell
claude mcp add houtini-lm -e LM_STUDIO_URL=http://localhost:1234 -- npx -y @houtini/lm
```

Claude Desktop

Add to claude_desktop_config.json:

```json
{
  "mcpServers": {
    "houtini-lm": {
      "command": "npx",
      "args": ["-y", "@houtini/lm"],
      "env": {
        "LM_STUDIO_URL": "http://localhost:1234"
      }
    }
  }
}
```

npx (standalone)

```shell
npx @houtini/lm
```

Configuration

Set via environment variables or in your MCP client config:

| Variable | Default | Description |
|----------|---------|-------------|
| `LM_STUDIO_URL` | `http://localhost:1234` | Base URL of the OpenAI-compatible API |
| `LM_STUDIO_MODEL` | (auto-detect) | Model identifier — leave blank to use whatever's loaded |
| `LM_STUDIO_PASSWORD` | (none) | Bearer token for authenticated endpoints |
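For example, pointing the server at a password-protected remote endpoint might look like this in `claude_desktop_config.json` (the URL and token below are placeholders, not real values):

```json
{
  "mcpServers": {
    "houtini-lm": {
      "command": "npx",
      "args": ["-y", "@houtini/lm"],
      "env": {
        "LM_STUDIO_URL": "https://llm.example.com",
        "LM_STUDIO_PASSWORD": "your-bearer-token"
      }
    }
  }
}
```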

Tools

chat

Delegate a bounded task to the local LLM. The workhorse for quick questions, code explanation, and pattern recognition.

```
message (required) — the task, with explicit output format instructions
system             — persona (be specific: "Senior TypeScript dev", not "helpful assistant")
temperature        — 0.1 for code, 0.3 for analysis (default), 0.5 for suggestions
max_tokens         — match to expected output: 150 for quick answers, 300 for explanations, 500 for code gen
```

Tip: Always send complete code — local models hallucinate details for truncated input.
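An illustrative set of arguments Claude might pass for a bulk-classification task (the tickets and persona below are made-up values, chosen to follow the guidance above):

```json
{
  "message": "Classify each of these 3 support tickets as bug, feature-request, or question. Reply as a numbered list.\n\n1. App crashes on login\n2. Please add dark mode\n3. How do I export my data?",
  "system": "Senior support triage analyst",
  "temperature": 0.3,
  "max_tokens": 150
}
```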

custom_prompt

Structured 3-part prompt with separate system, context, and instruction fields. The separation prevents context bleed in local models — better results than stuffing everything into a single message.

```
instruction (required) — what to produce (under 50 words works best)
system                 — persona, specific and under 30 words
context                — COMPLETE data to analyse (never truncated)
temperature            — 0.1 for review, 0.3 for analysis (default)
max_tokens             — 200 for bullets, 400 for detailed review, 600 for code gen
```
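A sketch of how the three fields might be filled in for a review task (all values illustrative; the context placeholder stands for the full pasted module):

```json
{
  "system": "Senior Python reviewer focused on error handling",
  "context": "<the complete module under review, pasted in full>",
  "instruction": "List every place an exception can escape unhandled, as bullet points",
  "temperature": 0.1,
  "max_tokens": 200
}
```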

code_task

Purpose-built for code analysis. Wraps the local LLM with an optimised code-review system prompt and low temperature (0.2).

```
code (required)     — complete source code (never truncate)
task (required)     — what to do: "Find bugs", "Explain this function", "Add error handling"
language            — "typescript", "python", "rust", etc.
max_tokens          — default 500 (200 for quick answers, 800 for code generation)
```
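An illustrative call (the snippet is made up, and contains a deliberate empty-list bug for the model to find):

```json
{
  "code": "def mean(xs):\n    return sum(xs) / len(xs)",
  "task": "Find bugs",
  "language": "python",
  "max_tokens": 200
}
```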

The local LLM excels at: explaining code, finding common bugs, suggesting improvements, comparing patterns, generating boilerplate.

It struggles with: subtle/adversarial bugs, multi-file reasoning, design tasks requiring integration.

list_models

Returns the models currently loaded on the LLM server.

health_check

Checks connectivity. Returns response time, auth status, and loaded model count.

Performance guide

At typical local LLM speeds (~3-4 tokens/second on consumer hardware):

| max_tokens | Response time | Best for |
|------------|---------------|----------|
| 150 | ~45 seconds | Quick questions, classifications |
| 300 | ~100 seconds | Code explanations, summaries |
| 500 | ~170 seconds | Code review, generation |

Set max_tokens to match your expected output — lower values mean faster responses.

Compatible endpoints

| Provider | URL | Notes |
|----------|-----|-------|
| LM Studio | `http://localhost:1234` | Default, zero config |
| Ollama | `http://localhost:11434` | Use OpenAI-compatible mode |
| vLLM | `http://localhost:8000` | Native OpenAI API |
| llama.cpp | `http://localhost:8080` | Server mode |
| Remote / cloud APIs | Any URL | Set `LM_STUDIO_URL` + `LM_STUDIO_PASSWORD` |
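For example, targeting a local Ollama instance instead of LM Studio might look like this with Claude Code (assuming Ollama is running on its default port):

```shell
claude mcp add houtini-lm -e LM_STUDIO_URL=http://localhost:11434 -- npx -y @houtini/lm
```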

Development

```shell
git clone https://github.com/houtini-ai/lm.git
cd lm
npm install
npm run build
```

Run the test suite against a live LLM server:

```shell
node test.mjs
```

License

MIT
