Autonomous prompt optimizer. Give it a prompt and test cases — it will iteratively mutate, score, and improve your prompt using the Karpathy autoresearch pattern.
Try the live demo at honeprompt.vercel.app
| Score | |
|---|---|
| Original prompt (hand-written) | 55 / 100 |
| Optimized prompt (15 iterations, $1.40) | 82 / 100 |
| Feature | HonePrompt | DSPy | PromptFoo | OpenAI Optimizer |
|---|---|---|---|---|
| Optimizes individual prompts | Yes | No (LLM programs) | No (testing only) | Yes |
| TypeScript-native | Yes | No (Python) | Yes | No (Python) |
| Works with any model | Yes | Yes | Yes | OpenAI only |
| CLI + Web UI | Yes | CLI only | CLI + UI | API only |
| Mutation strategies | 6 strategies | Bootstrapping | N/A | Gradient-free |
| Multi-dimensional rubrics | Yes | No | No | No |
| Multimodal (vision + image-gen) | Yes | No | No | No |
| Resume / continue runs | Yes | No | No | No |
| Real-time progress | SSE streaming | No | No | No |
| Self-hostable | Yes | Yes | Yes | No |
honeprompt init linkedin-hooks
honeprompt run
Load prompt.md + test-cases.json
|
+----v----+
| Baseline | Execute prompt against all test cases, score with LLM judge
+----+----+
|
+----v--------------------------+
| Loop (N iterations) |
| |
| 1. Failure report | Identify lowest-scoring test cases
| 2. Mutate | Optimizer agent picks a strategy
| 3. Re-score | Execute mutated prompt, judge outputs
| 4. Keep or revert | Score improved? Keep. Otherwise revert.
| 5. Log to history.jsonl |
| |
+----+--------------------------+
|
+----v----+
| Output | Optimized prompt.md + progress.png + report.json
+---------+
npm install -g honeprompthoneprompt init linkedin-hooks
cd linkedin-hooksThis creates:
prompt.md— the prompt to optimizetest-cases.json— test cases with inputs and expected outputshoneprompt.config.ts— model selection, budget, scoring criteriaprogram.md— strategy document that shapes optimizer behavior
export ANTHROPIC_API_KEY=sk-ant-...
honeprompt runhoneprompt eval// honeprompt.config.ts
import type { HonePromptConfig } from "honeprompt";
const config: HonePromptConfig = {
// Model that executes your prompt
targetModel: {
provider: "anthropic",
model: "claude-sonnet-4-5-20250929",
},
// Model that generates mutations
optimizerModel: {
provider: "anthropic",
model: "claude-sonnet-4-5-20250929",
},
// Model that judges output quality
judgeModel: {
provider: "anthropic",
model: "claude-sonnet-4-5-20250929",
},
maxIterations: 25,
maxCostUsd: 5.0,
parallelTestCases: 5,
scoring: {
mode: "llm-judge",
criteria: "Score the output 0-100 on relevance, quality, and completeness.",
},
failureReportSize: 3,
targetScore: 90,
// Stop if N consecutive mutations are reverted (default: 5)
plateauThreshold: 5,
};
export default config;Works with any model via three providers:
| Provider | Models | Notes |
|---|---|---|
anthropic |
Claude Sonnet 4.5, Claude Opus 4.6, Claude Haiku 4.5 | API key required |
openai |
GPT-4o, GPT-4.1, GPT-4.1-mini, GPT-4.1-nano | API key required |
claude-cli |
Any model via Claude Code CLI | Zero-cost if on Claude Max plan |
Use different models for different roles — e.g., cheap model as target, expensive model as judge.
llm-judge— LLM scores each output against your criteria (default)programmatic— Your custom eval function scores outputshybrid— Weighted combination of both
For programmatic scoring, export a function:
// eval.ts
import type { TestCase } from "honeprompt";
export default function score(output: string, testCase: TestCase): number {
// Return 0-100
if (output.length > 200) return 30; // too long
if (!output.includes(testCase.input.split(" ")[0])) return 50; // missed topic
return 85;
};Then set scoring.evalPath: "./eval.ts" in your config.
Replace the single 0-100 score with weighted dimensions:
scoring: {
mode: "llm-judge",
dimensions: [
{ name: "accuracy", weight: 0.4, criteria: "Factual correctness" },
{ name: "tone", weight: 0.3, criteria: "Professional yet approachable" },
{ name: "format", weight: 0.3, criteria: "Clean markdown, scannable structure" },
],
},Each dimension is scored independently by the judge, then combined into a weighted composite score. The optimizer sees per-dimension breakdowns in failure reports, so it can target specific weaknesses.
The optimizer uses Claude tool use to apply structured mutations:
| Strategy | When Used |
|---|---|
sharpen |
Tighten vague instructions |
add_example |
Model misunderstands format/tone |
remove |
Contradictory or redundant rules |
restructure |
Information is buried or poorly ordered |
constrain |
Output goes off-track |
expand |
Instructions are under-specified |
One mutation per iteration. The optimizer learns from history — if a strategy was reverted, it tries a different approach.
Add a program.md file to guide the optimizer with domain knowledge, constraints, or preferred mutation patterns. The strategy document is prepended to the optimizer system prompt.
honeprompt run --strategy program.mdIf program.md exists in your project directory, it is auto-detected.
Include images in your test cases for vision-capable models:
{
"id": "chart-analysis",
"input": "Describe what this chart shows",
"images": ["https://example.com/chart.png"]
}Set mode: "image-gen" to optimize prompts for image generation models (DALL-E). The judge uses vision to score generated images against your criteria.
HonePrompt stops automatically when:
- Target score reached — score hits your
targetScorethreshold - Budget exhausted — total cost exceeds
maxCostUsd - Plateau detected — N consecutive mutations reverted (configurable via
plateauThreshold) - Cancelled — manual stop via CLI (Ctrl+C) or web UI
Stopped or cancelled runs can be resumed. The JSONL history file is append-only and crash-safe.
# CLI: resumes from the last saved state
honeprompt run --resumeIn the web UI, completed/cancelled/plateau runs show a "Continue Run" button.
Start with a ready-made template instead of building from scratch:
honeprompt init --list # Browse all templates
honeprompt init linkedin-hooks # Scaffold from a template| Template | Category | Description |
|---|---|---|
linkedin-hooks |
Creators | Scroll-stopping LinkedIn post hooks |
tiktok-hooks |
Creators | Spoken-word TikTok video hooks |
youtube-titles |
Creators | Click-worthy YouTube titles under 60 chars |
claude-md-optimizer |
Developers | Actionable CLAUDE.md files for AI assistants |
code-review-comments |
Developers | Constructive, specific code review comments |
commit-messages |
Developers | Conventional commit messages from diffs |
product-descriptions |
Marketers | Benefit-driven product copy |
seo-meta-descriptions |
Marketers | SEO meta descriptions under 160 chars |
cold-outreach-email |
SaaS | Personalized cold emails under 100 words |
customer-support-replies |
SaaS | Empathetic support replies with clear resolutions |
Browse templates in the web UI at honeprompt.vercel.app/templates.
Want to contribute a template? See CONTRIBUTING.md for guidelines. Copy templates/_template/ to get started.
honeprompt init [template] # Scaffold a project
honeprompt init --list # List all available templates
honeprompt run [options] # Run optimization loop
honeprompt eval [options] # Score current prompt (no optimization)
honeprompt generate-tests [opt] # Generate test cases from a prompt using AI
honeprompt diff [options] # Show diff between original and optimized prompt
honeprompt estimate [options] # Estimate cost for an optimization run
# Run options
-p, --prompt <path> # Prompt file (default: prompt.md)
-t, --tests <path> # Test cases file (default: test-cases.json)
-c, --config <path> # Config file (default: honeprompt.config.ts)
-o, --output <path> # Output directory (default: .honeprompt)
-n, --iterations <n> # Override max iterations
--budget <usd> # Override max cost
--strategy <path> # Strategy document (default: auto-detect program.md)
--resume # Resume a previous runimport { run, scorePrompt, generateMutation } from "honeprompt";
// Run the full optimization loop
const report = await run({
promptPath: "./prompt.md",
testCasesPath: "./test-cases.json",
config: { /* ... */ },
outputDir: "./.honeprompt",
});
console.log(`Improved ${report.baselineScore} -> ${report.finalScore}`);
console.log(`Stop reason: ${report.stopReason}`);After a run, .honeprompt/ contains:
history.jsonl— every iteration as a JSON line (append-only, crash-safe)progress.png— score chart with baseline, improvements, and revertsreport.json— full run summary with strategy stats
Add this badge to your project README to show your prompts were optimized with HonePrompt:
[](https://github.com/jerrysoer/honeprompt)Typical costs for a 25-iteration run with 10 test cases:
| Configuration | Estimated Cost |
|---|---|
| Sonnet for all three roles | $1-3 |
| Haiku target, Sonnet optimizer/judge | $0.50-1.50 |
| Sonnet target, Opus optimizer, Sonnet judge | $3-8 |
| Claude CLI (Max plan) for all roles | $0 |
Set maxCostUsd in config to cap spending. The loop stops when the budget is reached.
Use honeprompt estimate to preview costs before starting a run.
How is this different from DSPy? DSPy optimizes LLM programs (chains of calls). HonePrompt optimizes individual prompts — simpler scope, TypeScript-native, works with any model.
How is this different from PromptFoo? PromptFoo tests prompts. HonePrompt optimizes them. They are complementary — use PromptFoo to evaluate, HonePrompt to improve.
Does it work with local models?
Yes — set baseUrl in your model config to point to any OpenAI-compatible API (Ollama, vLLM, etc.).
Can I use it without an API key?
Yes — use the claude-cli provider with a Claude Max subscription for zero-cost optimization runs.
Can I resume a failed run?
Yes — use honeprompt run --resume in the CLI or the "Continue Run" button in the web UI. State is reconstructed from the crash-safe JSONL history file.
MIT