Benchmark real LLM API behavior before you commit to a provider, gateway, or deployment.
Docs · Quickstart · Providers · PyPI · Releases
Pricing pages and model cards do not answer the questions that matter in production:
- Which provider has the best TTFT for my prompt shape?
- What happens when I increase concurrency?
- Is my gateway faster than the upstream provider?
- Did latency regress after a deploy, region change, or model switch?
llm-gateway-bench gives you a repeatable CLI workflow for measuring those answers against real endpoints.
| Measure | Compare | Export |
|---|---|---|
| TTFT, total latency, p50/p95, throughput, success rate | Providers, gateways, regions, releases, self-hosted endpoints | Markdown, JSON, CSV, plus local run history |
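The p50/p95 figures in reports are order statistics over per-request latencies. A minimal sketch of that calculation using a nearest-rank percentile (illustrative only; not necessarily the exact method llm-gateway-bench uses):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample >= p% of the data."""
    ordered = sorted(samples)
    rank = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[rank]

# Per-request total latencies in milliseconds (made-up numbers).
latencies_ms = [220, 240, 210, 1900, 250, 230, 260, 245, 235, 255]

p50 = percentile(latencies_ms, 50)   # typical request
p95 = percentile(latencies_ms, 95)   # tail request
print(p50, p95)
```

Note how a single slow request barely moves p50 but dominates p95, which is why both are reported.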
| Use it for | Typical target |
|---|---|
| Provider evaluation | OpenAI, Anthropic, Gemini, Groq, DeepSeek, OpenRouter |
| Gateway validation | OpenAI-compatible relay layers and API gateways |
| Infra regression checks | Regional changes, load balancers, model rollouts, self-hosted serving |
```bash
pip install llm-gateway-bench
```

```bash
# See built-in provider defaults
lgb providers

# Benchmark one provider/model quickly
lgb run --provider openai --model gpt-5-mini --requests 20 --concurrency 3 \
  --prompt "Say hello in one sentence."

# Compare multiple providers from YAML
lgb compare example-bench.yaml --output report.md
```

| Command | Purpose |
|---|---|
| `lgb run` | Run a single provider/model benchmark from CLI flags |
| `lgb compare` | Run a multi-provider suite from `bench.yaml` |
| `lgb warmup` | Verify provider reachability before a full run |
| `lgb history` | List and compare saved historical runs |
| `lgb providers` | Show built-in provider defaults and env var names |
```yaml
prompts:
  - "Write a haiku about the ocean."

providers:
  - name: openai
    model: gpt-5-mini
    api_key: ${OPENAI_API_KEY}
  - name: gemini
    model: gemini-2.5-flash
    base_url: https://generativelanguage.googleapis.com/v1beta/openai/
    api_key: ${GEMINI_API_KEY}
  - name: deepseek
    model: deepseek-v3
    base_url: https://api.deepseek.com/v1
    api_key: ${DEEPSEEK_API_KEY}

settings:
  requests: 20
  concurrency: 3
  timeout: 30
```

```bash
lgb compare bench.yaml --output report.md --save
```

- Start with `lgb providers` to confirm defaults and environment variables.
- Run `lgb warmup bench.yaml` if you want a quick reachability check.
- Use `lgb run` while tuning a single provider or endpoint.
- Use `lgb compare` when you want a reproducible cross-provider report.
- Save runs and compare them later with `lgb history --compare <id1> <id2>`.
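The `${OPENAI_API_KEY}`-style references in the YAML are presumably resolved from the environment when the config is loaded. A hypothetical sketch of that kind of substitution (the function name and error behavior here are illustrative, not llm-gateway-bench's actual loader):

```python
import os
import re

def resolve_env(value: str) -> str:
    """Replace ${VAR} placeholders with environment values; fail loudly if unset."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in os.environ:
            raise KeyError(f"environment variable {name} is not set")
        return os.environ[name]
    return re.sub(r"\$\{([A-Z0-9_]+)\}", substitute, value)

os.environ["OPENAI_API_KEY"] = "sk-demo"   # for illustration only
print(resolve_env("${OPENAI_API_KEY}"))    # sk-demo
```

Failing fast on a missing variable is preferable to sending a literal `${OPENAI_API_KEY}` string as a credential.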
- Frontier APIs: OpenAI, Anthropic, Google Gemini
- Cost/performance providers: DeepSeek, Groq, Together, Fireworks, OpenRouter, Mistral, Cohere, Perplexity
- China-focused providers: DashScope, SiliconFlow, Zhipu, Moonshot, Baidu, 01AI, MiniMax
- Local and self-hosted endpoints: Ollama, vLLM, LM Studio
- Any OpenAI-compatible endpoint via `--base-url` or YAML `base_url`
The full provider matrix lives in docs/providers.md.
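Self-hosted endpoints work the same way through `base_url`. A sketch of a `bench.yaml` provider entry pointing at a local Ollama server (the port is Ollama's default; the model name is an assumption and must match something you have pulled locally):

```yaml
providers:
  - name: ollama
    model: llama3.1                       # assumed: replace with a locally pulled model
    base_url: http://localhost:11434/v1   # Ollama's default OpenAI-compatible endpoint
    api_key: ollama                       # Ollama ignores the key; any placeholder works
```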
```text
┌─────────────────┬──────────────────────┬──────────┬────────────┬──────────────┐
│ Provider        │ Model                │ TTFT (ms)│ Total (ms) │ Tokens/sec   │
├─────────────────┼──────────────────────┼──────────┼────────────┼──────────────┤
│ openai          │ gpt-5-mini           │ 198      │ 1240       │ 94.5         │
│ anthropic       │ claude-haiku-4       │ 312      │ 1680       │ 76.2         │
│ gemini          │ gemini-2.5-flash     │ 280      │ 1520       │ 82.1         │
│ deepseek        │ deepseek-v3          │ 720      │ 2800       │ 48.3         │
│ groq            │ llama-3.3-70b        │ 95       │ 880        │ 210.5        │
└─────────────────┴──────────────────────┴──────────┴────────────┴──────────────┘
```
- The runner targets OpenAI-compatible `chat.completions.create(stream=True)` endpoints.
- Native provider-specific benchmarking flows are out of scope for now.
- If a provider claims compatibility but behaves differently, use `base_url` and validate with `warmup` first.
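Conceptually, TTFT is the time from sending the request to receiving the first streamed chunk, and total latency runs until the stream is exhausted. A stubbed sketch of that measurement loop (the generator stands in for a real `stream=True` response; llm-gateway-bench's internals may differ):

```python
import time

def fake_stream():
    """Stand-in for a streaming chat-completions response."""
    time.sleep(0.05)          # simulated time to first token
    yield "Hello"
    for _ in range(3):
        time.sleep(0.01)      # simulated inter-token gaps
        yield " token"

def measure(stream):
    start = time.perf_counter()
    ttft = None
    chunks = 0
    for _chunk in stream:
        if ttft is None:
            ttft = time.perf_counter() - start   # first chunk arrived
        chunks += 1
    total = time.perf_counter() - start
    return ttft, total, chunks

ttft, total, chunks = measure(fake_stream())
print(f"TTFT: {ttft * 1000:.0f} ms, total: {total * 1000:.0f} ms, chunks: {chunks}")
```

Swapping the stub for a real client's streaming iterator is the essence of what a warmup-then-run benchmark does per request.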
- Read the Quickstart
- Configure a suite in Configuration
- Check provider-specific notes in Providers
- Review advanced workflows in Advanced usage
PRs are welcome. See CONTRIBUTING.md and docs/contributing.md.
MIT. See LICENSE.
