# CLI Reference

Complete reference for all LLM Test Bench CLI commands.
## Global Options

Available for all commands:
- `--verbose, -v` - Enable verbose output
- `--no-color` - Disable colored output
- `--help, -h` - Print help information
- `--version, -V` - Print version information

## test

Run a single test against an LLM provider.
Alias: `t`

llm-test-bench test [OPTIONS] --provider <PROVIDER> --prompt <PROMPT>

Options:

- `--provider <PROVIDER>` - Provider name (e.g., openai, anthropic)
- `--model <MODEL>` - Model to use (overrides provider default)
- `--prompt <PROMPT>` - Test prompt
- `--expected <EXPECTED>` - Expected output for validation
- `--temperature <TEMP>` - Temperature setting (0.0-2.0)
- `--max-tokens <TOKENS>` - Maximum tokens to generate
- `--timeout <SECONDS>` - Request timeout in seconds
- `--config <PATH>` - Path to custom configuration file

Examples:
# Basic test
llm-test-bench test --provider openai --prompt "Explain quantum computing"
# Test with specific model
llm-test-bench test --provider openai --model gpt-4 --prompt "Hello"
# Test with validation
llm-test-bench test --provider anthropic --prompt "2+2" --expected "4"
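The sampling and limit flags documented above can be combined on a single invocation; the values here are illustrative:

# Test with sampling controls
llm-test-bench test \
--provider openai \
--prompt "Summarize the plot of Hamlet" \
--temperature 0.2 \
--max-tokens 256 \
--timeout 30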
## bench

Run benchmark tests across multiple providers.

Alias: `b`

llm-test-bench bench [OPTIONS] --dataset <PATH> --providers <PROVIDERS>

Options:

- `--dataset <PATH>` - Path to dataset file (JSON or YAML)
- `--providers <PROVIDERS>` - Comma-separated list of providers
- `--concurrency <N>` - Number of concurrent requests (default: 5)
- `--output <PATH>` - Output directory (default: ./bench-results)
- `--export <FORMAT>` - Export format: json, csv, both (default: both)
- `--continue-on-failure` - Continue on test failure (default: true)
- `--save-responses` - Save raw responses (default: true)
- `--delay <MS>` - Request delay in milliseconds
- `--config <PATH>` - Path to custom configuration file
- `--metrics <METRICS>` - Comma-separated evaluation metrics
- `--judge-model <MODEL>` - Judge model for evaluations
- `--judge-provider <PROVIDER>` - Judge provider
- `--dashboard` - Generate HTML dashboard after benchmark

Examples:
# Basic benchmark
llm-test-bench bench --dataset tests.json --providers openai,anthropic
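The exact dataset schema is not reproduced here; as a rough sketch, a minimal JSON dataset (field names hypothetical, not the official schema) could be created like this:

# Hypothetical dataset layout -- consult the dataset docs for the real schema
cat > tests.json <<'EOF'
[
  {"id": "qa-1", "prompt": "What is 2+2?", "expected": "4"},
  {"id": "qa-2", "prompt": "Name the capital of France.", "expected": "Paris"}
]
EOF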
# Benchmark with evaluation
llm-test-bench bench \
--dataset tests.json \
--providers openai \
--metrics faithfulness,relevance \
--dashboard
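When LLM-judged metrics are enabled, the judge can be pinned explicitly with the documented judge flags:

# Benchmark with an explicit judge
llm-test-bench bench \
--dataset tests.json \
--providers anthropic \
--metrics faithfulness \
--judge-provider openai \
--judge-model gpt-4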
# Benchmark with custom concurrency
llm-test-bench bench \
--dataset tests.json \
--providers openai \
--concurrency 10 \
--delay 100

## eval

Evaluate test results with metrics.

Alias: `e`

llm-test-bench eval [OPTIONS] --results <PATH>

Options:

- `--results <PATH>` - Path to results file
- `--metrics <METRICS>` - Comma-separated evaluation metrics
- `--judge-model <MODEL>` - Judge model for evaluations
- `--output <PATH>` - Output file for evaluation results
- `--config <PATH>` - Path to custom configuration file

Examples:
# Evaluate results
llm-test-bench eval --results bench-results/openai-results.json
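Evaluation scores can be saved to a file with --output:

# Evaluate and save the scores
llm-test-bench eval \
--results results.json \
--output eval-scores.json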
# Evaluate with specific metrics
llm-test-bench eval \
--results results.json \
--metrics faithfulness,relevance,coherence

## compare

Compare multiple models on the same prompt or dataset.

Alias: `c`

llm-test-bench compare [OPTIONS] --models <MODELS>

Options:

- `--prompt <PROMPT>` - Single prompt to test (conflicts with --dataset)
- `--dataset <PATH>` - Dataset file for batch comparison
- `--models <MODELS>` - Comma-separated models (format: provider:model)
- `--metrics <METRICS>` - Evaluation metrics (default: faithfulness,relevance)
- `--statistical-tests` - Run statistical significance tests
- `--output <FORMAT>` - Output format: table, json, dashboard (default: table)
- `--output-file <PATH>` - Save results to file
- `--dashboard` - Generate HTML dashboard
- `--config <PATH>` - Path to custom configuration file
- `--concurrency <N>` - Maximum concurrent comparisons (default: 5)

Examples:
# Compare two models on a prompt
llm-test-bench compare \
--prompt "Explain quantum computing" \
--models openai:gpt-4,anthropic:claude-3-opus
# Compare with statistical tests
llm-test-bench compare \
--prompt "Test prompt" \
--models openai:gpt-4,openai:gpt-3.5-turbo,anthropic:claude-3-sonnet \
--statistical-tests \
--dashboard
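For scripting, switch from the default table output to JSON:

# Machine-readable comparison
llm-test-bench compare \
--prompt "Test prompt" \
--models openai:gpt-4,anthropic:claude-3-opus \
--output json \
--output-file comparison.json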
# Batch comparison
llm-test-bench compare \
--dataset tests.json \
--models openai:gpt-4,anthropic:claude-3-opus \
--metrics faithfulness,relevance,coherence \
--output-file comparison.json

## dashboard

Generate interactive HTML dashboards from results.

Alias: `d`

llm-test-bench dashboard [OPTIONS] --results <FILES> --output <PATH>

Options:

- `--results <FILES>` - Comma-separated result files to visualize
- `--dashboard-type <TYPE>` - Type: benchmark, comparison, analysis, custom (default: benchmark)
- `--theme <THEME>` - Theme: light, dark, auto (default: auto)
- `--output <PATH>` - Output file path (default: dashboard.html)
- `--title <TITLE>` - Dashboard title
- `--include-raw-data` - Include raw data in dashboard
- `--config <PATH>` - Path to custom configuration file

Examples:
# Generate benchmark dashboard
llm-test-bench dashboard \
--results bench-results/*.json \
--output benchmark.html
# Generate comparison dashboard
llm-test-bench dashboard \
--results comparison-results.json \
--dashboard-type comparison \
--theme dark \
--output comparison-dashboard.html
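Raw data can be embedded in the generated HTML for later inspection:

# Dashboard with embedded raw data
llm-test-bench dashboard \
--results bench-results/openai-results.json \
--include-raw-data \
--output raw-dashboard.html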
# Multiple result files
llm-test-bench dashboard \
--results results1.json,results2.json,results3.json \
--title "Multi-Provider Comparison" \
--output multi-dashboard.html

## analyze

Perform statistical analysis comparing baseline and new results.

Alias: `a`

llm-test-bench analyze [OPTIONS] --baseline <PATH> --comparison <PATH>

Options:

- `--baseline <PATH>` - Baseline results file
- `--comparison <PATH>` - Comparison results file
- `--metric <METRIC>` - Metric to analyze (default: overall)
- `--confidence-level <LEVEL>` - Confidence level: 0.90, 0.95, 0.99 (default: 0.95)
- `--fail-on-regression` - Exit with error code 2 if regression detected
- `--effect-size-threshold <THRESHOLD>` - Effect size threshold (default: 0.2)
- `--output <FORMAT>` - Output format: detailed, summary, json (default: detailed)
- `--report-file <PATH>` - Save report to file
- `--config <PATH>` - Path to custom configuration file

Exit codes:

- `0` - Success, no regression
- `1` - Error during analysis
- `2` - Regression detected (with --fail-on-regression)
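Because --fail-on-regression maps regressions to exit code 2, a CI step can gate on the exit status; a minimal sketch:

# Fail a CI job when a regression is detected
llm-test-bench analyze \
--baseline prod-baseline.json \
--comparison pr-results.json \
--fail-on-regression
status=$?
if [ "$status" -eq 2 ]; then
  echo "Quality regression detected; failing the build" >&2
fi
exit "$status"

Examples: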
# Basic analysis
llm-test-bench analyze \
--baseline baseline-results.json \
--comparison new-results.json
# Analysis with regression check
llm-test-bench analyze \
--baseline v1-results.json \
--comparison v2-results.json \
--metric faithfulness \
--fail-on-regression
# CI/CD integration
llm-test-bench analyze \
--baseline prod-baseline.json \
--comparison pr-results.json \
--confidence-level 0.99 \
--fail-on-regression \
--output summary

## optimize

Recommend cost-optimized model alternatives.

Alias: `o`

llm-test-bench optimize [OPTIONS] --current-model <MODEL> --monthly-requests <N>

Options:

- `--current-model <MODEL>` - Current model (format: provider:model or model)
- `--quality-threshold <THRESHOLD>` - Quality threshold, 0.0-1.0 (default: 0.75)
- `--monthly-requests <N>` - Monthly request volume
- `--history <PATH>` - Historical results for analysis
- `--max-cost-increase <PERCENT>` - Maximum acceptable cost increase in percent (default: 10.0)
- `--min-quality <SCORE>` - Minimum required quality score (default: 0.70)
- `--include-experimental` - Include experimental models
- `--output <FORMAT>` - Output format: detailed, summary, json (default: detailed)
- `--report-file <PATH>` - Save optimization report
- `--config <PATH>` - Path to custom configuration file

Examples:
# Basic optimization
llm-test-bench optimize \
--current-model gpt-4 \
--monthly-requests 100000
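Existing benchmark results can inform the recommendation via --history:

# Optimization informed by historical results
llm-test-bench optimize \
--current-model gpt-4 \
--monthly-requests 100000 \
--history bench-results/openai-results.json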
# Optimization with constraints
llm-test-bench optimize \
--current-model openai:gpt-4 \
--monthly-requests 100000 \
--quality-threshold 0.85 \
--max-cost-increase 5.0
# Save detailed report
llm-test-bench optimize \
--current-model gpt-4 \
--monthly-requests 50000 \
--output detailed \
--report-file optimization-report.json

## config

Manage configuration files and settings.

### config show

Display current configuration.

llm-test-bench config show [OPTIONS]

Options:

- `--format <FORMAT>` - Output format: toml, json (default: toml)
- `--config <PATH>` - Path to configuration file
Example:
llm-test-bench config show
llm-test-bench config show --format json

### config init

Initialize a new configuration file.

llm-test-bench config init [OPTIONS]

Options:

- `--force` - Overwrite existing configuration
- `--path <PATH>` - Custom configuration path
Example:
llm-test-bench config init
llm-test-bench config init --path ./custom-config.toml

### config validate

Validate a configuration file.

llm-test-bench config validate [OPTIONS]

Options:

- `--config <PATH>` - Path to configuration file
Example:
llm-test-bench config validate
llm-test-bench config validate --config ./my-config.toml

### config path

Show the configuration file path.

llm-test-bench config path

## completions

Generate shell completion scripts.

llm-test-bench completions <SHELL>

Supported shells:

- bash
- zsh
- fish
- powershell
- elvish

Examples:
# Bash
llm-test-bench completions bash > ~/.local/share/bash-completion/completions/llm-test-bench
# Zsh
llm-test-bench completions zsh > ~/.zfunc/_llm-test-bench
# Fish
llm-test-bench completions fish > ~/.config/fish/completions/llm-test-bench.fish

## Configuration

The CLI uses a hierarchical configuration system:
1. CLI arguments (highest priority)
2. Environment variables (LLM_TEST_BENCH_ prefix)
3. Config files (~/.config/llm-test-bench/config.toml)
4. Defaults (lowest priority)
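For example, assuming the --concurrency flag maps onto benchmarks.parallel_requests (both default to 5), an environment variable overrides the config file and a CLI flag overrides both:

# config.toml: parallel_requests = 5
export LLM_TEST_BENCH_BENCHMARKS__PARALLEL_REQUESTS=8
llm-test-bench bench --dataset tests.json --providers openai --concurrency 10
# Effective concurrency is 10: the CLI flag wins over the env var and the config file.

A full example configuration file: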
# ~/.config/llm-test-bench/config.toml
[providers.openai]
api_key_env = "OPENAI_API_KEY"
base_url = "https://api.openai.com/v1"
default_model = "gpt-4-turbo"
timeout_seconds = 30
max_retries = 3
[providers.anthropic]
api_key_env = "ANTHROPIC_API_KEY"
base_url = "https://api.anthropic.com/v1"
default_model = "claude-3-sonnet-20240229"
timeout_seconds = 30
max_retries = 3
[benchmarks]
output_dir = "./bench-results"
save_responses = true
parallel_requests = 5
continue_on_failure = true
[evaluation]
metrics = ["perplexity", "faithfulness", "relevance", "latency"]
llm_judge_model = "gpt-4"
confidence_threshold = 0.7
include_explanations = true
[orchestration]
max_parallel_models = 5
comparison_timeout_seconds = 300
routing_strategy = "quality_first"
enable_caching = true
[analytics]
confidence_level = 0.95
effect_size_threshold = 0.2
quality_threshold = 0.75
min_sample_size = 30
[dashboard]
theme = "auto"
chart_colors = ["#3B82F6", "#10B981", "#F59E0B", "#EF4444"]
max_data_points = 1000
enable_interactive = true

## Environment Variables

Override configuration with environment variables; double underscores separate nested keys:
export LLM_TEST_BENCH_PROVIDERS__OPENAI__DEFAULT_MODEL="gpt-4"
export LLM_TEST_BENCH_BENCHMARKS__PARALLEL_REQUESTS=10
export LLM_TEST_BENCH_EVALUATION__LLM_JUDGE_MODEL="claude-3-opus"

## Common Workflows

### Benchmark, Compare, and Visualize

# Run benchmark
llm-test-bench bench \
--dataset tests.json \
--providers openai,anthropic \
--metrics faithfulness,relevance \
--dashboard
# Compare specific models
llm-test-bench compare \
--dataset tests.json \
--models openai:gpt-4,anthropic:claude-3-opus \
--statistical-tests \
--output-file comparison.json
# Generate dashboard
llm-test-bench dashboard \
--results bench-results/*.json,comparison.json \
--title "Complete Analysis" \
--output full-dashboard.html

### CI/CD Regression Check

# In CI pipeline
llm-test-bench bench \
--dataset regression-tests.json \
--providers openai \
--output ./ci-results
llm-test-bench analyze \
--baseline prod-baseline.json \
--comparison ci-results/openai-results.json \
--fail-on-regression \
--confidence-level 0.99 \
--output summary
# Exit code 2 if regression detected

### Cost Optimization

# Analyze current costs
llm-test-bench optimize \
--current-model gpt-4 \
--monthly-requests 1000000 \
--quality-threshold 0.85 \
--report-file optimization-report.json
# Test recommended alternative
llm-test-bench compare \
--dataset sample-tests.json \
--models openai:gpt-4,anthropic:claude-3-sonnet \
--metrics faithfulness,relevance,coherence \
--dashboard

## Troubleshooting

Set provider API keys in the environment:

export OPENAI_API_KEY="your-api-key"
export ANTHROPIC_API_KEY="your-api-key"

# Validate configuration
llm-test-bench config validate
# Show current configuration
llm-test-bench config show
# Reset to defaults
llm-test-bench config init --force

Enable verbose mode for debugging:

llm-test-bench --verbose <command> <options>

## Exit Codes

- `0` - Success
- `1` - General error
- `2` - Regression detected (with --fail-on-regression)
- `3` - Configuration error
- `4` - Invalid input
- `5` - Provider error (API key missing, rate limit, etc.)
- `6` - Cost limit exceeded
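A wrapper script might branch on these codes; a sketch (the messages are illustrative):

#!/usr/bin/env bash
# Run any llm-test-bench command and explain the exit code
llm-test-bench "$@"
code=$?
case "$code" in
  0) ;;  # success
  2) echo "regression detected" >&2 ;;
  5) echo "provider error: check API keys and rate limits" >&2 ;;
  *) echo "llm-test-bench exited with code $code" >&2 ;;
esac
exit "$code"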
## Support

For issues and questions:
- GitHub Issues: https://github.com/llm-test-bench/llm-test-bench/issues
- Documentation: https://github.com/llm-test-bench/llm-test-bench/docs