Comprehensive latency profiling and performance analysis for Large Language Models
Quick Start • Features • Documentation • Benchmarks • Contributing
LLM-Latency-Lens is a production-grade, open-source profiling tool for measuring and analyzing latency across all major LLM providers. Built in Rust for speed and timing precision, it provides detailed latency, throughput, and cost insights for production LLM applications.
- Sub-millisecond Precision: Nanosecond-accurate timing for Time-to-First-Token (TTFT) and token streaming
- Multi-Provider Support: OpenAI, Anthropic, Google, Azure, Cohere, and more
- Enterprise-Ready: Battle-tested concurrency control, retry logic, and error handling
- Cost Analytics: Real-time cost tracking and ROI analysis
- Open Source: Apache 2.0 licensed, community-driven development
# Install as a library
npm install @llm-dev-ops/latency-lens
# Or install globally for CLI
npm install -g @llm-dev-ops/latency-lens
# Test the CLI
latency-lens test
latency-lens version
# Add to your Cargo.toml
[dependencies]
llm-latency-lens-core = "0.1.0"
llm-latency-lens-providers = "0.1.2"
llm-latency-lens-metrics = "0.1.0"
llm-latency-lens-exporters = "0.1.0"
# Or use in your project
cargo add llm-latency-lens-core
cargo add llm-latency-lens-providers
cargo add llm-latency-lens-metrics
cargo add llm-latency-lens-exporters
git clone https://github.com/llm-dev-ops/llm-latency-lens.git
cd llm-latency-lens
cargo build --release
docker pull llm-devops/llm-latency-lens:latest
docker run -e OPENAI_API_KEY=sk-... llm-devops/llm-latency-lens profile --provider openai --model gpt-4
Download pre-built binaries for Linux, macOS, and Windows from our releases page.
import { LatencyCollector } from '@llm-dev-ops/latency-lens';
// Create collector with 60-second window
const collector = new LatencyCollector(60000);
// Start tracking a request
const requestId = collector.start_request('openai', 'gpt-4-turbo');
// Record first token received
collector.record_first_token(requestId);
// Record subsequent tokens
collector.record_token(requestId);
collector.record_token(requestId);
// Complete the request
collector.complete_request(requestId, 150, 800, null, 0.05);
// Get metrics
const metrics = collector.get_metrics();
console.log('TTFT P95:', metrics.ttft_distribution.p95_ms, 'ms');
console.log('Throughput:', metrics.throughput.tokens_per_second, 'tokens/sec');
# Set your API key
export OPENAI_API_KEY=sk-...
# Profile OpenAI GPT-4
llm-latency-lens profile \
--provider openai \
--model gpt-4 \
--prompt "Explain quantum computing in simple terms" \
--iterations 100
# Profile with streaming enabled
llm-latency-lens profile \
--provider anthropic \
--model claude-3-opus-20240229 \
--prompt "Write a Python function to calculate Fibonacci numbers" \
--stream \
--iterations 50 \
--concurrency 10
# Compare multiple providers
llm-latency-lens compare \
--config benchmark.yaml \
--output results.json
LLM Latency Lens - Benchmark Results
=====================================
Provider: openai | Model: gpt-4-turbo-preview
Duration: 15.3s | Requests: 100 | Concurrency: 20
┌──────────────────────┬──────────┬──────────┬──────────┬──────────┬──────────┐
│ Metric │ Min │ Mean │ Median │ p95 │ p99 │
├──────────────────────┼──────────┼──────────┼──────────┼──────────┼──────────┤
│ TTFT (ms) │ 234.2 │ 456.8 │ 432.1 │ 678.9 │ 789.3 │
│ Total Duration (ms) │ 1234.5 │ 2456.7 │ 2389.4 │ 3456.8 │ 3789.2 │
│ Tokens/sec │ 12.3 │ 45.6 │ 44.2 │ 67.8 │ 72.1 │
│ Inter-token (ms) │ 8.2 │ 22.4 │ 21.8 │ 34.6 │ 42.3 │
└──────────────────────┴──────────┴──────────┴──────────┴──────────┴──────────┘
Token Usage:
Prompt: 15,234 tokens | Completion: 45,678 tokens | Total: 60,912 tokens
Cost Analysis:
Total Cost: $1.23 | Cost/Request: $0.012 | Cost/1K tokens: $0.020
Success Rate: 98.0% (98/100)
- Nanosecond Accuracy: High-resolution timing using hardware counters
- TTFT Measurement: Critical metric for perceived responsiveness
- Inter-token Latency: Track consistency of token generation
- Network Breakdown: DNS, TLS, connection establishment timing
- OpenAI: GPT-4, GPT-4o, GPT-3.5 Turbo, o1, o3
- Anthropic: Claude 3 Opus, Sonnet, Haiku (including extended thinking)
- Google: Gemini Pro, Gemini Ultra (coming soon)
- Azure OpenAI: Full compatibility
- Cohere: Command models
- Custom Providers: Generic HTTP adapter for any API
- Statistical Metrics: Min, max, mean, median, std dev, percentiles (p50, p95, p99, p999)
- Histogram Generation: HDR histograms for accurate percentile calculation (see the sketch after this list)
- Real-time Streaming: Process tokens as they arrive
- Cost Tracking: Accurate pricing based on current rates
- High Throughput: Handle 1000+ concurrent requests
- Rate Limiting: Per-provider rate limiters with token bucket algorithm
- Retry Logic: Automatic retries with exponential backoff
- Connection Pooling: Efficient HTTP/2 connection reuse
- Multiple Formats: JSON, CSV, binary (MessagePack/Bincode)
- Time-series DBs: InfluxDB, Prometheus, Datadog
- CI/CD Integration: GitHub Actions, GitLab CI, Jenkins
- Grafana Dashboards: Pre-built visualization templates
- OpenTelemetry: Full trace and metrics export
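As a concrete reference for the percentile bullets above, here is a minimal sketch, not LLM-Latency-Lens's internal code, of how TTFT percentiles can be computed with the hdrhistogram crate listed under Acknowledgments (the sample values are illustrative):

use hdrhistogram::Histogram;

fn main() {
    // Track TTFT in milliseconds: range 1 ms to 60 s, 3 significant digits.
    let mut ttft_ms = Histogram::<u64>::new_with_bounds(1, 60_000, 3).expect("valid bounds");

    // Illustrative samples; in practice each recorded value is a measured TTFT.
    for sample in [234u64, 310, 432, 455, 468, 512, 679, 789] {
        ttft_ms.record(sample).expect("value within bounds");
    }

    println!("p50: {} ms", ttft_ms.value_at_quantile(0.50));
    println!("p95: {} ms", ttft_ms.value_at_quantile(0.95));
    println!("p99: {} ms", ttft_ms.value_at_quantile(0.99));
}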
Use as a Rust library in your applications:
use llm_latency_lens_providers::{OpenAIProvider, StreamingRequest, MessageRole};
use llm_latency_lens_core::TimingEngine;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let provider = OpenAIProvider::new("sk-...");
    let timing = TimingEngine::new();

    let request = StreamingRequest::builder()
        .model("gpt-4o")
        .message(MessageRole::User, "Explain quantum computing")
        .max_tokens(500)
        .temperature(0.7)
        .build();

    let response = provider.stream(request, &timing).await?;

    println!("TTFT: {:?}", response.ttft);
    println!("Total tokens: {}", response.metadata.completion_tokens);

    Ok(())
}

Define complex benchmark scenarios:
# benchmark.yaml
providers:
  - name: openai
    models: [gpt-4-turbo-preview, gpt-3.5-turbo]
  - name: anthropic
    models: [claude-3-opus-20240229]
workload:
  scenarios:
    - name: short_prompt_high_concurrency
      prompt: "What is the capital of France?"
      requests: 100
      concurrency: 20
    - name: long_prompt_streaming
      prompt: "Write a comprehensive guide to machine learning"
      requests: 50
      concurrency: 5
      stream: true
execution:
  max_concurrency: 50
  warmup_requests: 5
  retry:
    max_attempts: 3
    initial_backoff_ms: 1000
output:
  export:
    - format: json
      path: ./results/benchmark_{timestamp}.json
    - format: csv
      path: ./results/benchmark_{timestamp}.csv
- User Guide - Comprehensive usage documentation
- Installation Guide - Detailed installation instructions
- Quick Start Tutorial - Get up and running in 5 minutes
- API Documentation - Library usage and integration patterns
- Configuration Reference - All configuration options
- Provider Guide - Provider-specific settings
- Architecture Overview - System design and components
- Data Flow - Request lifecycle and metrics pipeline
- Crate Structure - Internal organization
- Ecosystem Integration - Integration with other tools
- Performance Tuning - Optimization strategies
- Troubleshooting - Common issues and solutions
LLM-Latency-Lens itself has minimal performance overhead:
| Metric | Value |
|---|---|
| Timing Overhead | < 100 nanoseconds per measurement |
| Memory Usage | < 100MB baseline |
| CPU Usage | < 5% overhead per request |
| Throughput | 1000+ concurrent requests |
| Accuracy | ±0.1% percentile calculation |
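The timing-overhead figure above refers to the cost of taking a single high-resolution timestamp. As a rough illustration only, not the tool's internal implementation, a measurement of this kind can be taken with the quanta crate listed under Acknowledgments:

use quanta::Clock;
use std::time::Duration;

fn main() {
    // Hypothetical stand-in for awaiting the first streamed token.
    fn wait_for_first_token() {
        std::thread::sleep(Duration::from_millis(120));
    }

    let clock = Clock::new();
    let start = clock.raw();            // raw monotonic hardware timestamp
    wait_for_first_token();
    let end = clock.raw();
    let ttft = clock.delta(start, end); // converted to std::time::Duration
    println!("TTFT: {:?} ({} ns)", ttft, ttft.as_nanos());
}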
Benchmarking OpenAI GPT-4 Turbo (100 requests, 20 concurrent):
- TTFT p50: 432.1ms
- TTFT p95: 678.9ms
- Throughput: 44.2 tokens/sec
- Success Rate: 98.0%
- Cost: $0.012 per request
- Profile LLM APIs during development
- Compare model performance before deployment
- Identify latency regressions in CI/CD
- Optimize prompt engineering for speed
- Continuous latency monitoring
- SLA compliance verification
- Cost optimization analysis
- Provider comparison for failover
- Academic research on LLM performance
- Benchmark new models and providers
- Analyze scaling characteristics
- Study geographic latency patterns
- Track API spending in real-time
- Compare cost vs. performance tradeoffs
- Project monthly costs based on usage (see the illustration after this list)
- Identify opportunities for optimization
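For the cost-projection use case above, the per-request cost that LLM-Latency-Lens reports can be extrapolated directly; as an illustration with a hypothetical traffic volume, the sample run's $0.012 per request at 50,000 requests per day projects to roughly 0.012 × 50,000 × 30 ≈ $18,000 per month.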
We welcome contributions from the community! See our Contributing Guide for details on:
- Code of Conduct
- Development setup
- Pull request process
- Testing requirements
- Documentation standards
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Make your changes
- Run tests (`cargo test`)
- Run lints (`cargo clippy`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
- Priority Support: 24/7 support with SLA
- Custom Integrations: Tailored provider adapters
- On-Premise Deployment: Self-hosted solutions
- Training & Consulting: Expert guidance
- Advanced Analytics: Custom dashboards and reporting
- Enterprise Sales: enterprise@llm-devops.com
- Support: support@llm-devops.com
- Community: Discord
- ✅ Core timing engine
- ✅ OpenAI and Anthropic providers
- ✅ Streaming support
- ✅ Basic CLI
- ✅ JSON/CSV export
- 🔄 Google Gemini provider
- 🔄 Azure OpenAI support
- 🔄 Cohere integration
- 🔄 Prometheus metrics
- 🔄 Grafana dashboards
- 📋 Distributed execution
- 📋 Real-time dashboard
- 📋 Historical analysis
- 📋 AI-powered optimization
- 📋 Multi-region testing
See our full roadmap for details.
| Tool | Language | TTFT Accuracy | Streaming | Multi-Provider | Cost Tracking |
|---|---|---|---|---|---|
| LLM-Latency-Lens | Rust | ✅ Nanosecond | ✅ Yes | ✅ 5+ providers | ✅ Real-time |
| Tool A | Python | ❌ No | ❌ No | | |
| Tool B | Go | ✅ Microsecond | ✅ Yes | | |
| Tool C | Node.js | | | ❌ 1 provider | ❌ No |
Security is a top priority. See our Security Policy for:
- Vulnerability reporting process
- Security update policy
- API key handling best practices
- Audit logs and compliance
Licensed under the Apache License, Version 2.0. See LICENSE for details.
Copyright 2024 LLM DevOps Team
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
Built with these excellent open-source projects:
- Tokio - Async runtime
- Reqwest - HTTP client
- Quanta - High-precision timing
- HDRHistogram - Latency percentiles
- Clap - CLI framework
- GitHub Discussions: Ask questions and share ideas
- Discord: Join our community
- Twitter: @llmlatencylens
- Blog: Read our technical blog
If you find LLM-Latency-Lens useful, please consider giving us a star! ⭐
Made with ❤️ by the LLM DevOps Team