Lean-Agentic Production Runbook

Quick Start

Build and Test

# Build all packages
cargo build --workspace --release

# Run all tests
cargo test --workspace

# Run benchmarks
cargo test --package benchmarks --release

# Run specific example
cargo run --package leanr-rag-gateway --example demo

Production Examples

1. RAG Gateway

# Run RAG gateway tests
cargo test --package leanr-rag-gateway

# Run with specific policy tests
cargo test --package leanr-rag-gateway policy

2. Finance Agent

# Compile finance example
rustc examples/finance/verified_finance_agent.rs --edition 2021

# Run finance tests (when integrated)
cargo test finance

3. Memory Copilot

# Compile memory copilot
rustc examples/memory-copilot/explainable_memory.rs --edition 2021

# Test recall precision
cargo test memory --test recall_precision

4. Trading Engine

# Compile trading example
rustc examples/trading/risk_bounded_trading.rs --edition 2021

# Test risk proofs
cargo test trading --test risk_proofs

5. Grid Operator

# Compile grid operator
rustc examples/grid-operator/safety_bounded_grid.rs --edition 2021

# Test safety proofs
cargo test grid --test safety_proofs

Performance Targets

Agent Coordination

Agent spawn: <1ms (P99)
Message throughput: 100K msg/s
Coordination overhead: <10ms (P99)
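
The P99 targets above only mean something once the percentile method is fixed. A minimal sketch, assuming the nearest-rank definition and latency samples collected in milliseconds; the `percentile` helper and the sample values are hypothetical and not part of the lean-agentic codebase:

```rust
/// Nearest-rank percentile: the smallest sample such that at least `pct`
/// percent of all samples are less than or equal to it.
/// Returns `None` for an empty slice or a percentile outside (0, 100].
fn percentile(samples: &[f64], pct: f64) -> Option<f64> {
    if samples.is_empty() || !(pct > 0.0 && pct <= 100.0) {
        return None;
    }
    let mut sorted = samples.to_vec();
    // total_cmp gives a total order over f64, so sorting cannot panic on NaN.
    sorted.sort_by(|a, b| a.total_cmp(b));
    // Nearest-rank position (1-based): ceil(pct * n / 100).
    let rank = (pct * sorted.len() as f64 / 100.0).ceil() as usize;
    Some(sorted[rank.max(1) - 1])
}

fn main() {
    // Hypothetical spawn latencies in milliseconds: 99 fast spawns, one slow outlier.
    let mut spawn_ms = vec![0.2_f64; 99];
    spawn_ms.push(5.0);

    let p99 = percentile(&spawn_ms, 99.0).expect("non-empty samples");
    // Target from this runbook: agent spawn < 1 ms at P99.
    println!("P99 spawn latency: {p99} ms, within target: {}", p99 < 1.0);
}
```

With nearest-rank, a single outlier in 100 samples does not move P99, but two would. Other percentile definitions (for example, linear interpolation) can give slightly different values near the target.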

Compilation

Incremental compile: <100ms
Cache hit rate: >80%
Type checking: <50ms (typical)

Verification

Ledger verification: <10% overhead
Policy verification: <5% overhead
Proof verification: Zero GC impact
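
The overhead targets are relative figures, so they need a baseline run without verification to compare against. A minimal sketch, assuming overhead is measured as extra wall-clock time over that baseline; the `overhead_percent` helper and the timings are hypothetical:

```rust
/// Relative overhead of a verified run over an unverified baseline, in percent.
/// Returns `None` when the baseline is not a positive, finite duration.
fn overhead_percent(baseline_ms: f64, verified_ms: f64) -> Option<f64> {
    if !(baseline_ms.is_finite() && baseline_ms > 0.0) {
        return None;
    }
    Some((verified_ms - baseline_ms) / baseline_ms * 100.0)
}

fn main() {
    // Hypothetical timings: 200 ms without ledger verification, 214 ms with it.
    let overhead = overhead_percent(200.0, 214.0).expect("positive baseline");
    // Target from this runbook: ledger verification < 10% overhead.
    println!(
        "ledger overhead: {overhead:.1}%, within target: {}",
        overhead < 10.0
    );
}
```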

Cost

Task cost: $0.10-$1.00 per 1K tasks
Spot savings: 40-70% vs on-demand
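
Both cost targets are simple ratios. A minimal sketch of the arithmetic, assuming total spend and task counts come from your billing data; the helper names and the dollar figures are hypothetical:

```rust
/// Cost normalised to 1,000 tasks. Returns `None` when no tasks were run.
fn cost_per_1k_tasks(total_cost_usd: f64, tasks: u64) -> Option<f64> {
    if tasks == 0 {
        return None;
    }
    Some(total_cost_usd / tasks as f64 * 1000.0)
}

/// Savings of the spot price relative to the on-demand price, in percent.
/// Returns `None` when the on-demand price is not positive.
fn spot_savings_percent(on_demand_hourly_usd: f64, spot_hourly_usd: f64) -> Option<f64> {
    if !(on_demand_hourly_usd > 0.0) {
        return None;
    }
    Some((on_demand_hourly_usd - spot_hourly_usd) / on_demand_hourly_usd * 100.0)
}

fn main() {
    // Hypothetical billing figures: $12.50 spent on 50,000 tasks.
    let per_1k = cost_per_1k_tasks(12.50, 50_000).expect("at least one task");
    // Hypothetical hourly prices: $0.40 on-demand versus $0.16 spot.
    let savings = spot_savings_percent(0.40, 0.16).expect("positive on-demand price");

    // Targets from this runbook: $0.10-$1.00 per 1K tasks, 40-70% spot savings.
    println!(
        "cost per 1K tasks: ${per_1k:.2}, in range: {}",
        (0.10..=1.00).contains(&per_1k)
    );
    println!(
        "spot savings: {savings:.0}%, in range: {}",
        (40.0..=70.0).contains(&savings)
    );
}
```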

Resilience

Recovery time: <5min
Availability: >95%
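
An availability target translates directly into a downtime budget, which makes it easier to see how many recoveries at the 5-minute target fit into a period. A minimal sketch of that arithmetic, assuming a 30-day window; the helper names and the incident count are hypothetical:

```rust
use std::time::Duration;

/// Maximum downtime allowed in `period` for an availability target in percent.
/// Returns `None` when the target is outside 0..=100 (including NaN).
fn downtime_budget(period: Duration, availability_target_percent: f64) -> Option<Duration> {
    if !(0.0..=100.0).contains(&availability_target_percent) {
        return None;
    }
    let unavailable_fraction = 1.0 - availability_target_percent / 100.0;
    Some(period.mul_f64(unavailable_fraction))
}

/// Availability actually achieved, in percent, given the observed downtime.
/// Returns `None` for a zero-length period.
fn achieved_availability_percent(period: Duration, downtime: Duration) -> Option<f64> {
    if period.is_zero() {
        return None;
    }
    Some((1.0 - downtime.as_secs_f64() / period.as_secs_f64()) * 100.0)
}

fn main() {
    let month = Duration::from_secs(30 * 24 * 60 * 60);

    // Target from this runbook: availability > 95%.
    let budget = downtime_budget(month, 95.0).expect("target within 0..=100");
    println!(
        "downtime budget at 95%: {:.1} h per 30 days",
        budget.as_secs_f64() / 3600.0
    );

    // Three hypothetical incidents, each recovered within the 5-minute target.
    let downtime = Duration::from_secs(3 * 5 * 60);
    let achieved = achieved_availability_percent(month, downtime).expect("non-zero period");
    println!("achieved availability: {achieved:.3}%");
}
```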

Monitoring

Key Metrics to Track

// Agent coordination
agent_spawn_latency_ms
message_throughput_per_sec
coordination_overhead_ms

// Compilation
compilation_time_ms
cache_hit_rate_percent
typecheck_time_ms

// Verification
ledger_verification_overhead_percent
policy_check_latency_ms
proof_generation_time_ms

// Cost
task_cost_usd
spot_instance_usage_percent
total_compute_cost_usd

// Resilience
recovery_time_seconds
availability_percent
error_rate
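
One way to tie these metrics back to the Performance Targets is to compare a snapshot against the thresholds and list whatever misses. A minimal sketch under stated assumptions: the `MetricsSnapshot` struct is hypothetical, only the field names come from the list above, and "100K msg/s" is read as a minimum throughput:

```rust
/// A point-in-time reading of a few metrics named in this runbook.
/// The struct is illustrative; it is not a lean-agentic type.
#[derive(Debug, Clone, Copy)]
struct MetricsSnapshot {
    agent_spawn_latency_ms: f64, // P99
    message_throughput_per_sec: f64,
    cache_hit_rate_percent: f64,
    ledger_verification_overhead_percent: f64,
    recovery_time_seconds: f64,
    availability_percent: f64,
}

impl MetricsSnapshot {
    /// Names of every metric that misses its target from "Performance Targets".
    fn violations(&self) -> Vec<&'static str> {
        let checks = [
            ("agent_spawn_latency_ms", self.agent_spawn_latency_ms < 1.0),
            // Assumption: the 100K msg/s target is a lower bound.
            ("message_throughput_per_sec", self.message_throughput_per_sec >= 100_000.0),
            ("cache_hit_rate_percent", self.cache_hit_rate_percent > 80.0),
            (
                "ledger_verification_overhead_percent",
                self.ledger_verification_overhead_percent < 10.0,
            ),
            ("recovery_time_seconds", self.recovery_time_seconds < 300.0),
            ("availability_percent", self.availability_percent > 95.0),
        ];

        let mut failed = Vec::new();
        for (name, ok) in checks {
            if !ok {
                failed.push(name);
            }
        }
        failed
    }
}

fn main() {
    // Hypothetical readings; the cache hit rate is deliberately below target.
    let snapshot = MetricsSnapshot {
        agent_spawn_latency_ms: 0.4,
        message_throughput_per_sec: 120_000.0,
        cache_hit_rate_percent: 72.0,
        ledger_verification_overhead_percent: 6.5,
        recovery_time_seconds: 90.0,
        availability_percent: 99.2,
    };
    println!("targets missed: {:?}", snapshot.violations());
}
```

Returning metric names rather than a single boolean keeps the result usable as a pointer into the Troubleshooting section.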

Health Checks

# System health
cargo check --workspace

# Test health
cargo test --workspace --no-fail-fast

# Benchmark health
cargo test --package benchmarks --release

Troubleshooting

High Latency

Symptom: P99 latency exceeds targets

Diagnosis:

cargo test --package benchmarks coordination::bench_agent_spawn
cargo test --package benchmarks coordination::bench_message_passing

Solutions:

Check for GC pressure
Verify WASM optimizations are enabled
Review coordination topology
Check network latency

Low Cache Hit Rate

Symptom: Cache hit rate <80%

Diagnosis:

cargo test --package benchmarks compilation::bench_incremental_compile

Solutions:

Review module dependencies
Increase cache size
Check for cache invalidation bugs
Profile hot paths

High Verification Overhead

Symptom: Verification overhead >10%

Diagnosis:

cargo test --package benchmarks verification::bench_ledger_verification
cargo test --package benchmarks verification::bench_policy_verification

Solutions:

Optimize proof generation
Cache verification results
Batch verification operations
Review proof complexity
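
"Cache verification results" can be sketched as plain memoisation. This is an illustration only, assuming each proof has a stable identifier and that its outcome is deterministic for that identifier; `VerificationCache` and the proof IDs are hypothetical and do not reflect how lean-agentic stores proofs:

```rust
use std::collections::HashMap;

/// Memoises verification outcomes by an opaque proof identifier.
struct VerificationCache {
    results: HashMap<String, bool>,
    hits: u64,
    misses: u64,
}

impl VerificationCache {
    fn new() -> Self {
        Self { results: HashMap::new(), hits: 0, misses: 0 }
    }

    /// Returns the cached outcome for `proof_id`, running `verify` only on a miss.
    fn check<F: FnOnce() -> bool>(&mut self, proof_id: &str, verify: F) -> bool {
        if let Some(&ok) = self.results.get(proof_id) {
            self.hits += 1;
            return ok;
        }
        self.misses += 1;
        let ok = verify();
        self.results.insert(proof_id.to_owned(), ok);
        ok
    }

    /// Share of lookups served from the cache, in percent; `None` before any lookup.
    fn hit_rate_percent(&self) -> Option<f64> {
        let total = self.hits + self.misses;
        if total == 0 {
            return None;
        }
        Some(self.hits as f64 / total as f64 * 100.0)
    }
}

fn main() {
    let mut cache = VerificationCache::new();
    let mut verifier_runs = 0;

    // The same hypothetical proof is checked four times; the verifier runs once.
    for _ in 0..4 {
        cache.check("proof-a", || {
            verifier_runs += 1;
            true // stand-in for an expensive verification call
        });
    }

    println!(
        "verifier runs: {verifier_runs}, hit rate: {:?}%",
        cache.hit_rate_percent()
    );
}
```

A cache like this is only sound if entries are dropped whenever the inputs to verification change, for example after a policy update.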

Cost Overruns

Symptom: Task cost >$1.00 per 1K

Diagnosis:

cargo test --package benchmarks cost::bench_task_cost
cargo test --package benchmarks cost::bench_spot_savings

Solutions:

Increase spot instance usage
Optimize resource allocation
Review task batching
Check for resource leaks

Recovery Issues

Symptom: Recovery time >5min

Diagnosis:

cargo test --package benchmarks chaos::bench_recovery_time
cargo test --package benchmarks chaos::bench_network_partition

Solutions:

Review failover logic
Check quorum configuration
Optimize state restoration
Verify health check intervals

Deployment

Pre-Deployment Checklist

Deployment Steps

Build Release
```
cargo build --workspace --release
```
Run Full Test Suite
```
cargo test --workspace --release
```

Run Benchmarks

cargo test --package benchmarks --release

Verify Targets

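# List the targets that report PASS. Without `set -o pipefail`, the exit status
# of this pipeline comes from grep, not cargo, so also confirm nothing failed.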
cargo test --package benchmarks --release -- --nocapture | grep "PASS"

Deploy

# Deploy to staging first
./deploy.sh staging

# Smoke tests
./smoke-tests.sh staging

# Deploy to production
./deploy.sh production

Post-Deployment

Monitor key metrics for 15 minutes
Check error rates
Verify performance targets
Review audit logs
Confirm cost tracking

Rollback Procedure

If issues are detected:

Immediate Rollback
```
./rollback.sh production
```
Verify Rollback
```
./smoke-tests.sh production
```

Investigate

# Review logs
tail -f logs/production.log

# Check metrics
cargo test --package benchmarks --release

Fix and Redeploy
- Fix issue
- Run full test suite
- Redeploy following normal procedure

Maintenance

Daily

Monitor error rates
Check performance metrics
Review audit logs
Verify cost tracking

Weekly

Run full benchmark suite
Review test coverage
Check for dependency updates
Audit security

Monthly

Chaos engineering tests
Performance regression analysis
Cost optimization review
Documentation updates

Emergency Procedures

Complete Outage

Assess
```
./health-check.sh
```
Emergency Stop
```
./emergency-stop.sh
```
Restore
```
./restore-from-backup.sh <timestamp>
```
Verify
```
./smoke-tests.sh production
```

Security Incident

Isolate
```
./isolate-affected-systems.sh
```

Audit

cargo test --package leanr-rag-gateway audit

Review Logs

./export-audit-logs.sh <start-time> <end-time>

Remediate
- Patch vulnerabilities
- Rotate credentials
- Update policies
- Redeploy

Data Corruption

Stop Writes
```
./pause-writes.sh
```

Verify Integrity

cargo test --package benchmarks verification::bench_ledger_verification

Restore
```
./restore-clean-state.sh <backup-id>
```
Verify
```
cargo test --workspace
```

Performance Optimization

Profiling

# CPU profiling
cargo flamegraph --test benchmark_suite

# Memory profiling (cargo-instruments, macOS only)
cargo instruments --template Allocations

# Benchmarking
cargo bench --package benchmarks

Optimization Checklist

Contact

On-Call: See PagerDuty rotation
Slack: #lean-agentic-alerts
Email: ops@lean-agentic.dev
Documentation: /workspaces/lean-agentic/docs/

Appendix

Useful Commands

# Quick health check
cargo check --workspace && cargo test --workspace

# Full benchmark suite
cargo test --package benchmarks --release -- --nocapture

# Specific benchmark category
cargo test --package benchmarks coordination --release

# Coverage report
cargo tarpaulin --workspace --out Html

# Security audit
cargo audit

# Dependency tree
cargo tree

# Clean build
cargo clean && cargo build --workspace --release

Log Locations

Application logs: /var/log/lean-agentic/app.log
Audit logs: /var/log/lean-agentic/audit.log
Performance logs: /var/log/lean-agentic/perf.log
Error logs: /var/log/lean-agentic/error.log

Configuration

Production: config/production.toml
Staging: config/staging.toml
Development: config/development.toml

Last Updated: 2025-10-25
Version: 1.0.0
Status: Production Ready ✅
