Every question is a doorway to deeper understanding
A: LLM-Runner-Router is a universal model orchestration system that provides format-agnostic model loading, intelligent routing, and seamless inference across different runtime environments (browser, Node.js, edge). It acts as a unified interface for multiple LLM formats and providers.
A: Currently supported formats include:
- GGUF/GGML (fully implemented)
- ONNX (planned)
- Safetensors (planned)
- HuggingFace Hub (planned)
- Custom formats (via plugin system)
A: The system is designed for production use with features like:
- Load balancing and fallback mechanisms
- Comprehensive security controls
- Performance monitoring and metrics
- Health checks and circuit breakers
- Container-ready deployment
However, check the current implementation status for specific features you need.
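As a quick orientation, the sketch below pulls the production-oriented options covered in more detail later in this FAQ (fallback models, circuit breaker, health checks, Prometheus metrics, authentication) into one configuration; treat it as illustrative rather than a verified default.

```javascript
import LLMRouter from 'llm-runner-router';

// Illustrative production-style configuration assembled from options
// shown in the answers below (fallbacks, circuit breaker, health checks,
// metrics, authentication).
const router = new LLMRouter({
  strategy: 'balanced',
  fallbackModels: ['primary-model', 'backup-model'],
  circuitBreaker: { enabled: true, failureThreshold: 5, resetTimeout: 60000 },
  healthCheck: { enabled: true, interval: 30000 },
  metrics: { enabled: true, endpoint: '/metrics' },
  security: { authentication: true }
});

await router.initialize();
```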
A: Yes, LLM-Runner-Router is open-source under the MIT license. You can use it freely for both commercial and non-commercial projects.
A: The router uses configurable strategies to select models:
```javascript
// Available strategies
const strategies = [
  'quality-first',   // Best output quality
  'cost-optimized',  // Minimize costs
  'speed-priority',  // Fastest response
  'balanced',        // Balance all factors
  'round-robin',     // Equal distribution
  'least-loaded',    // Load balancing
  'custom'           // Your own logic
];

router.setStrategy('balanced');
```
The router considers factors like model performance, cost, latency, and current load.
A: Yes, absolutely! The system supports:
```javascript
// Model ensemble
const result = await router.ensemble([
  { model: 'gpt-4', weight: 0.5 },
  { model: 'claude', weight: 0.3 },
  { model: 'llama', weight: 0.2 }
], prompt);

// Parallel comparison
const comparison = await router.compare(
  ['model1', 'model2', 'model3'],
  prompt
);
```
A: Several strategies are available:
```javascript
const router = new LLMRouter({
  // Memory mapping - don't load entire model into RAM
  memoryMapping: true,
  // Quantization - reduce model size
  quantization: 'q4_k_m',
  // Lazy loading - load models on demand
  lazyLoading: true,
  // Unload unused models
  unloadTimeout: 300000
});
```
A: Performance varies by model and configuration:
| Model Size | Format | Engine | Typical Speed |
|---|---|---|---|
| 7B | GGUF q4_k_m | WebGPU | 50-100 tok/s |
| 7B | GGUF q4_k_m | WASM | 15-25 tok/s |
| 13B | GGUF q4_k_m | WebGPU | 25-50 tok/s |
First token latency is typically 50-200ms depending on model size and engine.
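To see where your own setup lands relative to this table, here is a minimal timing sketch that uses only the `quick()` API shown in the quick-start below; the whitespace split is just a rough stand-in for a real token count.

```javascript
import LLMRouter from 'llm-runner-router';

const router = new LLMRouter({ preferredEngine: 'webgpu' });
await router.initialize();
await router.load('llama-7b-q4_k_m.gguf');

const start = performance.now();
const response = await router.quick('Explain quantization in one paragraph.');
const seconds = (performance.now() - start) / 1000;

// Whitespace-separated words are only an approximation of tokens.
const approxTokens = response.text.split(/\s+/).length;
console.log(`~${(approxTokens / seconds).toFixed(1)} tok/s over ${seconds.toFixed(2)}s`);
```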
A: Yes! Once models are downloaded locally:
```javascript
const router = new LLMRouter({
  offline: true,
  modelPaths: ['./local-models'],
  autoDownload: false
});
```
The system works entirely offline with local models.
A:
- Node.js: Version 16 or higher
- Memory: 4GB RAM minimum, 8GB+ recommended
- Storage: Depends on models (1-50GB per model)
- Browser: Modern browsers with WebAssembly support
For GPU acceleration:
- WebGPU: Chrome 113+, Firefox 115+
- CUDA: NVIDIA GPUs with CUDA 11.0+
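A small pre-flight check for the Node.js requirements above, using only built-in Node APIs (nothing router-specific):

```javascript
import os from 'os';

// Require Node.js 16+ before attempting to load any model.
const major = Number(process.versions.node.split('.')[0]);
if (major < 16) {
  throw new Error(`Node.js 16 or higher required, found ${process.version}`);
}

// Warn when total RAM is below the 4GB minimum.
const totalGB = os.totalmem() / 1024 ** 3;
if (totalGB < 4) {
  console.warn(`Only ${totalGB.toFixed(1)}GB RAM detected; 4GB is the minimum, 8GB+ recommended.`);
}
```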
A:
```bash
# Install
npm install llm-runner-router
```

```javascript
// Basic usage
import LLMRouter from 'llm-runner-router';

const router = new LLMRouter();
await router.initialize();

// Load a model
const model = await router.load('llama-7b-q4_k_m.gguf');

// Generate text
const response = await router.quick('Hello, how are you?');
console.log(response.text);
```
A: Models can be downloaded automatically or manually:
```javascript
// Automatic download
const model = await router.load({
  source: 'huggingface:microsoft/DialoGPT-medium',
  autoDownload: true
});

// Manual download with progress
const download = router.download('llama-7b-q4_k_m.gguf', {
  onProgress: (percent) => console.log(`${percent}% complete`)
});
```
A: Use environment-specific configurations:
```javascript
// Development
const devRouter = new LLMRouter({
  strategy: 'speed-priority',
  cache: { enabled: true, size: '100MB' },
  logging: { level: 'debug' }
});

// Production
const prodRouter = new LLMRouter({
  strategy: 'balanced',
  cache: { enabled: true, size: '1GB' },
  security: { authentication: true },
  monitoring: { enabled: true }
});
```
A: Yes, through the plugin system:
```javascript
// Custom loader
class MyFormatLoader extends BaseLoader {
  static format = 'myformat';
  static extensions = ['.myext'];

  async load(spec) {
    // Your loading logic
  }
}

// Register loader
router.registry.registerLoader('myformat', MyFormatLoader);
```
A:
```javascript
// Custom strategy
class SmartRouter {
  constructor(threshold = 0.5) {
    this.threshold = threshold;
  }

  route(models, context) {
    // Your routing logic: pick the first model scoring above the threshold
    return models.find(m => this.scoreModel(m, context) > this.threshold);
  }

  scoreModel(model, context) {
    // Replace with your own scoring heuristics
    return 0.5;
  }
}

router.addStrategy('smart', new SmartRouter());
router.setStrategy('smart');
```
A: Yes! Browser support includes:
```html
<!-- Include in HTML -->
<script src="https://cdn.jsdelivr.net/npm/llm-runner-router/dist/browser.js"></script>
<script>
  (async () => {
    const router = new LLMRouter({
      preferredEngine: 'webgpu' // or 'wasm'
    });

    // Use WebGPU for acceleration if available
    await router.initialize();
    const response = await router.quick('Hello world!');
  })();
</script>
```
A: The system handles browser security:
```javascript
const router = new LLMRouter({
  cors: {
    allowedOrigins: ['https://yourapp.com'],
    credentials: true
  },
  csp: {
    enabled: true,
    directives: {
      'worker-src': "'self'",
      'wasm-unsafe-eval': "'self'"
    }
  }
});
```
A: Yes, for offline functionality:
```javascript
// In service worker
importScripts('llm-runner-router-sw.js');

const router = new LLMRouter({
  serviceWorker: true,
  offline: true
});

// Handle inference requests offline
self.addEventListener('message', async (event) => {
  if (event.data.type === 'inference') {
    const result = await router.quick(event.data.prompt);
    event.ports[0].postMessage(result);
  }
});
```
A: Several optimization strategies are available:
```javascript
// 1. Use GPU acceleration
const gpuRouter = new LLMRouter({
  preferredEngine: 'webgpu'
});

// 2. Enable caching
const cachedRouter = new LLMRouter({
  cache: {
    enabled: true,
    maxSize: '1GB',
    strategies: ['lru', 'lfu']
  }
});

// 3. Batch requests
const results = await router.batch([
  'Question 1',
  'Question 2',
  'Question 3'
]);

// 4. Use smaller models
const model = await router.load({
  source: 'tinyllama.gguf', // Smaller but faster
  quantization: 'q4_0'
});
```
A:
```javascript
// Enable metrics
const router = new LLMRouter({
  metrics: {
    enabled: true,
    endpoint: '/metrics' // Prometheus format
  }
});

// Get performance stats
const metrics = router.getMetrics();
console.log({
  averageLatency: metrics.avgLatency,
  tokensPerSecond: metrics.tokensPerSec,
  cacheHitRate: metrics.cacheHitRate,
  memoryUsage: metrics.memoryUsage
});
```
A: Several memory management options are available:
```javascript
const router = new LLMRouter({
  // Limit total memory usage
  maxMemory: '4GB',

  // Use memory mapping
  memoryMapping: true,

  // Unload unused models
  autoUnload: true,
  unloadTimeout: 600000, // 10 minutes

  // Garbage collection tuning
  gcSettings: {
    maxOldGenerationSize: 4096, // 4GB
    maxYoungGenerationSize: 512 // 512MB
  }
});
```
A: Several deployment options are available:
```bash
# Docker
docker run -p 3000:3000 -v "$(pwd)/models:/app/models" llm-runner-router

# Kubernetes
kubectl apply -f k8s-deployment.yaml

# Docker Compose
docker-compose up -d
```
A:
```javascript
// Horizontal scaling with load balancer
const cluster = require('cluster');
const numCPUs = require('os').cpus().length;

if (cluster.isMaster) {
  for (let i = 0; i < numCPUs; i++) {
    cluster.fork();
  }
} else {
  const router = new LLMRouter({
    clustered: true,
    processId: process.pid
  });
  // Start server
}
```
A:
```javascript
const router = new LLMRouter({
  // Fallback models
  fallbackModels: ['primary-model', 'backup-model'],

  // Circuit breaker
  circuitBreaker: {
    enabled: true,
    failureThreshold: 5,
    resetTimeout: 60000
  },

  // Health checks
  healthCheck: {
    enabled: true,
    interval: 30000
  }
});
```
A: Check these common causes:
```javascript
// 1. Verify the file exists and check permissions
import fs from 'fs';
if (!fs.existsSync('model.gguf')) {
  console.error('Model file not found');
}

// 2. Check available memory
const usage = process.memoryUsage();
console.log('Available heap:', usage.heapTotal - usage.heapUsed);

// 3. Try again with debug logging
const router = new LLMRouter({
  debug: true,
  logLevel: 'debug'
});
```
A:
```javascript
// Check WebGPU support
if (!navigator.gpu) {
  console.log('WebGPU not supported, falling back to WASM');
  const router = new LLMRouter({
    preferredEngine: 'wasm'
  });
}

// Or use automatic fallback
const router = new LLMRouter({
  autoFallback: true,
  enginePriority: ['webgpu', 'wasm']
});
```
A:
```javascript
// Monitor memory usage periodically
setInterval(() => {
  const usage = process.memoryUsage();
  console.log('Memory usage:', {
    rss: Math.round(usage.rss / 1024 / 1024) + 'MB',
    heapUsed: Math.round(usage.heapUsed / 1024 / 1024) + 'MB'
  });
}, 30000);

// Set memory limits
const router = new LLMRouter({
  maxMemory: '4GB',
  memoryWarningThreshold: '3GB'
});

// Handle out-of-memory errors
process.on('uncaughtException', (err) => {
  if (err.code === 'ENOMEM') {
    console.error('Out of memory - consider a smaller model or stronger quantization');
    process.exit(1);
  }
});
```
A:
- GitHub Issues: Bug reports and feature requests
- GitHub Discussions: Questions and community support
- Documentation: Comprehensive docs and examples
- Email: echoaisystems@gmail.com for direct support
A:
- Fork the repository
- Create a feature branch
- Make your changes with tests
- Submit a pull request
- Follow the contribution guidelines
A: Current priorities include:
- Additional model format support (ONNX, Safetensors)
- More runtime engines
- Enhanced caching mechanisms
- Better monitoring and observability
- Plugin ecosystem development
A: Yes, commercial support options include:
- Priority support channels
- Custom integration services
- Professional training
- Enterprise consulting
- SLA guarantees
"Questions are the engines of intellect" - Frank Kingdon
Keep asking, keep learning, keep building 🚀
Built with 💙 by Echo AI Systems