Skip to content

Latest commit

 

History

History
523 lines (423 loc) · 11.8 KB

File metadata and controls

523 lines (423 loc) · 11.8 KB

❓ FAQ - Questions from Across the Multiverse

Every question is a doorway to deeper understanding

General Questions

Q: What is LLM-Runner-Router?

A: LLM-Runner-Router is a universal model orchestration system that provides format-agnostic model loading, intelligent routing, and seamless inference across different runtime environments (browser, Node.js, edge). It acts as a unified interface for multiple LLM formats and providers.

Q: Which model formats are supported?

A: Currently supported formats include:

  • GGUF/GGML (fully implemented)
  • ONNX (planned)
  • Safetensors (planned)
  • HuggingFace Hub (planned)
  • Custom formats (via plugin system)

Q: Can I use this in production?

A: The system is designed for production use with features like:

  • Load balancing and fallback mechanisms
  • Comprehensive security controls
  • Performance monitoring and metrics
  • Health checks and circuit breakers
  • Container-ready deployment

However, check the current implementation status for specific features you need.

Q: Is it free to use?

A: Yes, LLM-Runner-Router is open-source under the MIT license. You can use it freely for both commercial and non-commercial projects.

Technical Questions

Q: How does the routing system work?

A: The router uses configurable strategies to select models:

// Available strategies
const strategies = [
  'quality-first',    // Best output quality
  'cost-optimized',   // Minimize costs
  'speed-priority',   // Fastest response
  'balanced',         // Balance all factors
  'round-robin',      // Equal distribution
  'least-loaded',     // Load balancing
  'custom'            // Your own logic
];

router.setStrategy('balanced');

The router considers factors like model performance, cost, latency, and current load.

Q: Can I use multiple models simultaneously?

A: Yes, absolutely! The system supports:

// Model ensemble
const result = await router.ensemble([
  { model: 'gpt-4', weight: 0.5 },
  { model: 'claude', weight: 0.3 },
  { model: 'llama', weight: 0.2 }
], prompt);

// Parallel comparison
const comparison = await router.compare(
  ['model1', 'model2', 'model3'],
  prompt
);

Q: How do I handle large models that don't fit in memory?

A: Several strategies are available:

const router = new LLMRouter({
  // Memory mapping - don't load entire model into RAM
  memoryMapping: true,
  
  // Quantization - reduce model size
  quantization: 'q4_k_m',
  
  // Lazy loading - load models on demand
  lazyLoading: true,
  
  // Unload unused models
  unloadTimeout: 300000
});

Q: What's the performance like?

A: Performance varies by model and configuration:

Model Size Format Engine Typical Speed
7B GGUF q4_k_m WebGPU 50-100 tok/s
7B GGUF q4_k_m WASM 15-25 tok/s
13B GGUF q4_k_m WebGPU 25-50 tok/s

First token latency is typically 50-200ms depending on model size and engine.

Q: Can I run this offline?

A: Yes! Once models are downloaded locally:

const router = new LLMRouter({
  offline: true,
  modelPaths: ['./local-models'],
  autoDownload: false
});

The system works entirely offline with local models.

Setup and Installation

Q: What are the minimum system requirements?

A:

  • Node.js: Version 16 or higher
  • Memory: 4GB RAM minimum, 8GB+ recommended
  • Storage: Depends on models (1-50GB per model)
  • Browser: Modern browsers with WebAssembly support

For GPU acceleration:

  • WebGPU: Chrome 113+, Firefox 115+
  • CUDA: NVIDIA GPUs with CUDA 11.0+

Q: How do I install and get started?

A:

# Install
npm install llm-runner-router

# Basic usage
import LLMRouter from 'llm-runner-router';

const router = new LLMRouter();
await router.initialize();

// Load a model
const model = await router.load('llama-7b-q4_k_m.gguf');

// Generate text
const response = await router.quick('Hello, how are you?');
console.log(response.text);

Q: How do I download models?

A: Models can be downloaded automatically or manually:

// Automatic download
const model = await router.load({
  source: 'huggingface:microsoft/DialoGPT-medium',
  autoDownload: true
});

// Manual download with progress
const download = router.download('llama-7b-q4_k_m.gguf', {
  onProgress: (percent) => console.log(`${percent}% complete`)
});

Configuration and Customization

Q: How do I configure for different environments?

A: Use environment-specific configurations:

// Development
const devRouter = new LLMRouter({
  strategy: 'speed-priority',
  cache: { enabled: true, size: '100MB' },
  logging: { level: 'debug' }
});

// Production
const prodRouter = new LLMRouter({
  strategy: 'balanced',
  cache: { enabled: true, size: '1GB' },
  security: { authentication: true },
  monitoring: { enabled: true }
});

Q: Can I add custom models or formats?

A: Yes, through the plugin system:

// Custom loader
class MyFormatLoader extends BaseLoader {
  static format = 'myformat';
  static extensions = ['.myext'];
  
  async load(spec) {
    // Your loading logic
  }
}

// Register loader
router.registry.registerLoader('myformat', MyFormatLoader);

Q: How do I implement custom routing logic?

A:

// Custom strategy
class SmartRouter {
  route(models, context) {
    // Your routing logic
    const best = models.find(m => 
      this.scoreModel(m, context) > threshold
    );
    return best;
  }
}

router.addStrategy('smart', new SmartRouter());
router.setStrategy('smart');

Browser Usage

Q: Does it work in browsers?

A: Yes! Browser support includes:

<!-- Include in HTML -->
<script src="https://cdn.jsdelivr.net/npm/llm-runner-router/dist/browser.js"></script>

<script>
const router = new LLMRouter({
  preferredEngine: 'webgpu'  // or 'wasm'
});

// Use WebGPU for acceleration if available
await router.initialize();
const response = await router.quick('Hello world!');
</script>

Q: What about CORS and security in browsers?

A: The system handles browser security:

const router = new LLMRouter({
  cors: {
    allowedOrigins: ['https://yourapp.com'],
    credentials: true
  },
  csp: {
    enabled: true,
    directives: {
      'worker-src': "'self'",
      'wasm-unsafe-eval': "'self'"
    }
  }
});

Q: Can I use Service Workers?

A: Yes, for offline functionality:

// In service worker
importScripts('llm-runner-router-sw.js');

const router = new LLMRouter({
  serviceWorker: true,
  offline: true
});

// Handle inference requests offline
self.addEventListener('message', async (event) => {
  if (event.data.type === 'inference') {
    const result = await router.quick(event.data.prompt);
    event.ports[0].postMessage(result);
  }
});

Performance and Optimization

Q: How can I improve performance?

A: Several optimization strategies:

// 1. Use GPU acceleration
const router = new LLMRouter({
  preferredEngine: 'webgpu'
});

// 2. Enable caching
const router = new LLMRouter({
  cache: {
    enabled: true,
    maxSize: '1GB',
    strategies: ['lru', 'lfu']
  }
});

// 3. Batch requests
const results = await router.batch([
  'Question 1',
  'Question 2', 
  'Question 3'
]);

// 4. Use smaller models
const model = await router.load({
  source: 'tinyllama.gguf',  // Smaller but faster
  quantization: 'q4_0'
});

Q: How do I monitor performance?

A:

// Enable metrics
const router = new LLMRouter({
  metrics: { 
    enabled: true,
    endpoint: '/metrics'  // Prometheus format
  }
});

// Get performance stats
const metrics = router.getMetrics();
console.log({
  averageLatency: metrics.avgLatency,
  tokensPerSecond: metrics.tokensPerSec,
  cacheHitRate: metrics.cacheHitRate,
  memoryUsage: metrics.memoryUsage
});

Q: What about memory usage?

A: Memory management options:

const router = new LLMRouter({
  // Limit total memory usage
  maxMemory: '4GB',
  
  // Use memory mapping
  memoryMapping: true,
  
  // Unload unused models
  autoUnload: true,
  unloadTimeout: 600000,  // 10 minutes
  
  // Garbage collection tuning
  gcSettings: {
    maxOldGenerationSize: 4096,  // 4GB
    maxYoungGenerationSize: 512   // 512MB
  }
});

Deployment and Production

Q: How do I deploy to production?

A: Several deployment options:

# Docker
docker run -p 3000:3000 -v ./models:/app/models llm-runner-router

# Kubernetes
kubectl apply -f k8s-deployment.yaml

# Docker Compose
docker-compose up -d

Q: How do I handle scaling?

A:

// Horizontal scaling with load balancer
const cluster = require('cluster');
const numCPUs = require('os').cpus().length;

if (cluster.isMaster) {
  for (let i = 0; i < numCPUs; i++) {
    cluster.fork();
  }
} else {
  const router = new LLMRouter({
    clustered: true,
    processId: process.pid
  });
  // Start server
}

Q: What about high availability?

A:

const router = new LLMRouter({
  // Multiple model instances
  fallbackModels: ['primary-model', 'backup-model'],
  
  // Circuit breaker
  circuitBreaker: {
    enabled: true,
    failureThreshold: 5,
    resetTimeout: 60000
  },
  
  // Health checks
  healthCheck: {
    enabled: true,
    interval: 30000
  }
});

Troubleshooting

Q: Common error: "Model loading failed"

A: Check these common causes:

// 1. Verify file exists and permissions
import fs from 'fs';
if (!fs.existsSync('model.gguf')) {
  console.error('Model file not found');
}

// 2. Check available memory
const usage = process.memoryUsage();
console.log('Available heap:', usage.heapTotal - usage.heapUsed);

// 3. Try with debug logging
const router = new LLMRouter({ 
  debug: true,
  logLevel: 'debug' 
});

Q: "WebGPU not supported" error

A:

// Check WebGPU support
if (!navigator.gpu) {
  console.log('WebGPU not supported, falling back to WASM');
  const router = new LLMRouter({
    preferredEngine: 'wasm'
  });
}

// Or use automatic fallback
const router = new LLMRouter({
  autoFallback: true,
  enginePriority: ['webgpu', 'wasm']
});

Q: Memory errors and crashes

A:

// Monitor memory usage
process.on('memoryUsage', (usage) => {
  console.log('Memory usage:', {
    rss: Math.round(usage.rss / 1024 / 1024) + 'MB',
    heapUsed: Math.round(usage.heapUsed / 1024 / 1024) + 'MB'
  });
});

// Set memory limits
const router = new LLMRouter({
  maxMemory: '4GB',
  memoryWarningThreshold: '3GB'
});

// Handle out of memory
process.on('uncaughtException', (err) => {
  if (err.code === 'ENOMEM') {
    console.error('Out of memory - consider smaller model or quantization');
    process.exit(1);
  }
});

Community and Support

Q: Where can I get help?

A:

  • GitHub Issues: Bug reports and feature requests
  • GitHub Discussions: Questions and community support
  • Documentation: Comprehensive docs and examples
  • Email: echoaisystems@gmail.com for direct support

Q: How can I contribute?

A:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes with tests
  4. Submit a pull request
  5. Follow the contribution guidelines

Q: What's the development roadmap?

A: Current priorities include:

  • Additional model format support (ONNX, Safetensors)
  • More runtime engines
  • Enhanced caching mechanisms
  • Better monitoring and observability
  • Plugin ecosystem development

Q: Can I get commercial support?

A: Yes, commercial support options include:

  • Priority support channels
  • Custom integration services
  • Professional training
  • Enterprise consulting
  • SLA guarantees

"Questions are the engines of intellect" - Frank Kingdon

Keep asking, keep learning, keep building 🚀

Built with 💙 by Echo AI Systems