Spherical: Multi-GPU Inference Service Framework with Worker Pool Management
- Multi-GPU Support: Automatic load balancing across multiple GPUs
- Automatic Device Detection: Detects CUDA GPUs if available, falls back to CPU
- Worker Pool Management: Configurable workers per device
- Async Architecture: Built on asyncio for high throughput
- HTTP Server/Client: aiohttp-based server with health checks
- Dragon/Asyncflow Integration: Optional HPC runtime support for distributed execution
- Metrics Collection: Real-time throughput and device utilization tracking
- Extensible: Base classes for adding new model types
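The automatic device detection listed above can be approximated with a small helper. This is an illustrative sketch only (`detect_devices` is a hypothetical name, not the framework's API), assuming PyTorch is the source of GPU information when present:

```python
def detect_devices(max_gpus=None):
    """Return CUDA device strings if GPUs are available, else fall back to CPU."""
    try:
        import torch  # optional dependency: only needed for GPU detection
        if torch.cuda.is_available():
            n = torch.cuda.device_count()
            if max_gpus is not None:
                n = min(n, max_gpus)
            devices = [f"cuda:{i}" for i in range(n)]
            if devices:
                return devices
    except ImportError:
        pass
    return ["cpu"]
```

On a CPU-only machine (or without PyTorch installed) this returns `["cpu"]`; on a 4-GPU node it returns `["cuda:0", ..., "cuda:3"]`.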
```shell
# Basic installation
pip install -e .

# With ESM2 model support
pip install -e ".[esm2]"

# With Dragon/RADICAL support
pip install -e ".[dragon]"

# With development dependencies
pip install -e ".[dev]"

# Full installation
pip install -e ".[esm2,dragon,dev,plotting]"
```

```shell
# Start server mode (with HTTP endpoints)
python example/esm2/run_esm2_inference.py --mode server --config_file example/esm2/config.yaml

# Run local inference (no server)
python example/esm2/run_esm2_inference.py --mode local --config_file example/esm2/config.yaml
```

Edit example/esm2/config.yaml to configure:
```yaml
# Model Settings
model_path: "facebook/esm2_t33_650M_UR50D"

# GPU Configuration
num_services: 1
num_gpus_per_service: 4
num_workers_per_gpu: 2

# Server Settings
server_port: 8000

# Batch Settings
num_batches: 200
max_batch_tokens: 16000

# Execution Settings
debug: true
engine: dragon  # Enable Dragon HPC runtime
```

Project structure:

```text
spherical/
├── src/                          # Core library
│   ├── inference_service.py      # Base inference service + GPU workers
│   ├── server.py                 # HTTP server endpoints
│   ├── orchestrator.py           # Multi-node coordination
│   ├── logger.py                 # Logging utilities
│   └── utils.py                  # Helper functions
├── example/
│   └── esm2/                     # ESM2 example
│       ├── client.py             # HTTP client with load balancing
│       ├── esm2_service.py       # ESM2 service (re-export)
│       ├── run_esm2_inference.py # Entry point
│       └── config.yaml           # Configuration
├── tests/                        # Unit tests
└── doc/                          # Documentation
```
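The `max_batch_tokens` setting caps the padded size of each batch. The framework's actual batching lives in `inference_service.py`; the budget it enforces can be illustrated with a hypothetical greedy packer (names and logic here are illustrative, not the real implementation):

```python
def make_batches(seq_lengths, max_batch_tokens):
    """Greedily pack sequence lengths into batches whose padded token count
    (batch size x longest sequence in the batch) stays within the budget."""
    batches, current = [], []
    # Sorting longest-first keeps padding waste low within each batch.
    for length in sorted(seq_lengths, reverse=True):
        candidate = current + [length]
        # Padded cost: every sequence is padded to the longest in the batch.
        if max(candidate) * len(candidate) > max_batch_tokens and current:
            batches.append(current)
            current = [length]
        else:
            current = candidate
    if current:
        batches.append(current)
    return batches

# make_batches([100, 100, 100, 100, 100], 300) -> [[100, 100, 100], [100, 100]]
```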
Create a new service by extending InferenceService:

```python
from src.inference_service import InferenceService

class MyModelService(InferenceService):
    def _load_models(self):
        """Load your model onto each GPU."""
        for device in self.devices:
            # load_model() is a placeholder for your model loader.
            self.models[device] = load_model().to(device)

    def process_batch_sync(self, batch_id: int, device: str):
        """Run inference on a batch."""
        model = self.models[device]
        # Process the batch, then store the results and signal completion.
        self.reply_store[batch_id] = results
        self.processed_queue.put_nowait(batch_id)

    async def generate_batch(self) -> tuple:
        """Generate batches from the input queue."""
        seq = await self.input_queue.get()
        if seq is None:
            # A None sentinel marks end-of-input.
            raise StopAsyncIteration
        batch = tokenize(seq)
        return len(batch), batch
```

For HPC environments, Spherical supports the Dragon runtime with asyncflow:
```yaml
# Enable in config.yaml
engine: dragon
dragon_workers: 100
```

Run with Dragon:

```shell
dragon -w ssh --network-config slurm.yaml run_esm2_inference.py
```

Plot inference metrics:
```shell
python doc/plot_metrics.py --output_dir outputs
```

```shell
# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run tests with coverage
pytest --cov=src --cov-report=html

# Lint and format code
ruff check .
ruff format .
```

MIT License