
Spherical

Multi-GPU Inference Service Framework with Worker Pool Management.

Features

  • Multi-GPU Support: Automatic load balancing across multiple GPUs
  • Automatic Device Detection: Detects CUDA GPUs if available, falls back to CPU
  • Worker Pool Management: Configurable workers per device
  • Async Architecture: Built on asyncio for high throughput
  • HTTP Server/Client: aiohttp-based server with health checks
  • Dragon/Asyncflow Integration: Optional HPC runtime support for distributed execution
  • Metrics Collection: Real-time throughput and device utilization tracking
  • Extensible: Base classes for adding new model types
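The CUDA-with-CPU-fallback detection can be sketched in a few lines; this is a simplified illustration of the idea, not Spherical's actual implementation, and `detect_devices` is a hypothetical helper name:

```python
def detect_devices():
    """Return available CUDA device strings, falling back to CPU.

    Simplified sketch: the real framework may use different names
    and additional logic (e.g. per-service device assignment).
    """
    try:
        import torch  # treated as an optional dependency in this sketch
        if torch.cuda.is_available():
            return [f"cuda:{i}" for i in range(torch.cuda.device_count())]
    except ImportError:
        pass
    return ["cpu"]

print(detect_devices())  # e.g. ['cuda:0', 'cuda:1', ...] or ['cpu']
```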

Installation

# Basic installation
pip install -e .

# With ESM2 model support
pip install -e ".[esm2]"

# With Dragon/RADICAL support
pip install -e ".[dragon]"

# With development dependencies
pip install -e ".[dev]"

# Full installation
pip install -e ".[esm2,dragon,dev,plotting]"

Quick Start

Running the ESM2 Example

# Start server mode (with HTTP endpoints)
python example/esm2/run_esm2_inference.py --mode server --config_file example/esm2/config.yaml

# Run local inference (no server)
python example/esm2/run_esm2_inference.py --mode local --config_file example/esm2/config.yaml

Configuration

Edit example/esm2/config.yaml to configure:

# Model Settings
model_path: "facebook/esm2_t33_650M_UR50D"

# GPU Configuration
num_services: 1
num_gpus_per_service: 4
num_workers_per_gpu: 2

# Server Settings
server_port: 8000

# Batch Settings
num_batches: 200
max_batch_tokens: 16000

# Execution Settings
debug: true
engine: dragon      # Enable Dragon HPC runtime
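Assuming the pool settings compose multiplicatively, as their names suggest, the configuration above implies a total worker count equal to the product of the three values. A quick sanity check (the field names come from the config above; the helper function is illustrative):

```python
def total_workers(cfg: dict) -> int:
    """Total worker processes implied by the pool settings
    (assumes workers scale multiplicatively across services and GPUs)."""
    return (cfg["num_services"]
            * cfg["num_gpus_per_service"]
            * cfg["num_workers_per_gpu"])

cfg = {"num_services": 1, "num_gpus_per_service": 4, "num_workers_per_gpu": 2}
print(total_workers(cfg))  # 1 service x 4 GPUs x 2 workers = 8
```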

Architecture

spherical/
├── src/                       # Core library
│   ├── inference_service.py   # Base inference service + GPU workers
│   ├── server.py              # HTTP server endpoints
│   ├── orchestrator.py        # Multi-node coordination
│   ├── logger.py              # Logging utilities
│   └── utils.py               # Helper functions
├── example/
│   └── esm2/                  # ESM2 example
│       ├── client.py          # HTTP client with load balancing
│       ├── esm2_service.py    # ESM2 service (re-export)
│       ├── run_esm2_inference.py  # Entry point
│       └── config.yaml        # Configuration
├── tests/                     # Unit tests
└── doc/                       # Documentation
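The client-side load balancing handled by `example/esm2/client.py` can be as simple as cycling requests across service endpoints. A minimal round-robin sketch (the class, endpoint URLs, and method names are hypothetical, not the actual client API):

```python
import itertools

class RoundRobinBalancer:
    """Cycle requests across a fixed set of service endpoints."""

    def __init__(self, endpoints):
        self._cycle = itertools.cycle(endpoints)

    def next_endpoint(self) -> str:
        # Each call returns the next endpoint in round-robin order.
        return next(self._cycle)

lb = RoundRobinBalancer(["http://node0:8000", "http://node1:8000"])
print([lb.next_endpoint() for _ in range(4)])
# alternates between the two endpoints
```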

Extending for New Models

Create a new service by extending InferenceService:

from src.inference_service import InferenceService

class MyModelService(InferenceService):
    def _load_models(self):
        """Load your model onto each detected device."""
        for device in self.devices:
            # load_model() is a placeholder for your model's loading code
            self.models[device] = load_model().to(device)

    def process_batch_sync(self, batch_id: int, device: str):
        """Run inference on a batch (called from a GPU worker)."""
        model = self.models[device]
        # ... run the model and collect results ...
        self.reply_store[batch_id] = results
        self.processed_queue.put_nowait(batch_id)

    async def generate_batch(self) -> tuple:
        """Pull sequences from the input queue and build the next batch."""
        seq = await self.input_queue.get()
        if seq is None:  # sentinel: no more input
            raise StopAsyncIteration
        # tokenize() is a placeholder for your model's tokenizer
        batch = tokenize(seq)
        return len(batch), batch
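The `None` sentinel in `generate_batch` follows a common asyncio shutdown pattern: the producer enqueues `None` after the last item, and the consumer raises `StopAsyncIteration` to end iteration. A self-contained sketch of that pattern, independent of Spherical's classes:

```python
import asyncio

class BatchStream:
    """Async iterator that drains a queue until a None sentinel."""

    def __init__(self, queue: asyncio.Queue):
        self.queue = queue

    def __aiter__(self):
        return self

    async def __anext__(self):
        item = await self.queue.get()
        if item is None:  # sentinel: producer is done
            raise StopAsyncIteration
        return item

async def main():
    q = asyncio.Queue()
    for seq in ["MKT", "GAV", None]:  # None terminates the stream
        q.put_nowait(seq)
    async for batch in BatchStream(q):
        print(batch)

asyncio.run(main())  # prints MKT then GAV
```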

Dragon/Asyncflow Support

For HPC environments, Spherical supports Dragon runtime with asyncflow:

# Enable in config.yaml
engine: dragon
dragon_workers: 100

Run with Dragon:

dragon -w ssh --network-config slurm.yaml run_esm2_inference.py

Metrics & Visualization

Two plotting scripts live in src/plot/:

| Script | Input | Use case |
| --- | --- | --- |
| `src/plot/plot_dragon.py` | Dragon telemetry JSON (`checkpoint_metadata` + `metrics[]`) and/or inference `metrics_*.json` | ESM2 inference runs, Dragon campaign runs |
| `src/plot/plot_nvml.py` | NVML telemetry JSON (`nvml_checkpoint_*.json`) | SGDES and any workflow using `NvmlMonitor` |

Dragon telemetry — standalone mode

Plot GPU/CPU utilization from a single telemetry directory:

python src/plot/plot_dragon.py --telemetry-dir outputs/telemetry-results

Dragon telemetry — multi-run mode

Scan a parent directory for per-run output subdirectories, generate one plot per run and a throughput-vs-GPUs summary chart:

python src/plot/plot_dragon.py --output-dirs outputs --plots-dir plots

Each subdirectory may contain metrics_*.json (throughput timeseries) and/or a telemetry-results/ subdirectory (GPU/CPU utilization).

NVML telemetry

Plot GPU utilization and memory from NVML checkpoint files:

python src/plot/plot_nvml.py --telemetry-dir nvml-telemetry --output gpu_util.png

Prints a per-GPU summary table (mean/max utilization and memory) to stdout.
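The mean/max summary is easy to reproduce from raw utilization samples; a sketch of the computation, assuming a simple per-GPU list-of-percentages shape (the real NVML checkpoint JSON layout may differ):

```python
from statistics import mean

def summarize(samples: dict) -> dict:
    """Per-GPU (mean, max) utilization from raw percentage samples."""
    return {gpu: (mean(vals), max(vals)) for gpu, vals in samples.items()}

samples = {"GPU 0": [40.0, 60.0, 80.0], "GPU 1": [10.0, 30.0]}
print(summarize(samples))  # {'GPU 0': (60.0, 80.0), 'GPU 1': (20.0, 30.0)}
```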

Development

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run tests with coverage
pytest --cov=src --cov-report=html

# Lint and format code
ruff check .
ruff format .

License

MIT License