vLLM Weak Scaling Inference Test

This repository contains scripts for running weak scaling tests of vLLM inference on the Aurora supercomputer using Ensemble Launcher.

Overview

The weak scaling test runs independent instances of the same model, one per node. Each node hosts its own vLLM server and client, so the per-node workload stays fixed as nodes are added.
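As a rough illustration of what such a run measures: with a fixed per-node workload, the ideal wall time stays flat as nodes are added. The sketch below assumes you record a wall time per node count yourself. The function names and the timings are hypothetical and are not produced by the scripts in this repository.

```python
def weak_scaling_efficiency(t_one_node: float, t_n_nodes: float) -> float:
    """Weak-scaling efficiency: wall time on 1 node divided by wall time on N nodes.

    A value of 1.0 means the fixed per-node workload ran just as fast on N nodes.
    """
    if t_one_node <= 0 or t_n_nodes <= 0:
        raise ValueError("timings must be positive")
    return t_one_node / t_n_nodes


def total_prompts(nnodes: int, prompts_per_client: int) -> int:
    """Total prompts in a run: one client per node, each sending the same count."""
    return nnodes * prompts_per_client


if __name__ == "__main__":
    # Placeholder timings in seconds, for illustration only.
    print(weak_scaling_efficiency(120.0, 150.0))
    print(total_prompts(4, 100))
```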

Architecture

vLLM Server: Runs on each node with configurable GPU count
Client: Sends inference prompts to the local vLLM server
One-to-One Relation: Each client talks to its dedicated server
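
The one-to-one relation can be pictured as each client posting to a server on its own node's localhost. The sketch below is not the repository's client.py, which uses the OpenAI client. It uses only the Python standard library against vLLM's OpenAI-compatible /v1/completions endpoint. The function names, the max_tokens value, and the timeout are assumptions made for illustration.

```python
import json
import urllib.request


def build_completion_request(model: str, prompt: str, port: int = 8000,
                             max_tokens: int = 64) -> urllib.request.Request:
    """Build a POST to the OpenAI-compatible /v1/completions endpoint on localhost."""
    body = json.dumps({"model": model, "prompt": prompt, "max_tokens": max_tokens})
    return urllib.request.Request(
        url=f"http://localhost:{port}/v1/completions",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_prompt(model: str, prompt: str, port: int = 8000) -> str:
    """Send one prompt to the node-local server and return the generated text."""
    request = build_completion_request(model, prompt, port)
    # Assumed timeout; long generations may need more.
    with urllib.request.urlopen(request, timeout=600) as response:
        payload = json.load(response)
    return payload["choices"][0]["text"]
```

Because the URL always points at localhost, a client never talks to a server on another node, which keeps the instances independent.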

Project Structure

.
├── simple/                           # Simple mode: config-driven via el CLI
│   ├── submit_infer.sh               # PBS job submission script
│   ├── infer_no_ray.sh               # Worker script: launches server + client per node
│   ├── client.py                     # OpenAI client for sending prompts
│   └── create_ensemble_config.py     # Generates ensemble/system/launcher config JSONs
│
└── advanced/                         # Advanced mode: programmatic Python API
    ├── submit_infer.sh               # PBS job submission script
    ├── main.py                       # Orchestrator: model copy, server start, inference
    ├── start_vllm_server.sh          # Starts vLLM server on a node
    ├── stop_vllm_server.sh           # Stops vLLM server on a node
    ├── wait_for_vllm.py              # Polls until vLLM server is ready
    ├── download_model_hf.py          # Helper to download models from HuggingFace
    └── utils.py                      # Arg parsing, prompt submission, wait logic

Prerequisites

Environment: Activate your vLLM environment

source ~/.vllm_envs/inference/bin/activate

Ensemble Launcher (pull_workers branch): Clone and install

git clone https://github.com/argonne-lcf/ensemble_launcher.git
cd ensemble_launcher
git checkout pull_workers
pip install -e .
cd ..

Hugging Face Token: Set your token for model access
```
export HUGGINGFACE_HUB_TOKEN="your_token_here"
```
Model Cache: Ensure models are cached or use download_model_hf.py to fetch them

Modes

Simple Mode (`simple/`)

Config-driven approach using the el CLI. Each node runs infer_no_ray.sh which starts the vLLM server, waits for it to be ready, sends prompts via client.py, and cleans up.
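
The "waits for it to be ready" step might look like the polling loop below. This is a sketch, not the logic in infer_no_ray.sh: it assumes the server exposes vLLM's /health endpoint on localhost, and the function names, timeout, and polling interval are hypothetical.

```python
import time
import urllib.request
from typing import Callable


def server_is_up(port: int = 8000) -> bool:
    """Return True if the local server answers GET /health with HTTP 200."""
    try:
        with urllib.request.urlopen(f"http://localhost:{port}/health", timeout=5) as resp:
            return resp.status == 200
    except OSError:
        # URLError, refused connections, and socket timeouts are OSError subclasses.
        return False


def wait_until_ready(probe: Callable[[], bool], timeout_s: float = 3600.0,
                     interval_s: float = 5.0) -> bool:
    """Call probe() every interval_s seconds until it succeeds or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

A caller would pass `server_is_up` (or a lambda that binds a different port) as `probe` and abort the node's run if the function returns False.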

1. Generate Ensemble Configuration

cd simple/

# Basic usage: 1 node, 8 GPUs per instance, 10 prompts
python create_ensemble_config.py --nnodes 1 --ngpus 8 --num-prompts 10

# Scale to multiple nodes
python create_ensemble_config.py --nnodes 4 --ngpus 8 --num-prompts 100

# Use a different model
python create_ensemble_config.py --nnodes 2 --model meta-llama/Llama-3.1-70B-Instruct --ngpus 4

Arguments:

--nnodes: Number of nodes (ensemble size)
--ngpus: GPUs per model instance
--model: Model name from HuggingFace
--num-prompts: Number of prompts each client sends

2. Submit PBS Job

Edit simple/submit_infer.sh to set PBS parameters, then submit:

qsub simple/submit_infer.sh

The job runs el ensemble_config.json ..., which distributes infer_no_ray.sh across the nodes.

Advanced Mode (`advanced/`)

Programmatic approach that uses the Ensemble Launcher Python API directly. main.py is the orchestrator. It starts a live EnsembleLauncher cluster, copies the model to node-local /tmp with dsync for faster I/O, starts one vLLM server per node, and waits for readiness. It then fans out prompts across the cluster with ClusterClient.map() and finally shuts everything down.

Key differences from simple mode:

Dynamic task submission via ClusterClient, with no pre-generated JSON configs
Model staging: copies model from shared filesystem to node-local /tmp using dsync before server start, reducing load on the parallel filesystem
Parallel prompt dispatch: uses client.map() to submit all prompts concurrently across nodes
Lifecycle management: programmatic start/stop of servers within a single job

1. Edit Job Parameters (optional)

main.py reads its arguments from the command line. Defaults are set in utils.py:

--model: meta-llama/Llama-3.1-8B-Instruct
--port: 8000
--num-prompts: 10
--ngpus-per-model: 8
--cache-dir: /lus/flare/projects/datascience/hari/vllm_inference/.cache
--tmp-dir: /tmp
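
For orientation, the defaults listed above correspond to an argparse setup along these lines. This is a sketch, not the code in utils.py: the flag names and default values come from the list, while the function name, the types, and the description string are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the advanced-mode flags and defaults listed in this README."""
    parser = argparse.ArgumentParser(description="vLLM weak scaling test (advanced mode)")
    parser.add_argument("--model", default="meta-llama/Llama-3.1-8B-Instruct")
    parser.add_argument("--port", type=int, default=8000)
    parser.add_argument("--num-prompts", type=int, default=10)
    parser.add_argument("--ngpus-per-model", type=int, default=8)
    parser.add_argument("--cache-dir",
                        default="/lus/flare/projects/datascience/hari/vllm_inference/.cache")
    parser.add_argument("--tmp-dir", default="/tmp")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Note that the default --cache-dir is a project-specific path, so most users will want to pass their own.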

2. Submit PBS Job

Edit advanced/submit_infer.sh to set PBS parameters (select, walltime, queue, account), then submit:

cd advanced/
qsub submit_infer.sh

The job runs python3 main.py [args] directly; no config generation step is needed.

3. How `main.py` Works

Starts EnsembleLauncher with async_mpi children and async_processpool/async_mpi task executors
For each node: creates the local cache directory and runs dsync to copy the model from shared storage to /tmp
Starts one vLLM server per node via start_vllm_server.sh
Waits for all servers to respond (up to 1 hour)
Dispatches all prompts in parallel across the cluster via client.map(submit_prompt, ...)
Stops all vLLM servers via stop_vllm_server.sh
Stops the EnsembleLauncher cluster
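
The fan-out and shutdown steps above can be sketched with the standard library. This is a stand-in, not the Ensemble Launcher API: ThreadPoolExecutor.map replaces client.map(), whose real signature is not shown in this README. The signature of submit_prompt and the helper names fan_out and run_with_cleanup are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def fan_out(submit_prompt: Callable[[str], str], prompts: List[str],
            max_workers: int = 8) -> List[str]:
    """Run submit_prompt on every prompt concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(submit_prompt, prompts))


def run_with_cleanup(start: Callable[[], None], work: Callable[[], List[str]],
                     stop: Callable[[], None]) -> List[str]:
    """Start servers, do the work, and always stop the servers, even if work fails."""
    start()
    try:
        return work()
    finally:
        stop()
```

Putting the stop step in a `finally` block is one way to make sure servers are shut down when a prompt fails partway through the job.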

Monitor Logs

Simple mode logs are organized by GPU count and model:

ls simple/logs_8gpus/meta-llama/Llama-3.1-8B-Instruct/

# View server logs
tail -f simple/logs_8gpus/meta-llama/Llama-3.1-8B-Instruct/1.vllm.server.log

# View client logs
tail -f simple/logs_8gpus/meta-llama/Llama-3.1-8B-Instruct/1.vllm.client.log

Advanced mode logs are written per hostname under <hostname>/logs_<N>gpus/<model>/.

Ensemble Launcher logs are written to logs/ in the current directory.
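
When collecting results from several runs, it can help to build the simple-mode log paths programmatically. The helper below is hypothetical and only mirrors the layout shown in the commands above. What the leading number in the file name counts is not stated here, so `index` is an assumption.

```python
from pathlib import PurePosixPath


def simple_log_path(ngpus: int, model: str, index: int, role: str) -> PurePosixPath:
    """Build simple/logs_<N>gpus/<model>/<index>.vllm.<role>.log."""
    if role not in ("server", "client"):
        raise ValueError("role must be 'server' or 'client'")
    return PurePosixPath("simple") / f"logs_{ngpus}gpus" / model / f"{index}.vllm.{role}.log"


if __name__ == "__main__":
    print(simple_log_path(8, "meta-llama/Llama-3.1-8B-Instruct", 1, "server"))
```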

Environment Variables

Key environment variables set by the scripts:

HuggingFace:

HF_HOME, HF_HUB_CACHE: Cache directories
HF_TOKEN: Authentication token
HF_HUB_OFFLINE=1: Use cached models only (set after model is staged)

Intel/vLLM (Aurora-specific):

ZE_FLAT_DEVICE_HIERARCHY=FLAT
CCL_PROCESS_LAUNCHER=None
OCL_ICD_FILENAMES=libintelocl.so
VLLM_WORKER_MULTIPROC_METHOD=spawn
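
To reproduce this environment when launching a server from Python, the variables can be collected into a dictionary and passed to a subprocess. This is a sketch under assumptions: the Aurora-specific values are copied from the list above, while the helper name and the choice of `<cache_dir>/hub` for HF_HUB_CACHE are illustrative and may differ from what the scripts actually set.

```python
import os
from typing import Dict, Mapping, Optional

# Values copied from the Aurora-specific list in this README.
AURORA_VLLM_ENV = {
    "ZE_FLAT_DEVICE_HIERARCHY": "FLAT",
    "CCL_PROCESS_LAUNCHER": "None",
    "OCL_ICD_FILENAMES": "libintelocl.so",
    "VLLM_WORKER_MULTIPROC_METHOD": "spawn",
}


def server_env(cache_dir: str, offline: bool,
               base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the base environment with the vLLM and HF variables applied."""
    env = dict(os.environ if base is None else base)
    env.update(AURORA_VLLM_ENV)
    env["HF_HOME"] = cache_dir
    env["HF_HUB_CACHE"] = os.path.join(cache_dir, "hub")  # assumed layout
    if offline:
        # Only set once the model has been staged, so no download is attempted.
        env["HF_HUB_OFFLINE"] = "1"
    return env
```

The resulting dictionary can be passed as the `env` argument of `subprocess.Popen` without modifying the parent process's environment.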
