2 changes: 1 addition & 1 deletion benchmarks/profiler/deploy/profile_sla_moe_job.yaml
@@ -31,7 +31,7 @@ spec:
command: ["python", "-m", "benchmarks.profiler.profile_sla"]
args:
- --config
-  - /sgl-workspace/dynamo/recipes/deepseek-r1/sglang-wideep/tep16p-dep16d-disagg.yaml
+  - /sgl-workspace/dynamo/recipes/deepseek-r1/sglang/disagg-16gpu/deploy.yaml
- --output-dir
- /data/profiling_results
- --namespace
23 changes: 23 additions & 0 deletions recipes/CONTRIBUTING.md
@@ -0,0 +1,23 @@
# Recipes Contributing Guide

When adding new model recipes, ensure they follow the standard structure:
```text
<model-name>/
├── model-cache/
│   ├── model-cache.yaml
│   └── model-download.yaml
├── <framework>/
│   └── <deployment-mode>/
│       ├── deploy.yaml
│       └── perf.yaml (optional)
└── README.md (optional)
```

## Validation
The `run.sh` script expects this exact directory structure and will validate that the directories and files exist before deployment (a sketch of these checks follows the list):
- Model directory exists in `recipes/<model>/`
- Framework is one of the supported frameworks (vllm, sglang, trtllm)
- Framework directory exists in `recipes/<model>/<framework>/`
- Deployment directory exists in `recipes/<model>/<framework>/<deployment>/`
- Required files (`deploy.yaml`) exist in the deployment directory
- If present, performance benchmarks (`perf.yaml`) will be automatically executed
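
For reference, a minimal sketch of those checks (argument handling and messages are illustrative, not the actual `run.sh` internals):

```bash
#!/usr/bin/env bash
# Hypothetical validation sketch; the real run.sh may differ in detail.
MODEL="$1" FRAMEWORK="$2" DEPLOYMENT="$3"

[ -d "recipes/${MODEL}" ] || { echo "unknown model: ${MODEL}" >&2; exit 1; }

case "${FRAMEWORK}" in
  vllm|sglang|trtllm) ;;
  *) echo "unsupported framework: ${FRAMEWORK}" >&2; exit 1 ;;
esac

DIR="recipes/${MODEL}/${FRAMEWORK}/${DEPLOYMENT}"
[ -d "${DIR}" ] || { echo "missing deployment dir: ${DIR}" >&2; exit 1; }
[ -f "${DIR}/deploy.yaml" ] || { echo "missing ${DIR}/deploy.yaml" >&2; exit 1; }

# perf.yaml is optional; when present, the benchmark runs after deployment
[ -f "${DIR}/perf.yaml" ] && echo "perf.yaml found; benchmark will be executed"
```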
283 changes: 240 additions & 43 deletions recipes/README.md
@@ -1,88 +1,285 @@
# Dynamo Model Serving Recipes

This repository contains production-ready recipes for deploying large language models using the Dynamo platform. Each recipe includes deployment configurations, performance benchmarking, and model caching setup.

## Contents
- [Available Models](#available-models)
- [Quick Start](#quick-start)
- [Prerequisites](#prerequisites)
- Deployment Methods
  - [Option 1: Automated Deployment](#option-1-automated-deployment)
  - [Option 2: Manual Deployment](#option-2-manual-deployment)


## Available Models

| Model Family | Framework | Deployment Mode              | GPU Requirements | Status | Benchmark |
|--------------|-----------|------------------------------|------------------|--------|-----------|
| llama-3-70b  | vllm      | agg                          | 4x H100/H200     | ✅     | ✅        |
| llama-3-70b  | vllm      | disagg (1 node)              | 8x H100/H200     | ✅     | ✅        |
| llama-3-70b  | vllm      | disagg (multi-node)          | 16x H100/H200    | ✅     | ✅        |
| deepseek-r1  | sglang    | disagg (1 node, wide-ep)     | 8x H200          | ✅     | 🚧        |
| deepseek-r1  | sglang    | disagg (multi-node, wide-ep) | 16x H200         | ✅     | 🚧        |
| gpt-oss-120b | trtllm    | agg                          | 4x GB200         | ✅     | ✅        |

**Legend:**
- ✅ Functional
- 🚧 Under development


**Recipe Directory Structure:**

Recipes are organized in a directory structure that follows this pattern:
```text
<model-name>/
├── model-cache/
│   ├── model-cache.yaml           # PVC for model cache
│   └── model-download.yaml        # Job for model download
├── <framework>/
│   └── <deployment-mode>/
│       ├── deploy.yaml            # DynamoGraphDeployment CRD and optional configmap for custom configuration
│       └── perf.yaml (optional)   # Performance benchmark
└── README.md (optional)           # Model documentation
```
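
As a concrete instance, the DeepSeek-R1 16-GPU SGLang recipe referenced elsewhere in this repository (`recipes/deepseek-r1/sglang/disagg-16gpu/deploy.yaml`) maps onto this pattern as follows (assuming the standard model-cache files):

```text
deepseek-r1/
├── model-cache/
│   ├── model-cache.yaml
│   └── model-download.yaml
└── sglang/
    └── disagg-16gpu/
        └── deploy.yaml
```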

## Quick Start

Follow the instructions in the [Prerequisites](#prerequisites) section to set up your environment.

Then choose your preferred deployment method: the automated `run.sh` script (Option 1) or the manual deployment steps (Option 2).


## Prerequisites

### 1. Environment Setup

Create a Kubernetes namespace and set the `NAMESPACE` environment variable; it is used by the later deployment and benchmarking steps:

```bash
export NAMESPACE=your-namespace
kubectl create namespace ${NAMESPACE}
```

### 2. Deploy Dynamo Platform

Install the Dynamo Cloud Platform following the [Quickstart Guide](../docs/kubernetes/README.md).

### 3. GPU Cluster

Ensure your Kubernetes cluster has:
- GPU nodes with appropriate GPU types (see model requirements above)
- GPU operator installed
- Sufficient GPU memory and compute resources
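
One quick way to confirm GPU capacity (assuming the device plugin exposes the `nvidia.com/gpu` resource, as the GPU operator does):

```bash
# Show allocatable GPUs per node
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
```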

### 4. Container Registry Access

Ensure access to the NVIDIA container registry for the runtime images:
- `nvcr.io/nvidia/ai-dynamo/vllm-runtime:x.y.z`
- `nvcr.io/nvidia/ai-dynamo/trtllm-runtime:x.y.z`
- `nvcr.io/nvidia/ai-dynamo/sglang-runtime:x.y.z`
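
To verify registry access ahead of time, a quick pull check (log in with `docker login nvcr.io` first if required; substitute a released tag for the `x.y.z` placeholder):

```bash
# Confirm the runtime image for your chosen framework can be pulled
docker pull nvcr.io/nvidia/ai-dynamo/vllm-runtime:x.y.z
```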

### 5. HuggingFace Access and Kubernetes Secret Creation

Set up a Kubernetes secret with your HuggingFace token for model download; the deployments reference it as `envFromSecret: hf-token-secret`:

```bash
# Update the token in the secret file
vim hf_hub_secret/hf_hub_secret.yaml

# Apply the secret
kubectl apply -f hf_hub_secret/hf_hub_secret.yaml -n ${NAMESPACE}
```
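
Alternatively, the same secret can be created directly from the CLI; a sketch assuming the key name `HF_TOKEN` (verify the key your `deploy.yaml` actually expects):

```bash
# Creates the hf-token-secret referenced by the deployments via envFromSecret;
# the HF_TOKEN key name is an assumption -- match it to your manifests
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN=<your-huggingface-token> \
  -n ${NAMESPACE}
```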

### 6. Configure Storage Class

Configure persistent storage for model caching:

```bash
# Check available storage classes
kubectl get storageclass
```

Replace the placeholder storage class name with your actual storage class in `<model>/model-cache/model-cache.yaml`:

```yaml
# In <model>/model-cache/model-cache.yaml
spec:
  storageClassName: "your-actual-storage-class"  # Replace this
```
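
For orientation, a minimal sketch of the shape such a model-cache PVC takes (names, sizes, and access modes here are illustrative; the recipe's actual `model-cache.yaml` is authoritative):

```yaml
# Illustrative PVC only; consult the recipe's model-cache.yaml for real values
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
spec:
  accessModes:
    - ReadWriteMany                # shared by all pods that mount the cache
  storageClassName: "your-actual-storage-class"
  resources:
    requests:
      storage: 300Gi               # size depends on the model weights
```
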
## Option 1: Automated Deployment

Use the `run.sh` script for fully automated deployment:

**Note:** The script automatically:
- Creates the model cache PVC and downloads the model
- Deploys the model service
- Runs the performance benchmark if a `perf.yaml` file is present in the deployment directory


#### Script Usage

```bash
./run.sh [OPTIONS] --model <model> --framework <framework> --deployment <deployment-type>
```

**Required Options:**
- `--model <model>`: Model name matching the directory name in the recipes directory (e.g., llama-3-70b, gpt-oss-120b, deepseek-r1)
- `--framework <framework>`: Backend framework (`vllm`, `trtllm`, `sglang`)
- `--deployment <deployment-type>`: Deployment mode (e.g., agg, disagg, disagg-single-node, disagg-multi-node)

**Optional:**
- `--namespace <namespace>`: Kubernetes namespace (default: dynamo)
- `--dry-run`: Show commands without executing them
- `-h, --help`: Show help message

**Environment Variables:**
- `NAMESPACE`: Kubernetes namespace (default: dynamo)

#### Example Usage
```bash
# Set up environment
export NAMESPACE=your-namespace
kubectl create namespace ${NAMESPACE}

# Configure HuggingFace token
kubectl apply -f hf_hub_secret/hf_hub_secret.yaml -n ${NAMESPACE}

# Deploy Llama-3-70B with vLLM (aggregated mode)
./run.sh --model llama-3-70b --framework vllm --deployment agg

# Deploy GPT-OSS-120B with TensorRT-LLM
./run.sh --model gpt-oss-120b --framework trtllm --deployment agg

# Deploy DeepSeek-R1 with SGLang (disaggregated mode)
./run.sh --model deepseek-r1 --framework sglang --deployment disagg

# Deploy with custom namespace
./run.sh --namespace my-namespace --model llama-3-70b --framework vllm --deployment agg

# Dry run to see what would be executed
./run.sh --dry-run --model llama-3-70b --framework vllm --deployment agg
```

## Option 2: Manual Deployment

For step-by-step manual deployment, follow these steps:

Example:

```bash
# 0. Set up environment (see Prerequisites section)
export NAMESPACE=your-namespace
kubectl create namespace ${NAMESPACE}
kubectl apply -f hf_hub_secret/hf_hub_secret.yaml -n ${NAMESPACE}
# 1. Download model (see Model Download section)
kubectl apply -n $NAMESPACE -f <model>/model-cache/
# 2. Deploy model (see Deployment section)
kubectl apply -n $NAMESPACE -f <model>/<framework>/<mode>/deploy.yaml
# 3. Run benchmarks (optional, if perf.yaml exists)
kubectl apply -n $NAMESPACE -f <model>/<framework>/<mode>/perf.yaml
```

### Step 1: Download Model

```bash
# Start the download job
kubectl apply -n $NAMESPACE -f <model>/model-cache
# Verify job creation
kubectl get jobs -n $NAMESPACE | grep model-download
```

Monitor and wait for the model download to complete:

```bash
# Wait for job completion (timeout after 100 minutes)
kubectl wait --for=condition=Complete job/model-download -n $NAMESPACE --timeout=6000s
# Check job status
kubectl get job model-download -n $NAMESPACE
# View download logs
kubectl logs job/model-download -n $NAMESPACE
```

### Step 2: Deploy Model Service

```bash
# Navigate to the specific deployment configuration
cd <model>/<framework>/<deployment-mode>/
# Deploy the model service
kubectl apply -n $NAMESPACE -f deploy.yaml
# Verify deployment creation
kubectl get deployments -n $NAMESPACE
```

#### Wait for Deployment Ready

```bash
# Get deployment name from the deploy.yaml file
DEPLOYMENT_NAME=$(grep "name:" deploy.yaml | head -1 | awk '{print $2}')
# Wait for deployment to be ready (timeout after 20 minutes)
kubectl wait --for=condition=available deployment/$DEPLOYMENT_NAME -n $NAMESPACE --timeout=1200s
# Check deployment status
kubectl get deployment $DEPLOYMENT_NAME -n $NAMESPACE
# Check pod status
kubectl get pods -n $NAMESPACE -l app=$DEPLOYMENT_NAME
```

#### Verify Model Service

```bash
# Check if service is running
kubectl get services -n $NAMESPACE
# Test model endpoint (port-forward to test locally)
kubectl port-forward service/${DEPLOYMENT_NAME}-frontend 8000:8000 -n $NAMESPACE
# Test the model API (in another terminal)
curl http://localhost:8000/v1/models
# Stop port-forward when done
pkill -f "kubectl port-forward"
```
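
While the port-forward above is still running, you can also send a small inference request; a sketch assuming the frontend exposes the OpenAI-compatible chat completions route (use a model id returned by `/v1/models`):

```bash
# Minimal smoke test; replace the model id with one listed by /v1/models
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "<model-id-from-/v1/models>",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 32
      }'
```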

### Step 3: Performance Benchmarking (Optional)

Run performance benchmarks to evaluate the deployment. Note that benchmarking is only available for recipes that include a `perf.yaml` file:

#### Launch Benchmark Job

```bash
# From the deployment directory
kubectl apply -n $NAMESPACE -f perf.yaml
# Verify benchmark job creation
kubectl get jobs -n $NAMESPACE
```

#### Monitor Benchmark Progress

```bash
# Get benchmark job name
PERF_JOB_NAME=$(grep "name:" perf.yaml | head -1 | awk '{print $2}')
# Monitor benchmark logs in real-time
kubectl logs -f job/$PERF_JOB_NAME -n $NAMESPACE
# Wait for benchmark completion (timeout after 100 minutes)
kubectl wait --for=condition=Complete job/$PERF_JOB_NAME -n $NAMESPACE --timeout=6000s
```

#### View Benchmark Results

```bash
# Check final benchmark results
kubectl logs job/$PERF_JOB_NAME -n $NAMESPACE | tail -50
```
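
When you are finished, resources can be removed with the `kubectl delete` counterparts of the apply commands above (a sketch; substitute the paths of the recipe you deployed, and keep the model cache if you plan to redeploy):

```bash
# Remove the benchmark job, the model deployment, and (optionally) the model cache
kubectl delete -n $NAMESPACE -f <model>/<framework>/<mode>/perf.yaml
kubectl delete -n $NAMESPACE -f <model>/<framework>/<mode>/deploy.yaml
kubectl delete -n $NAMESPACE -f <model>/model-cache/   # also deletes the cache PVC
```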