fix: consistent model recipes and update simplified doc #3858
Merged
Changes from all commits (13 commits, all by biswapanda):

- `a96738c` fix
- `f778f0d` update docs
- `1dd9d7e` fix dsr1
- `e0f9123` fix dsr1 and oss-gpt120b models
- `02b8b03` fix
- `b90c5b6` update run.sh
- `bc0b77a` update readme
- `406e7bc` update
- `1570665` fix
- `2a82e67` update
- `8940416` update
- `19924d8` update
- `28498af` fix tests
**New file** (diff `@@ -0,0 +1,23 @@`):

# Recipes Contributing Guide
When adding new model recipes, ensure they follow the standard structure:

```text
<model-name>/
├── model-cache/
│   ├── model-cache.yaml
│   └── model-download.yaml
├── <framework>/
│   └── <deployment-mode>/
│       ├── deploy.yaml
│       └── perf.yaml (optional)
└── README.md (optional)
```
## Validation

The `run.sh` script expects this exact directory structure and validates that the directories and files exist before deployment:
- Model directory exists in `recipes/<model>/`
- Framework is one of the supported frameworks (vllm, sglang, trtllm)
- Framework directory exists in `recipes/<model>/<framework>/`
- Deployment directory exists in `recipes/<model>/<framework>/<deployment>/`
- Required files (`deploy.yaml`) exist in the deployment directory
- If present, performance benchmarks (`perf.yaml`) will be automatically executed
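The checks above can be sketched in shell. This is an illustrative reconstruction, not the actual `run.sh`; the `RECIPES_DIR` variable and `validate_recipe` function name are assumptions:

```bash
# Illustrative sketch of the validation described above; the real run.sh may
# differ. RECIPES_DIR and validate_recipe are assumed names.
RECIPES_DIR="${RECIPES_DIR:-recipes}"

validate_recipe() {
  model="$1"; framework="$2"; deployment="$3"

  # Framework must be one of the supported backends
  case "$framework" in
    vllm|sglang|trtllm) ;;
    *) echo "unsupported framework: $framework" >&2; return 1 ;;
  esac

  # Directory and file checks mirroring the bullet list above
  [ -d "$RECIPES_DIR/$model" ] || { echo "missing model dir" >&2; return 1; }
  [ -d "$RECIPES_DIR/$model/$framework" ] || { echo "missing framework dir" >&2; return 1; }
  [ -d "$RECIPES_DIR/$model/$framework/$deployment" ] || { echo "missing deployment dir" >&2; return 1; }
  [ -f "$RECIPES_DIR/$model/$framework/$deployment/deploy.yaml" ] || { echo "missing deploy.yaml" >&2; return 1; }

  echo "ok"
}
```

Failing fast on a missing directory gives a clear error before any Kubernetes resources are created.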
**Updated file** (diff `@@ -1,88 +1,285 @@`):
# Dynamo Model Serving Recipes

This repository contains production-ready recipes for deploying large language models using the Dynamo platform. Each recipe includes deployment configurations, performance benchmarking, and model caching setup.
## Contents

- [Available Models](#available-models)
- [Quick Start](#quick-start)
- [Prerequisites](#prerequisites)
- Deployment Methods
  - [Option 1: Automated Deployment](#option-1-automated-deployment)
  - [Option 2: Manual Deployment](#option-2-manual-deployment)

## Available Models
| Model Family | Framework | Deployment Mode              | GPU Requirements | Status | Benchmark |
|--------------|-----------|------------------------------|------------------|--------|-----------|
| llama-3-70b  | vllm      | agg                          | 4x H100/H200     | ✅     | ✅        |
| llama-3-70b  | vllm      | disagg (1 node)              | 8x H100/H200     | ✅     | ✅        |
| llama-3-70b  | vllm      | disagg (multi-node)          | 16x H100/H200    | ✅     | ✅        |
| deepseek-r1  | sglang    | disagg (1 node, wide-ep)     | 8x H200          | ✅     | 🚧        |
| deepseek-r1  | sglang    | disagg (multi-node, wide-ep) | 16x H200         | ✅     | 🚧        |
| gpt-oss-120b | trtllm    | agg                          | 4x GB200         | ✅     | ✅        |

**Legend:**
- ✅ Functional
- 🚧 Under development
**Recipe Directory Structure:**

Recipes are organized in a directory structure that follows this pattern:

```text
<model-name>/
├── model-cache/
│   ├── model-cache.yaml    # PVC for model cache
│   └── model-download.yaml # Job for model download
├── <framework>/
│   └── <deployment-mode>/
│       ├── deploy.yaml          # DynamoGraphDeployment CRD and optional ConfigMap for custom configuration
│       └── perf.yaml (optional) # Performance benchmark
└── README.md (optional)         # Model documentation
```
## Quick Start

Follow the instructions in the [Prerequisites](#prerequisites) section to set up your environment, then choose your preferred deployment method: the automated `run.sh` script or manual deployment steps.
## Prerequisites

### 1. Environment Setup

Create a Kubernetes namespace and set the `NAMESPACE` environment variable; it is used in later steps to deploy and performance-test the model:

```bash
export NAMESPACE=your-namespace
kubectl create namespace ${NAMESPACE}
```
### 2. Deploy Dynamo Platform

Install the Dynamo Cloud Platform following the [Quickstart Guide](../docs/kubernetes/README.md).

### 3. GPU Cluster

Ensure your Kubernetes cluster has:
- GPU nodes with the appropriate GPU types (see model requirements above)
- GPU operator installed
- Sufficient GPU memory and compute resources

### 4. Container Registry Access

Ensure access to the NVIDIA container registry for the runtime images:
- `nvcr.io/nvidia/ai-dynamo/vllm-runtime:x.y.z`
- `nvcr.io/nvidia/ai-dynamo/trtllm-runtime:x.y.z`
- `nvcr.io/nvidia/ai-dynamo/sglang-runtime:x.y.z`
### 5. HuggingFace Access and Kubernetes Secret Creation

Set up a Kubernetes secret with your HuggingFace token for model download (the deployments reference it as `envFromSecret: hf-token-secret`):

```bash
# Update the token in the secret file
vim hf_hub_secret/hf_hub_secret.yaml

# Apply the secret
kubectl apply -f hf_hub_secret/hf_hub_secret.yaml -n ${NAMESPACE}
```
### 6. Configure Storage Class

Configure persistent storage for model caching. The model cache PVC stores model weights so they are downloaded only once:

```bash
# Check available storage classes
kubectl get storageclass
```

Replace "your-storage-class-name" with your actual storage class in `<model>/model-cache/model-cache.yaml`:

```yaml
# In <model>/model-cache/model-cache.yaml
spec:
  storageClassName: "your-actual-storage-class"  # Replace this
```
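For orientation, a model-cache PVC of the kind described above might look like the sketch below. Only `storageClassName` is discussed in this guide; the resource name, access mode, and size are illustrative assumptions, not the actual recipe contents:

```yaml
# Hypothetical sketch of <model>/model-cache/model-cache.yaml.
# name, accessModes, and storage size are assumptions for illustration.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
spec:
  accessModes:
    - ReadWriteMany          # shared across pods (assumption)
  storageClassName: "your-actual-storage-class"  # Replace this
  resources:
    requests:
      storage: 300Gi         # size the claim for your model's weights
```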
## Option 1: Automated Deployment

Use the `run.sh` script for fully automated deployment.

**Note:** The script automatically:
- Creates the model cache PVC and downloads the model
- Deploys the model service
- Runs the performance benchmark if a `perf.yaml` file is present in the deployment directory
#### Script Usage

```bash
./run.sh [OPTIONS] --model <model> --framework <framework> --deployment <deployment-type>
```

**Required Options:**
- `--model <model>`: Model name matching the directory name in the recipes directory (e.g., `llama-3-70b`, `gpt-oss-120b`, `deepseek-r1`)
- `--framework <framework>`: Backend framework (`vllm`, `trtllm`, `sglang`)
- `--deployment <deployment-type>`: Deployment mode (e.g., `agg`, `disagg`, `disagg-single-node`, `disagg-multi-node`)

**Optional Options:**
- `--namespace <namespace>`: Kubernetes namespace (default: `dynamo`)
- `--dry-run`: Show commands without executing them
- `-h, --help`: Show help message

**Environment Variables:**
- `NAMESPACE`: Kubernetes namespace (default: `dynamo`)
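Flag handling of this shape is typically a `case` loop over the arguments. The sketch below is illustrative, not the actual `run.sh` implementation; the `parse_args` function name is an assumption:

```bash
# Illustrative sketch of parsing the options listed above; not the real run.sh.
parse_args() {
  MODEL="" FRAMEWORK="" DEPLOYMENT="" NAMESPACE="${NAMESPACE:-dynamo}" DRY_RUN=0
  while [ $# -gt 0 ]; do
    case "$1" in
      --model)      MODEL="$2"; shift 2 ;;
      --framework)  FRAMEWORK="$2"; shift 2 ;;
      --deployment) DEPLOYMENT="$2"; shift 2 ;;
      --namespace)  NAMESPACE="$2"; shift 2 ;;
      --dry-run)    DRY_RUN=1; shift ;;
      -h|--help)    echo "usage: run.sh [OPTIONS] --model <m> --framework <f> --deployment <d>"; return 0 ;;
      *)            echo "unknown option: $1" >&2; return 1 ;;
    esac
  done
  # All three required flags must be present
  [ -n "$MODEL" ] && [ -n "$FRAMEWORK" ] && [ -n "$DEPLOYMENT" ]
}
```

Defaulting `NAMESPACE` from the environment matches the behavior documented above: the `--namespace` flag overrides the `NAMESPACE` variable, which in turn defaults to `dynamo`.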
#### Example Usage

```bash
# Set up environment
export NAMESPACE=your-namespace
kubectl create namespace ${NAMESPACE}

# Configure HuggingFace token
kubectl apply -f hf_hub_secret/hf_hub_secret.yaml -n ${NAMESPACE}

# Deploy Llama-3-70B with vLLM (aggregated mode)
./run.sh --model llama-3-70b --framework vllm --deployment agg

# Deploy GPT-OSS-120B with TensorRT-LLM
./run.sh --model gpt-oss-120b --framework trtllm --deployment agg

# Deploy DeepSeek-R1 with SGLang (disaggregated mode)
./run.sh --model deepseek-r1 --framework sglang --deployment disagg

# Deploy with custom namespace
./run.sh --namespace my-namespace --model llama-3-70b --framework vllm --deployment agg

# Dry run to see what would be executed
./run.sh --dry-run --model llama-3-70b --framework vllm --deployment agg
```
## Option 2: Manual Deployment

For step-by-step manual deployment, follow these steps:

```bash
# 0. Set up environment (see Prerequisites section)
export NAMESPACE=your-namespace
kubectl create namespace ${NAMESPACE}
kubectl apply -f hf_hub_secret/hf_hub_secret.yaml -n ${NAMESPACE}

# 1. Download model (see Model Download section)
kubectl apply -n $NAMESPACE -f <model>/model-cache/

# 2. Deploy model (see Deployment section)
kubectl apply -n $NAMESPACE -f <model>/<framework>/<mode>/deploy.yaml

# 3. Run benchmarks (optional, if perf.yaml exists)
kubectl apply -n $NAMESPACE -f <model>/<framework>/<mode>/perf.yaml
```
### Step 1: Download Model

```bash
# Start the download job
kubectl apply -n $NAMESPACE -f <model>/model-cache

# Verify job creation
kubectl get jobs -n $NAMESPACE | grep model-download
```

Monitor and wait for the model download to complete:

```bash
# Wait for job completion (timeout after 100 minutes)
kubectl wait --for=condition=Complete job/model-download -n $NAMESPACE --timeout=6000s

# Check job status
kubectl get job model-download -n $NAMESPACE

# View download logs
kubectl logs job/model-download -n $NAMESPACE
```
### Step 2: Deploy Model Service

```bash
# Navigate to the specific deployment configuration
cd <model>/<framework>/<deployment-mode>/

# Deploy the model service
kubectl apply -n $NAMESPACE -f deploy.yaml

# Verify deployment creation
kubectl get deployments -n $NAMESPACE
```

#### Wait for Deployment Ready

```bash
# Get deployment name from the deploy.yaml file
DEPLOYMENT_NAME=$(grep "name:" deploy.yaml | head -1 | awk '{print $2}')

# Wait for deployment to be ready (timeout after 20 minutes)
kubectl wait --for=condition=available deployment/$DEPLOYMENT_NAME -n $NAMESPACE --timeout=1200s

# Check deployment status
kubectl get deployment $DEPLOYMENT_NAME -n $NAMESPACE

# Check pod status
kubectl get pods -n $NAMESPACE -l app=$DEPLOYMENT_NAME
```
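To see what the `grep | head | awk` pipeline above extracts, here is a worked example against a minimal, made-up `deploy.yaml`. The manifest contents are illustrative assumptions; only the `DynamoGraphDeployment` kind comes from this guide:

```bash
# Minimal made-up manifest to demonstrate the name extraction;
# apiVersion, name, and spec are assumptions for illustration only.
cat > /tmp/deploy-sample.yaml <<'EOF'
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: llama3-70b-agg
spec: {}
EOF

# Same pipeline as above: first "name:" occurrence, second whitespace-separated field
DEPLOYMENT_NAME=$(grep "name:" /tmp/deploy-sample.yaml | head -1 | awk '{print $2}')
echo "$DEPLOYMENT_NAME"
```

Note the pipeline takes the first `name:` in the file, so it assumes `metadata.name` appears before any other `name:` key in the manifest.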
#### Verify Model Service

```bash
# Check if the service is running
kubectl get services -n $NAMESPACE

# Test the model endpoint (port-forward to test locally)
kubectl port-forward service/${DEPLOYMENT_NAME}-frontend 8000:8000 -n $NAMESPACE

# Test the model API (in another terminal)
curl http://localhost:8000/v1/models

# Stop port-forward when done
pkill -f "kubectl port-forward"
```
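The `/v1/models` endpoint typically follows the OpenAI-compatible list format, so the response can be parsed mechanically. A sketch with a sample payload (the payload below is made up for illustration, not real frontend output):

```bash
# Sample OpenAI-style /v1/models payload; in practice you would pipe
# `curl -s http://localhost:8000/v1/models` into the same parser.
RESPONSE='{"object":"list","data":[{"id":"meta-llama/Llama-3-70B","object":"model"}]}'

# Extract the model IDs from the "data" array
echo "$RESPONSE" | python3 -c 'import json,sys; [print(m["id"]) for m in json.load(sys.stdin)["data"]]'
```

Seeing your model's ID in this list confirms the frontend registered the deployment.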
### Step 3: Performance Benchmarking (Optional)

Run performance benchmarks to evaluate model performance. Benchmarking is only available for recipes that include a `perf.yaml` file:

#### Launch Benchmark Job

```bash
# From the deployment directory
kubectl apply -n $NAMESPACE -f perf.yaml

# Verify benchmark job creation
kubectl get jobs -n $NAMESPACE
```
#### Monitor Benchmark Progress

```bash
# Get benchmark job name
PERF_JOB_NAME=$(grep "name:" perf.yaml | head -1 | awk '{print $2}')

# Monitor benchmark logs in real-time
kubectl logs -f job/$PERF_JOB_NAME -n $NAMESPACE

# Wait for benchmark completion (timeout after 100 minutes)
kubectl wait --for=condition=Complete job/$PERF_JOB_NAME -n $NAMESPACE --timeout=6000s
```

#### View Benchmark Results

```bash
# Check final benchmark results
kubectl logs job/$PERF_JOB_NAME -n $NAMESPACE | tail -50
```
File renamed without changes.