
# Best Deployment Practices

This document provides examples of best practices for deploying models using EMD for various use cases.

## Popular Models

### Qwen Series

#### QwQ-32B

```bash
emd deploy --model-id QwQ-32B --instance-type g5.12xlarge --engine-type vllm --service-type sagemaker
```

## Deploying to Specific GPU Types

Choosing the right GPU type is critical for performance and cost-efficiency. Use the `--instance-type` parameter to specify the GPU instance.

### Example: Deploying Qwen2.5-7B-Instruct on g5.2xlarge

```bash
emd deploy --model-id Qwen2.5-7B-Instruct --instance-type g5.2xlarge --engine-type vllm --service-type sagemaker
```

## Achieving Longer Context Windows

To enable longer context windows, use the `--extra-params` option with engine-specific parameters.

### Example: Deploying a model with a 16k context window

```bash
emd deploy --model-id Qwen2.5-7B-Instruct --instance-type g5.4xlarge --engine-type vllm --service-type sagemaker --extra-params '{
  "engine_params": {
    "vllm_cli_args": "--max_model_len 16000 --max_num_seqs 4"
  }
}'
```

### Example: Deploying a model on a G4dn instance

```bash
emd deploy --model-id Qwen2.5-14B-Instruct-AWQ --instance-type g4dn.2xlarge --engine-type vllm --service-type sagemaker --extra-params '{
  "engine_params": {
    "environment_variables": "export VLLM_ATTENTION_BACKEND=XFORMERS && export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True",
    "default_cli_args": " --chat-template emd/models/chat_templates/qwen_2d5_add_prefill_chat_template.jinja --max_model_len 12000 --max_num_seqs 10 --gpu_memory_utilization 0.95 --disable-log-stats --enable-auto-tool-choice --tool-call-parser hermes"
  }
}'
```

## Common Troubleshooting

If your deployment fails due to out-of-memory issues, try:

- Using a larger instance type
- Reducing `max_model_len` and `max_num_seqs` in the engine parameters
- Setting a lower `gpu_memory_utilization` value (e.g., 0.8 instead of the default)
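
Putting these tips together, a retry after an out-of-memory failure might look like the following sketch. The model ID, instance type, and parameter values are illustrative assumptions, not recommendations; tune them to your model size and GPU memory:

```bash
# Hypothetical retry with a smaller context window, fewer concurrent
# sequences, and a more conservative GPU memory budget.
emd deploy --model-id Qwen2.5-7B-Instruct --instance-type g5.4xlarge --engine-type vllm --service-type sagemaker --extra-params '{
  "engine_params": {
    "vllm_cli_args": "--max_model_len 8000 --max_num_seqs 2 --gpu_memory_utilization 0.8"
  }
}'
```

If the deployment still fails, step up to a larger instance type before raising these limits again.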