diff --git a/vllm_configs/README.md b/vllm_configs/README.md
new file mode 100644
index 0000000..86f2784
--- /dev/null
+++ b/vllm_configs/README.md
@@ -0,0 +1,166 @@
+# Launching vLLM endpoints
+
+This folder contains a Kubernetes deployment example (`k8.yaml`) and guidance
+for launching vLLM endpoints that can serve LALMs (large audio language models).
+
+Use either of the two recommended approaches below:
+- Local: run a vLLM server on your workstation or VM (good for development).
+- Kubernetes: deploy the provided `k8.yaml` to a GPU-capable cluster.
+
+Keep these high-level notes in mind:
+- Do NOT commit real secrets (Hugging Face tokens) into source control. Use
+  Kubernetes Secrets or environment variables stored securely.
+- The `k8.yaml` file leaves the container image as a placeholder. Replace it
+  with an image that has the required audio dependencies (ffmpeg, soundfile,
+  librosa, torchaudio, any model-specific libs) before applying.
+- The example exposes ports 8000..8007. If you only need a single instance,
+  reducing the number of containers/ports in the Pod is fine.
+
+**Useful links**
+
+- vLLM docs (overview & quickstart): https://docs.vllm.ai/en/latest/getting_started/quickstart/
+- vLLM CLI `serve` docs: https://docs.vllm.ai/en/latest/cli/serve/
+- vLLM Kubernetes / deployment docs: https://docs.vllm.ai/en/latest/deployment/k8s/
+- vLLM audio / multimodal docs and examples:
+  - Audio assets API: https://docs.vllm.ai/en/latest/api/vllm/assets/audio/
+  - Audio example (offline / language + audio): https://docs.vllm.ai/en/latest/examples/offline_inference/audio_language/
+
+These audio-specific links describe how vLLM handles audio assets, the required
+dependencies, and example code for audio-language workflows.
+
+## **A. Local (development)**
+
+1) Prerequisites
+
+- A GPU node or a machine with a compatible PyTorch/CUDA setup (or CPU only for small models).
+- Python 3.10+; a virtual environment is recommended.
+- A Hugging Face token with access to the model, set in `HUGGING_FACE_HUB_TOKEN`.
+
+2) Install vLLM (recommended minimal steps)
+
+```bash
+# create & activate a venv (python -m venv shown below; the vLLM docs also show uv)
+python -m venv .venv
+source .venv/bin/activate
+pip install --upgrade pip
+# install vllm and choose a torch backend if needed
+pip install vllm --upgrade
+
+# macOS (Homebrew):
+brew install ffmpeg libsndfile
+pip install soundfile librosa torchaudio
+
+# Ubuntu/Debian:
+sudo apt-get update && sudo apt-get install -y ffmpeg libsndfile1
+pip install soundfile librosa torchaudio
+```
+
+3) Start the server
+
+The vLLM CLI provides a `serve` entrypoint that starts an OpenAI-compatible HTTP
+server. Example:
+
+```bash
+# serve an HF model on localhost:8000
+# --trust_remote_code mirrors the args used in k8.yaml for this model; only set
+# it when you trust the model repo
+export HUGGING_FACE_HUB_TOKEN="<your-hf-token>"
+vllm serve microsoft/Phi-4-multimodal-instruct --port 8000 --host 0.0.0.0 --trust_remote_code
+```
+
+Notes:
+- Use `--api-key` or set `VLLM_API_KEY` if you want the server to require an API key.
+- Many LALMs need additional Python packages or system libraries. Commonly
+  required packages: `soundfile`, `librosa`, `torchaudio`, and system
+  `ffmpeg`/`libsndfile`. The exact requirements depend on the model and any
+  tokenizer/preprocessor it uses. Check the model's Hugging Face page and the
+  vLLM audio docs linked above.
+- If you plan to use GPU acceleration, ensure a compatible PyTorch/CUDA
+  combination is installed in the environment (or use vLLM Docker images with
+  prebuilt CUDA support). If you run into missing symbols, check CUDA/PyTorch
+  compatibility and rebuild or pick a different image.
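+
+Once the server is up, you can sanity-check it before pointing anything else at
+it. The snippet below is a minimal sketch assuming the defaults used above
+(localhost, port 8000, no API key); `/health` and `/v1/models` are routes of
+vLLM's OpenAI-compatible server.
+
+```bash
+# liveness check (same endpoint the k8.yaml readiness probes use)
+curl http://localhost:8000/health
+
+# list the models registered on this server
+curl http://localhost:8000/v1/models
+```
+
+If you started the server with `--api-key`/`VLLM_API_KEY`, requests to the
+`/v1` routes additionally need an `Authorization: Bearer <key>` header.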
+
+4) Point `run_configs` to the local endpoint
+
+Update your run config to use the local server URL (example YAML snippet):
+
+```yaml
+# example run_configs entry
+# For OpenAI-compatible API calls use endpoints like /v1/completions or /v1/chat/completions
+url: "http://localhost:8000/v1/completions"
+```
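+
+If you call the endpoint directly rather than through `run_configs`, audio is
+attached with OpenAI-style multimodal content parts on `/v1/chat/completions`.
+The curl sketch below is illustrative only: the audio URL is a placeholder, and
+the exact content schema accepted (for example `audio_url` versus base64
+`input_audio`) depends on your vLLM version and model, so check the audio docs
+linked above.
+
+```bash
+# hypothetical audio request against the local server started in step 3
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+        "model": "microsoft/Phi-4-multimodal-instruct",
+        "messages": [{
+          "role": "user",
+          "content": [
+            {"type": "text", "text": "Transcribe this audio clip."},
+            {"type": "audio_url", "audio_url": {"url": "https://example.com/sample.wav"}}
+          ]
+        }],
+        "max_tokens": 128
+      }'
+```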
+
+## **B. Kubernetes — use the provided `k8.yaml`**
+
+What the example does:
+
+- Launches a single Pod template containing multiple vLLM containers (ports 8000..8007).
+- Each container is configured with the same model and listens on a distinct port.
+- A `Service` of type `NodePort` exposes the Pod ports on the cluster nodes.
+
+Pre-apply checklist (LALMs)
+
+1. Replace the placeholder image in `k8.yaml`:
+
+   - Set each container's `image:` field (left empty as a placeholder) to an
+     image that includes:
+     - vLLM installed
+     - Python audio libs used by your model: `soundfile`, `librosa`, `torchaudio`, etc.
+     - System binaries: `ffmpeg` and `libsndfile` (or equivalents).
+
+2. Secrets: create a Kubernetes Secret for your Hugging Face token, e.g.:
+
+```bash
+kubectl -n <namespace> create secret generic hf-token \
+  --from-literal=HUGGING_FACE_HUB_TOKEN='<your-hf-token>'
+```
+
+(The provided manifests use the `default` namespace; adjust `-n <namespace>` accordingly.)
+Then update the `k8.yaml` container env to use `valueFrom.secretKeyRef` instead of a plain `value`.
+
+3. Cluster requirements
+
+- GPU-enabled nodes and drivers (matching the image / CUDA version).
+- If you use Run:AI or another custom scheduler, set `schedulerName` in the Pod
+  spec so it matches your cluster (the provided example does not set one, so
+  the default scheduler is used).
+
+Apply the example
+
+```bash
+# make any replacements (image, secret references), then:
+kubectl apply -f vllm_configs/k8.yaml
+
+# monitor rollout
+kubectl -n <namespace> rollout status deployment/infer-phi4-multimodal-instruct
+kubectl -n <namespace> get pods -l app=infer-phi4-multimodal-instruct
+```
+
+Accessing the service
+
+- The `Service` in `k8.yaml` is `NodePort`. To see which node ports your cluster assigned,
+  run:
+
+```bash
+kubectl -n <namespace> get svc infer-phi4-multimodal-instruct-service -o wide
+```
+
+- You can then use `http://<node-ip>:<node-port>` for the port you want (8000..8007 map to
+  cluster node ports); see the example below. For production, consider exposing
+  via `LoadBalancer` or an Ingress.
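+
+A quick check against one of the exposed ports (the node IP and assigned node
+port are placeholders):
+
+```bash
+curl http://<node-ip>:<node-port>/v1/models
+```
+
+Note that the containers set `--served-model-name infer-phi4-multimodal-instruct`,
+so request bodies against these endpoints should use that model name.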
+
+Troubleshooting:
+- Check container logs: `kubectl -n <namespace> logs <pod-name> -c deployment0` (replace the container name as needed).
+- If the model fails to load: check `HUGGING_FACE_HUB_TOKEN`, image CUDA/PyTorch compatibility, and
+  that `--trust_remote_code` is set only when you trust the model repo.
+
diff --git a/vllm_configs/k8.yaml b/vllm_configs/k8.yaml
new file mode 100644
index 0000000..a55abb2
--- /dev/null
+++ b/vllm_configs/k8.yaml
@@ -0,0 +1,516 @@
+############################################################
+# vLLM Kubernetes Deployment config
+#
+# This file defines a Kubernetes Deployment and Service to run
+# multiple vLLM containers (one per GPU/port) for serving the
+# `microsoft/Phi-4-multimodal-instruct` model using a vLLM-based
+# serving image. The Deployment creates 8 containers (deployment0
+# .. deployment7) within a single Pod template; each container
+# listens on its own port (8000..8007). The Service exposes
+# those ports as NodePort so the cluster nodes can receive traffic.
+#
+# Important notes:
+# - Replace the placeholder `HUGGING_FACE_HUB_TOKEN` values with a
+#   Kubernetes `Secret` or mount a token securely; avoid committing
+#   real tokens into source control.
+# - `replicas: 1` means one Pod; each Pod here contains multiple
+#   containers (one per device/port). If you want multiple worker
+#   Pods, increase `replicas` and ensure you have sufficient GPUs.
+# - Resource `requests`/`limits` are per-container. Adjust CPU,
+#   memory and `nvidia.com/gpu` counts to match your hardware.
+# - `--tensor-parallel-size` is currently set to `1`. If you want
+#   model parallelism across GPUs, change this and coordinate
+#   the deployment accordingly.
+# - The container `image:` fields are left empty as placeholders. Set
+#   them to an image containing the necessary audio dependencies
+#   before running.
+############################################################
+
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: infer-phi4-multimodal-instruct
+  namespace: default
+spec:
+  replicas: 1  # Number of Pod replicas. Each Pod contains multiple containers (one per port/GPU).
+  selector:
+    matchLabels:
+      app: infer-phi4-multimodal-instruct
+  template:
+    metadata:
+      annotations:
+        sidecar.istio.io/inject: "false"
+      labels:
+        app: infer-phi4-multimodal-instruct
+    spec:
+      volumes:
+        - name: dshm
+          emptyDir:
+            medium: Memory
+
+      containers:
+        # The Pod runs multiple vLLM containers (deployment0..deployment7).
+        # Each container runs the same image with a different `--port` and
+        # container name. This pattern allows exposing multiple ports from
+        # a single Pod (useful when packing multiple GPUs on one node).
+        - name: deployment0
+          image:
+          args:
+            [
+              "--model",
+              "microsoft/Phi-4-multimodal-instruct",
+              "--served-model-name",
+              "infer-phi4-multimodal-instruct",
+              "--dtype",
+              "auto",
+              "--disable-log-requests",
+              "--port",
+              "8000",
+              "--tokenizer-mode",
+              "auto",
+              "--max-model-len",
+              "32768",
+              "--tensor-parallel-size",
+              "1",
+              "--gpu-memory-utilization",
+              "0.9",
+              "--trust_remote_code"
+            ]
+          readinessProbe:
+            httpGet:
+              path: /health
+              port: 8000
+            initialDelaySeconds: 5
+            periodSeconds: 60
+          ports:
+            - containerPort: 8000
+              name: vllm-port-0
+          env:
+            - name: HUGGING_FACE_HUB_TOKEN
+              # IMPORTANT: Replace this with a reference to a Kubernetes Secret
+              # (e.g. mount a secret or use `valueFrom.secretKeyRef`) rather
+              # than committing tokens into source control.
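+              # A minimal sketch of the Secret-based alternative (assumes the
+              # `hf-token` Secret from the README exists in this namespace):
+              #   valueFrom:
+              #     secretKeyRef:
+              #       name: hf-token
+              #       key: HUGGING_FACE_HUB_TOKEN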
"microsoft/Phi-4-multimodal-instruct", + "--served-model-name", + "infer-phi4-multimodal-instruct", + "--dtype", + "auto", + "--disable-log-requests", + "--port", + "8005", + "--tokenizer-mode", + "auto", + "--max-model-len", + "32768", + "--tensor-parallel-size", + "1", + "--gpu-memory-utilization", + "0.9", + "--trust_remote_code" + ] + readinessProbe: + httpGet: + path: /health + port: 8005 + initialDelaySeconds: 5 + periodSeconds: 60 + ports: + - containerPort: 8005 + name: vllm-port-5 + env: + - name: HUGGING_FACE_HUB_TOKEN + value: "" + resources: + requests: + cpu: "8" + memory: "64Gi" + nvidia.com/gpu: "1" + limits: + cpu: "8" + memory: "64Gi" + nvidia.com/gpu: "1" + volumeMounts: + - name: dshm + mountPath: /dev/shm + + + + - name: deployment6 + image: + args: + [ + "--model", + "microsoft/Phi-4-multimodal-instruct", + "--served-model-name", + "infer-phi4-multimodal-instruct", + "--dtype", + "auto", + "--disable-log-requests", + "--port", + "8006", + "--tokenizer-mode", + "auto", + "--max-model-len", + "32768", + "--tensor-parallel-size", + "1", + "--gpu-memory-utilization", + "0.9", + "--trust_remote_code" + ] + readinessProbe: + httpGet: + path: /health + port: 8006 + initialDelaySeconds: 5 + periodSeconds: 60 + ports: + - containerPort: 8006 + name: vllm-port-6 + env: + - name: HUGGING_FACE_HUB_TOKEN + value: "" + resources: + requests: + cpu: "8" + memory: "64Gi" + nvidia.com/gpu: "1" + limits: + cpu: "8" + memory: "64Gi" + nvidia.com/gpu: "1" + volumeMounts: + - name: dshm + mountPath: /dev/shm + + + - name: deployment7 + image: + args: + [ + "--model", + "microsoft/Phi-4-multimodal-instruct", + "--served-model-name", + "infer-phi4-multimodal-instruct", + "--dtype", + "auto", + "--disable-log-requests", + "--port", + "8007", + "--tokenizer-mode", + "auto", + "--max-model-len", + "32768", + "--tensor-parallel-size", + "1", + "--gpu-memory-utilization", + "0.9", + "--trust_remote_code" + ] + readinessProbe: + httpGet: + path: /health + port: 8007 + initialDelaySeconds: 5 + periodSeconds: 60 + ports: + - containerPort: 8007 + name: vllm-port-7 + env: + - name: HUGGING_FACE_HUB_TOKEN + value: "" + resources: + requests: + cpu: "8" + memory: "64Gi" + nvidia.com/gpu: "1" + limits: + cpu: "8" + memory: "64Gi" + nvidia.com/gpu: "1" + volumeMounts: + - name: dshm + mountPath: /dev/shm + + + + affinity: + podAntiAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + - labelSelector: + matchExpressions: + - key: app + operator: In + values: + - infer-phi4-multimodal-instruct + # Prevent multiple Pods of this app from being scheduled on the + # same node where possible (spreads pods across nodes). 
+              topologyKey: kubernetes.io/hostname
+
+---
+
+apiVersion: v1
+kind: Service
+metadata:
+  name: infer-phi4-multimodal-instruct-service
+  namespace: default
+spec:
+  type: NodePort
+  ports:
+    - name: http-infer-phi4-multimodal-instruct-0
+      port: 8000
+      protocol: TCP
+      targetPort: 8000
+    - name: http-infer-phi4-multimodal-instruct-1
+      port: 8001
+      protocol: TCP
+      targetPort: 8001
+    - name: http-infer-phi4-multimodal-instruct-2
+      port: 8002
+      protocol: TCP
+      targetPort: 8002
+    - name: http-infer-phi4-multimodal-instruct-3
+      port: 8003
+      protocol: TCP
+      targetPort: 8003
+    - name: http-infer-phi4-multimodal-instruct-4
+      port: 8004
+      protocol: TCP
+      targetPort: 8004
+    - name: http-infer-phi4-multimodal-instruct-5
+      port: 8005
+      protocol: TCP
+      targetPort: 8005
+    - name: http-infer-phi4-multimodal-instruct-6
+      port: 8006
+      protocol: TCP
+      targetPort: 8006
+    - name: http-infer-phi4-multimodal-instruct-7
+      port: 8007
+      protocol: TCP
+      targetPort: 8007
+  selector:
+    app: infer-phi4-multimodal-instruct
\ No newline at end of file