diff --git a/vllm_configs/README.md b/vllm_configs/README.md
new file mode 100644
index 0000000..86f2784
--- /dev/null
+++ b/vllm_configs/README.md
@@ -0,0 +1,166 @@
+# Launching vLLM endpoints
+
+This folder contains a Kubernetes deployment example (`k8.yaml`) and guidance
+for launching vLLM endpoints that can serve LALMs (large audio language models).
+
+Use either of the two recommended approaches below:
+- Local: run a vLLM server on your workstation or VM (good for development).
+- Kubernetes: deploy the provided `k8.yaml` to a GPU-capable cluster.
+
+Keep these high-level notes in mind:
+- Do NOT commit real secrets (Hugging Face tokens) into source control. Use
+  Kubernetes Secrets or environment variables stored securely.
+- The `k8.yaml` file leaves the container image as a placeholder. Replace it
+  with an image that has the required audio dependencies (ffmpeg, soundfile,
+  librosa, torchaudio, any model-specific libs) before applying.
+- The example exposes ports 8000..8007. If you only need a single instance,
+  reducing the number of containers/ports in the Pod is fine.
+
+**Useful links**
+
+- vLLM docs (overview & quickstart): https://docs.vllm.ai/en/latest/getting_started/quickstart/
+- vLLM CLI `serve` docs: https://docs.vllm.ai/en/latest/cli/serve/
+- vLLM Kubernetes / deployment docs: https://docs.vllm.ai/en/latest/deployment/k8s/
+- vLLM audio / multimodal docs and examples:
+  - Audio assets API: https://docs.vllm.ai/en/latest/api/vllm/assets/audio/
+  - Audio example (offline / language + audio): https://docs.vllm.ai/en/latest/examples/offline_inference/audio_language/
+
+These audio-specific links describe how vLLM handles audio assets, the required
+dependencies, and example code for audio-language workflows.
+
+## **A. Local (development)**
+
+1) Prerequisites
+
+- A GPU node or a machine with a compatible PyTorch/CUDA setup (or CPU only for small models).
+- Python 3.10+; a virtual environment is recommended.
+- A Hugging Face token with access to the model, set in `HUGGING_FACE_HUB_TOKEN`.
+
+2) Install vLLM (recommended minimal steps)
+
+```bash
+# create & activate a venv (python -m venv shown below; the vLLM docs also show uv)
+python -m venv .venv
+source .venv/bin/activate
+pip install --upgrade pip
+# install vllm and choose a torch backend if needed
+pip install vllm --upgrade
+
+# macOS (Homebrew):
+brew install ffmpeg libsndfile
+pip install soundfile librosa torchaudio
+
+# Ubuntu/Debian:
+sudo apt-get update && sudo apt-get install -y ffmpeg libsndfile1
+pip install soundfile librosa torchaudio
+```
+
+3) Start the server
+
+The vLLM CLI provides a `serve` entrypoint that starts an OpenAI-compatible HTTP
+server. Example:
+
+```bash
+# serve an HF model on localhost:8000
+# --trust_remote_code mirrors the args used in k8.yaml for this model; only set
+# it when you trust the model repo
+export HUGGING_FACE_HUB_TOKEN="<your-hf-token>"
+vllm serve microsoft/Phi-4-multimodal-instruct --port 8000 --host 0.0.0.0 --trust_remote_code
+```
+
+Notes:
+- Use `--api-key` or set `VLLM_API_KEY` if you want the server to require an API key.
+- Many LALMs need additional Python packages or system libraries. Commonly
+  required packages: `soundfile`, `librosa`, `torchaudio`, and system
+  `ffmpeg`/`libsndfile`. The exact requirements depend on the model and any
+  tokenizer/preprocessor it uses. Check the model's Hugging Face page and the
+  vLLM audio docs linked above.
+- If you plan to use GPU acceleration, ensure a compatible PyTorch/CUDA
+  combination is installed in the environment (or use vLLM Docker images with
+  prebuilt CUDA support). If you run into missing symbols, check CUDA/PyTorch
+  compatibility and rebuild or pick a different image.
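+
+Once the server is up, you can sanity-check it before pointing anything else at
+it. The snippet below is a minimal sketch assuming the defaults used above
+(localhost, port 8000, no API key); `/health` and `/v1/models` are routes of
+vLLM's OpenAI-compatible server.
+
+```bash
+# liveness check (same endpoint the k8.yaml readiness probes use)
+curl http://localhost:8000/health
+
+# list the models registered on this server
+curl http://localhost:8000/v1/models
+```
+
+If you started the server with `--api-key`/`VLLM_API_KEY`, requests to the
+`/v1` routes additionally need an `Authorization: Bearer <key>` header.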
+
+4) Point `run_configs` to the local endpoint
+
+Update your run config to use the local server URL (example YAML snippet):
+
+```yaml
+# example run_configs entry
+# For OpenAI-compatible API calls use endpoints like /v1/completions or /v1/chat/completions
+url: "http://localhost:8000/v1/completions"
+```
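+
+If you call the endpoint directly rather than through `run_configs`, audio is
+attached with OpenAI-style multimodal content parts on `/v1/chat/completions`.
+The curl sketch below is illustrative only: the audio URL is a placeholder, and
+the exact content schema accepted (for example `audio_url` versus base64
+`input_audio`) depends on your vLLM version and model, so check the audio docs
+linked above.
+
+```bash
+# hypothetical audio request against the local server started in step 3
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+        "model": "microsoft/Phi-4-multimodal-instruct",
+        "messages": [{
+          "role": "user",
+          "content": [
+            {"type": "text", "text": "Transcribe this audio clip."},
+            {"type": "audio_url", "audio_url": {"url": "https://example.com/sample.wav"}}
+          ]
+        }],
+        "max_tokens": 128
+      }'
+```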
+
+## **B. Kubernetes — use the provided `k8.yaml`**
+
+What the example does:
+
+- Launches a single Pod template containing multiple vLLM containers (ports 8000..8007).
+- Each container is configured with the same model and listens on a distinct port.
+- A `Service` of type `NodePort` exposes the Pod ports on the cluster nodes.
+
+Pre-apply checklist (LALMs)
+
+1. Replace the placeholder image in `k8.yaml`:
+
+   - Set each container's `image:` field (left empty as a placeholder) to an
+     image that includes:
+     - vLLM installed
+     - Python audio libs used by your model: `soundfile`, `librosa`, `torchaudio`, etc.
+     - System binaries: `ffmpeg` and `libsndfile` (or equivalents).
+
+2. Secrets: create a Kubernetes Secret for your Hugging Face token, e.g.:
+
+```bash
+kubectl -n <namespace> create secret generic hf-token \
+  --from-literal=HUGGING_FACE_HUB_TOKEN='<your-hf-token>'
+```
+
+(The provided manifests use the `default` namespace; adjust `-n <namespace>` accordingly.)
+Then update the `k8.yaml` container env to use `valueFrom.secretKeyRef` instead of a plain `value`.
+
+3. Cluster requirements
+
+- GPU-enabled nodes and drivers (matching the image / CUDA version).
+- If you use Run:AI or another custom scheduler, set `schedulerName` in the Pod
+  spec so it matches your cluster (the provided example does not set one, so
+  the default scheduler is used).
+
+Apply the example
+
+```bash
+# make any replacements (image, secret references), then:
+kubectl apply -f vllm_configs/k8.yaml
+
+# monitor rollout
+kubectl -n <namespace> rollout status deployment/infer-phi4-multimodal-instruct
+kubectl -n <namespace> get pods -l app=infer-phi4-multimodal-instruct
+```
+
+Accessing the service
+
+- The `Service` in `k8.yaml` is `NodePort`. To see which node ports your cluster assigned,
+  run:
+
+```bash
+kubectl -n <namespace> get svc infer-phi4-multimodal-instruct-service -o wide
+```
+
+- You can then use `http://<node-ip>:<node-port>` for the port you want (8000..8007 map to
+  cluster node ports); see the example below. For production, consider exposing
+  via `LoadBalancer` or an Ingress.
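+
+A quick check against one of the exposed ports (the node IP and assigned node
+port are placeholders):
+
+```bash
+curl http://<node-ip>:<node-port>/v1/models
+```
+
+Note that the containers set `--served-model-name infer-phi4-multimodal-instruct`,
+so request bodies against these endpoints should use that model name.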
+
+Troubleshooting:
+- Check container logs: `kubectl -n <namespace> logs <pod-name> -c deployment0` (replace the container name as needed).
+- If the model fails to load: check `HUGGING_FACE_HUB_TOKEN`, image CUDA/PyTorch compatibility, and
+  that `--trust_remote_code` is set only when you trust the model repo.
+
diff --git a/vllm_configs/k8.yaml b/vllm_configs/k8.yaml
new file mode 100644
index 0000000..a55abb2
--- /dev/null
+++ b/vllm_configs/k8.yaml
@@ -0,0 +1,516 @@
+############################################################
+# vLLM Kubernetes Deployment config
+#
+# This file defines a Kubernetes Deployment and Service to run
+# multiple vLLM containers (one per GPU/port) for serving the
+# `microsoft/Phi-4-multimodal-instruct` model using a vLLM-based
+# serving image. The Deployment creates 8 containers (deployment0
+# .. deployment7) within a single Pod template; each container
+# listens on its own port (8000..8007). The Service exposes
+# those ports as NodePort so the cluster nodes can receive traffic.
+#
+# Important notes:
+# - Replace the placeholder `HUGGING_FACE_HUB_TOKEN` values with a
+#   Kubernetes `Secret` or mount a token securely; avoid committing
+#   real tokens into source control.
+# - `replicas: 1` means one Pod; each Pod here contains multiple
+#   containers (one per device/port). If you want multiple worker
+#   Pods, increase `replicas` and ensure you have sufficient GPUs.
+# - Resource `requests`/`limits` are per-container. Adjust CPU,
+#   memory and `nvidia.com/gpu` counts to match your hardware.
+# - `--tensor-parallel-size` is currently set to `1`. If you want
+#   model parallelism across GPUs, change this and coordinate
+#   the deployment accordingly.
+# - The container `image:` fields are left empty as placeholders. Set
+#   them to an image containing the necessary audio dependencies
+#   before running.
+############################################################
+
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: infer-phi4-multimodal-instruct
+  namespace: default
+spec:
+  replicas: 1  # Number of Pod replicas. Each Pod contains multiple containers (one per port/GPU).
+  selector:
+    matchLabels:
+      app: infer-phi4-multimodal-instruct
+  template:
+    metadata:
+      annotations:
+        sidecar.istio.io/inject: "false"
+      labels:
+        app: infer-phi4-multimodal-instruct
+    spec:
+      volumes:
+        - name: dshm
+          emptyDir:
+            medium: Memory
+
+      containers:
+        # The Pod runs multiple vLLM containers (deployment0..deployment7).
+        # Each container runs the same image with a different `--port` and
+        # container name. This pattern allows exposing multiple ports from
+        # a single Pod (useful when packing multiple GPUs on one node).
+        - name: deployment0
+          image:
+          args:
+            [
+              "--model",
+              "microsoft/Phi-4-multimodal-instruct",
+              "--served-model-name",
+              "infer-phi4-multimodal-instruct",
+              "--dtype",
+              "auto",
+              "--disable-log-requests",
+              "--port",
+              "8000",
+              "--tokenizer-mode",
+              "auto",
+              "--max-model-len",
+              "32768",
+              "--tensor-parallel-size",
+              "1",
+              "--gpu-memory-utilization",
+              "0.9",
+              "--trust_remote_code"
+            ]
+          readinessProbe:
+            httpGet:
+              path: /health
+              port: 8000
+            initialDelaySeconds: 5
+            periodSeconds: 60
+          ports:
+            - containerPort: 8000
+              name: vllm-port-0
+          env:
+            - name: HUGGING_FACE_HUB_TOKEN
+              # IMPORTANT: Replace this with a reference to a Kubernetes Secret
+              # (e.g. mount a secret or use `valueFrom.secretKeyRef`) rather
+              # than committing tokens into source control.
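+              # A minimal sketch of the Secret-based alternative (assumes the
+              # `hf-token` Secret from the README exists in this namespace):
+              #   valueFrom:
+              #     secretKeyRef:
+              #       name: hf-token
+              #       key: HUGGING_FACE_HUB_TOKEN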
"microsoft/Phi-4-multimodal-instruct", + "--served-model-name", + "infer-phi4-multimodal-instruct", + "--dtype", + "auto", + "--disable-log-requests", + "--port", + "8005", + "--tokenizer-mode", + "auto", + "--max-model-len", + "32768", + "--tensor-parallel-size", + "1", + "--gpu-memory-utilization", + "0.9", + "--trust_remote_code" + ] + readinessProbe: + httpGet: + path: /health + port: 8005 + initialDelaySeconds: 5 + periodSeconds: 60 + ports: + - containerPort: 8005 + name: vllm-port-5 + env: + - name: HUGGING_FACE_HUB_TOKEN + value: "" + resources: + requests: + cpu: "8" + memory: "64Gi" + nvidia.com/gpu: "1" + limits: + cpu: "8" + memory: "64Gi" + nvidia.com/gpu: "1" + volumeMounts: + - name: dshm + mountPath: /dev/shm + + + + - name: deployment6 + image: + args: + [ + "--model", + "microsoft/Phi-4-multimodal-instruct", + "--served-model-name", + "infer-phi4-multimodal-instruct", + "--dtype", + "auto", + "--disable-log-requests", + "--port", + "8006", + "--tokenizer-mode", + "auto", + "--max-model-len", + "32768", + "--tensor-parallel-size", + "1", + "--gpu-memory-utilization", + "0.9", + "--trust_remote_code" + ] + readinessProbe: + httpGet: + path: /health + port: 8006 + initialDelaySeconds: 5 + periodSeconds: 60 + ports: + - containerPort: 8006 + name: vllm-port-6 + env: + - name: HUGGING_FACE_HUB_TOKEN + value: "" + resources: + requests: + cpu: "8" + memory: "64Gi" + nvidia.com/gpu: "1" + limits: + cpu: "8" + memory: "64Gi" + nvidia.com/gpu: "1" + volumeMounts: + - name: dshm + mountPath: /dev/shm + + + - name: deployment7 + image: + args: + [ + "--model", + "microsoft/Phi-4-multimodal-instruct", + "--served-model-name", + "infer-phi4-multimodal-instruct", + "--dtype", + "auto", + "--disable-log-requests", + "--port", + "8007", + "--tokenizer-mode", + "auto", + "--max-model-len", + "32768", + "--tensor-parallel-size", + "1", + "--gpu-memory-utilization", + "0.9", + "--trust_remote_code" + ] + readinessProbe: + httpGet: + path: /health + port: 8007 + initialDelaySeconds: 5 + periodSeconds: 60 + ports: + - containerPort: 8007 + name: vllm-port-7 + env: + - name: HUGGING_FACE_HUB_TOKEN + value: "" + resources: + requests: + cpu: "8" + memory: "64Gi" + nvidia.com/gpu: "1" + limits: + cpu: "8" + memory: "64Gi" + nvidia.com/gpu: "1" + volumeMounts: + - name: dshm + mountPath: /dev/shm + + + + affinity: + podAntiAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + - labelSelector: + matchExpressions: + - key: app + operator: In + values: + - infer-phi4-multimodal-instruct + # Prevent multiple Pods of this app from being scheduled on the + # same node where possible (spreads pods across nodes). 
+              topologyKey: kubernetes.io/hostname
+
+---
+
+apiVersion: v1
+kind: Service
+metadata:
+  name: infer-phi4-multimodal-instruct-service
+  namespace: default
+spec:
+  type: NodePort
+  ports:
+    - name: http-infer-phi4-multimodal-instruct-0
+      port: 8000
+      protocol: TCP
+      targetPort: 8000
+    - name: http-infer-phi4-multimodal-instruct-1
+      port: 8001
+      protocol: TCP
+      targetPort: 8001
+    - name: http-infer-phi4-multimodal-instruct-2
+      port: 8002
+      protocol: TCP
+      targetPort: 8002
+    - name: http-infer-phi4-multimodal-instruct-3
+      port: 8003
+      protocol: TCP
+      targetPort: 8003
+    - name: http-infer-phi4-multimodal-instruct-4
+      port: 8004
+      protocol: TCP
+      targetPort: 8004
+    - name: http-infer-phi4-multimodal-instruct-5
+      port: 8005
+      protocol: TCP
+      targetPort: 8005
+    - name: http-infer-phi4-multimodal-instruct-6
+      port: 8006
+      protocol: TCP
+      targetPort: 8006
+    - name: http-infer-phi4-multimodal-instruct-7
+      port: 8007
+      protocol: TCP
+      targetPort: 8007
+  selector:
+    app: infer-phi4-multimodal-instruct
\ No newline at end of file