Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
91 changes: 91 additions & 0 deletions config/manifests/bbr-example/httproute_bbr_lora.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,91 @@
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
name: llm-llama-route
spec:
parentRefs:
- group: gateway.networking.k8s.io
kind: Gateway
name: inference-gateway
rules:
- backendRefs:
- group: inference.networking.k8s.io
kind: InferencePool
name: vllm-llama3-8b-instruct
matches:
- path:
type: PathPrefix
value: /
headers:
- type: Exact
name: X-Gateway-Model-Name
value: 'meta-llama/Llama-3.1-8B-Instruct'
timeouts:
request: 300s
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
name: llm-deepseek-route #give this HTTPRoute any name that helps you to group and track the matchers
spec:
parentRefs:
- group: gateway.networking.k8s.io
kind: Gateway
name: inference-gateway
rules:
- backendRefs:
- group: inference.networking.k8s.io
kind: InferencePool
name: vllm-deepseek-r1
matches:
- path:
type: PathPrefix
value: /
headers:
- type: Exact
name: X-Gateway-Model-Name
value: 'deepseek/vllm-deepseek-r1'
- path:
type: PathPrefix
value: /
headers:
- type: Exact
#Body-Based routing(https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md) is being used to copy the model name from the request body to the header.
Copy link
Contributor

@nirrozenbaum nirrozenbaum Dec 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we remove these documentation comments?
looking at the HttpRoute, we don't really care where the header comes from (e.g., for testing purpose, a user may add this header manually).

Suggested change
#Body-Based routing(https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md) is being used to copy the model name from the request body to the header.

name: X-Gateway-Model-Name
value: 'food-review'
- path:
type: PathPrefix
value: /
headers:
- type: Exact
#Body-Based routing(https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md) is being used to copy the model name from the request body to the header.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Suggested change
#Body-Based routing(https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md) is being used to copy the model name from the request body to the header.

name: X-Gateway-Model-Name
value: 'movie-critique'
timeouts:
request: 300s
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
name: vllm-llama3-8b-instruct-lora-food-review-1 #give this HTTPRoute any name that helps you to group and track the routes
spec:
parentRefs:
- group: gateway.networking.k8s.io
kind: Gateway
name: inference-gateway
rules:
- backendRefs:
- group: inference.networking.k8s.io
kind: InferencePool
name: vllm-llama3-8b-instruct
matches:
- path:
type: PathPrefix
value: /
headers:
- type: Exact
#Body-Based routing(https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md) is being used to copy the model name from the request body to the header.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Suggested change
#Body-Based routing(https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md) is being used to copy the model name from the request body to the header.

name: X-Gateway-Model-Name
value: 'food-review-1' #this is the name of LoRA as defined in vLLM deployment
timeouts:
request: 300s
Comment on lines +67 to +91
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this is not part of the first HttpRoute llm-llama-route that maps to InferencePool vllm-llama3-8b-instruct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could have been part of it, but, as explained in the guide, I want to show two different ways of defining routes: using matchers on a route and using independently defined routes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the motivation here for trying to show that this functionality can be configured in more than one way.
having said that, IMO this is more confusing than helping. from a newcomer point of view, I think it's not clear why for the deepseek pool base+LoRAs are configured in one HttpRoute Vs for llama it's configured differently.

we need to keep in mind that these quickstart guides serve as guiding principles for newcomers, how they should deploy and this should reflect implicitly our "recommendation".
I think we want to recommend using a single HttpRoute for a single pool and this is aligned with the new proposal, aiming to simplify the LoRA mapping with the ConfigMap so that we can have the route per pool automated through the helm chart deployment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nirrozenbaum , my understanding is that one can use multiple HTTPRoutes to deal with the limitation on the number of matchers per single HTTPRoute.

Note that the manifest file uses two different ways of defining the routes to LoRAs: (1) via adding match clauses on the same base AI model HTTPRoute or by (2) defining a separate HTTPRoutes. There is no functional difference between the two methods, except for the limitation on the number of matchers per route imposed by the API Gateway

Admittedly, the intention here can and should be made more clear. I can elaborate on this issue and suggest that until the number of LoRAs for the base model is below the limit on the matchers per HTTPRoute, the preferred routing configuration is via a single HTTPRoute per EPP. If the number of LoRAs exceeds this limit, then one can define additional HTTPRoutes. I do not want to leave a user with the impression that she is more restricted than she actually is.
If you still think that simplicity is more important for this guide, and we only want a user to be exposed to routing configuration with a single HTTPRoute per EPP, I'll change the example accordingly.

Hope this makes sense. Please let me know.

44 changes: 44 additions & 0 deletions config/manifests/vllm/sim-deployment-1.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,44 @@
apiVersion: apps/v1
kind: Deployment
metadata:
name: vllm-deepseek-r1
spec:
replicas: 1
selector:
matchLabels:
app: vllm-deepseek-r1
template:
metadata:
labels:
app: vllm-deepseek-r1
spec:
containers:
- name: vllm-sim
image: ghcr.io/llm-d/llm-d-inference-sim:v0.4.0
imagePullPolicy: Always
args:
- --model
- deepseek/vllm-deepseek-r1
- --port
- "8000"
- --max-loras
- "2"
- --lora-modules
- '{"name": "food-review"}'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can call the lora adapters with completely different names to avoid confusion?
(the original deployment has food-review-1).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we still have food-review-1 in llama pool and food-review in deepseek pool..

- '{"name": "movie-critique"}'
env:
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
ports:
- containerPort: 8000
name: http
protocol: TCP
resources:
requests:
cpu: 10m
Loading