Extending the guide with multiple LoRAs serving with BBR #1859
Conversation
…serve multiple LoRAs (many LoRAs per one model while having multiple models)
✅ Deploy Preview for gateway-api-inference-extension ready!
[APPROVALNOTIFIER] This PR is NOT APPROVED.
This pull-request has been approved by: davidbreitgand
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Hi @davidbreitgand. Thanks for your PR. I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/ok-to-test
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: vllm-llama3-8b-instruct-lora-food-review-1 # give this HTTPRoute any name that helps you to group and track the routes
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway
  rules:
  - backendRefs:
    - group: inference.networking.k8s.io
      kind: InferencePool
      name: vllm-llama3-8b-instruct
    matches:
    - path:
        type: PathPrefix
        value: /
      headers:
      - type: Exact
        # Body-Based routing (https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md) is being used to copy the model name from the request body to the header.
        name: X-Gateway-Model-Name
        value: 'food-review-1' # this is the name of LoRA as defined in vLLM deployment
    timeouts:
      request: 300s
Why is this not part of the first HTTPRoute llm-llama-route that maps to InferencePool vllm-llama3-8b-instruct?
This could have been part of it, but, as explained in the guide, I want to show two different ways of defining routes: using matchers on a route and using independently defined routes.
I understand the motivation here for trying to show that this functionality can be configured in more than one way.
Having said that, IMO this is more confusing than helpful. From a newcomer's point of view, I think it's not clear why base+LoRAs are configured in one HTTPRoute for the deepseek pool, while for llama it's configured differently.
We need to keep in mind that these quickstart guides serve as guiding principles for newcomers on how they should deploy, and this should implicitly reflect our "recommendation".
I think we want to recommend using a single HTTPRoute for a single pool. This is aligned with the new proposal, which aims to simplify the LoRA mapping with the ConfigMap so that the route per pool can be automated through the Helm chart deployment.
@nirrozenbaum, my understanding is that one can use multiple HTTPRoutes to deal with the limitation on the number of matchers per single HTTPRoute.
Note that the manifest file uses two different ways of defining the routes to LoRAs: (1) by adding match clauses on the base AI model's HTTPRoute, or (2) by defining separate HTTPRoutes. There is no functional difference between the two methods, except for the limitation on the number of matchers per route imposed by the API Gateway.
Admittedly, the intention here can and should be made clearer. I can elaborate on this and suggest that, as long as the number of LoRAs for the base model is below the limit on matchers per HTTPRoute, the preferred routing configuration is a single HTTPRoute per EPP. If the number of LoRAs exceeds this limit, one can define additional HTTPRoutes. I do not want to leave a user with the impression that she is more restricted than she actually is.
If you still think that simplicity is more important for this guide, and we only want a user to be exposed to routing configuration with a single HTTPRoute per EPP, I'll change the example accordingly.
Hope this makes sense. Please let me know.
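For illustration, the single-HTTPRoute-per-pool alternative discussed above might look roughly like the sketch below. This is not taken from the PR: the base model header value and the second LoRA name (food-review-2) are assumptions added purely to show several matchers on one route.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-llama-route
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway
  rules:
  - backendRefs:
    - group: inference.networking.k8s.io
      kind: InferencePool
      name: vllm-llama3-8b-instruct
    matches:
    # One matcher for the base model and one per LoRA served by this pool.
    # BBR copies the model name from the request body into X-Gateway-Model-Name.
    - headers:
      - type: Exact
        name: X-Gateway-Model-Name
        value: meta-llama/Llama-3.1-8B-Instruct   # assumed base model name
    - headers:
      - type: Exact
        name: X-Gateway-Model-Name
        value: food-review-1
    - headers:
      - type: Exact
        name: X-Gateway-Model-Name
        value: food-review-2                      # hypothetical second LoRA
    timeouts:
      request: 300s
```

Separate HTTPRoutes (as in the manifest quoted above) become necessary only once the number of matchers on a single route approaches the Gateway API maximum.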
- --max-loras
- "2"
- --lora-modules
- '{"name": "food-review"}'
Maybe we can give the LoRA adapters completely different names to avoid confusion?
(The original deployment has food-review-1.)
Done
We still have food-review-1 in the llama pool and food-review in the deepseek pool.
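To make the naming suggestion concrete, one option would be to give the deepseek pool's adapter a name that cannot be confused with the llama pool's adapter, along the lines of the sketch below; the name movie-critique is purely illustrative and not taken from the PR.

```yaml
# Illustrative only: keep adapter names clearly distinct across pools
# (the llama pool already serves food-review-1).
- --max-loras
- "2"
- --lora-modules
- '{"name": "movie-critique"}'   # hypothetical name, distinct from food-review-1
```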
@davidbreitgand minor addition (letting @nirrozenbaum drive the review)
### Serving multiple LoRAs per base AI model

<div style="border: 1px solid red; padding: 10px; border-radius: 5px;">
⚠️ Known Limitation : LoRA names must be unique across the base AI models (i.e., across the backend inference server deployments)
"Known Limitation" almost implies it's wrong in some way... can we just drop the limitation part?
If rephrased as "requirement", would this be Ok? I'd rather be explicit about this than let a user try and fail. Thoughts?
Rephrased to avoid a negative connotation.
This still appears as "Known Limitation".
<div style="border: 1px solid red; padding: 10px; border-radius: 5px;">
⚠️ Known Limitation :
[Kubernetes API Gateway limits the total number of matchers per HTTPRoute to be less than 128](https://github.com/kubernetes-sigs/gateway-api/blob/df8c96c254e1ac6d5f5e0d70617f36143723d479/apis/v1/httproute_types.go#L128).
This link isn't working in the preview:
https://deploy-preview-1859--gateway-api-inference-extension.netlify.app/guides/serve-multiple-genai-models/
Fixed
This looks really strange in the UI (screenshot omitted).
Additionally, I would avoid documenting the Gateway API limitation so specifically. I would phrase it as something like:
"In case the number of rule matchers has reached its maximum in the HTTPRoute CR, one could use multiple/separate HTTPRoute objects to map from model name to a pool."
Not necessarily this exact text, but keeping its spirit: we shouldn't specify the exact maximum number of rules, because these things may change over time (in Gateway API) and we don't want to update our docs whenever there's a change in Gateway API.
We should state that there is some maximum, and that if it is reached one can use more than one HTTPRoute.
```
2. Send a few requests to the LoRA of the Llama model as follows:
```bash
formatting is strange here in the preview also.
fixed
}'
```
2. Send a few requests to the LoRA of the Llama model as follows:
suggest just using 1. for all ordered list entries
Done
Still showing as 2, 3, ...
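For reference, the step under discussion sends completion requests to the LoRA by name through the gateway. A minimal sketch of such a request is shown below; it is illustrative only, assuming IP and PORT point at the Inference Gateway and food-review-1 is the llama pool's LoRA name used elsewhere in this PR.

```bash
# Hypothetical example request; adjust IP/PORT to your gateway address.
curl -i http://${IP}:${PORT}/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "food-review-1",
        "prompt": "Write as if you were a critic: San Francisco",
        "max_tokens": 100
      }'
```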
Co-authored-by: Nir Rozenbaum <[email protected]>
@nirrozenbaum, @kfswain Thanks for your feedback. I pushed a commit to address all the issues. Please review.
        value: /
      headers:
      - type: Exact
        # Body-Based routing (https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md) is being used to copy the model name from the request body to the header.
Can we remove these documentation comments?
Looking at the HTTPRoute, we don't really care where the header comes from (e.g., for testing purposes, a user may add this header manually).
# Body-Based routing (https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md) is being used to copy the model name from the request body to the header.
        value: /
      headers:
      - type: Exact
        # Body-Based routing (https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md) is being used to copy the model name from the request body to the header.
ditto
# Body-Based routing (https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md) is being used to copy the model name from the request body to the header.
        value: /
      headers:
      - type: Exact
        # Body-Based routing (https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md) is being used to copy the model name from the request body to the header.
ditto
# Body-Based routing (https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md) is being used to copy the model name from the request body to the header.
In addition, for each base AI model multiple [Low Rank Adaptations (LoRAs)](https://www.ibm.com/think/topics/lora) can be defined. LoRAs defined for the same base AI model are served from the same backend inference server that serves the base model. A LoRA name is specified as the Model name in the body of LLM prompt requests. LoRA naming is not standardised. Therefore, it cannot be expected that the base model name can be inferred from the LoRA name.
I think it would be better to focus on the "yes" rather than the "no".
In this guide we present one way of mapping from a base/LoRA name to a pool. There are other ways (like the one you originally proposed) where the base model can be inferred.
I think we should remove the part that says LoRA naming is not standardized.
This could be more confusing than helpful (for a newcomer).
Suggested change (dropping the last two sentences):
In addition, for each base AI model multiple [Low Rank Adaptations (LoRAs)](https://www.ibm.com/think/topics/lora) can be defined. LoRAs defined for the same base AI model are served from the same backend inference server that serves the base model. A LoRA name is specified as the Model name in the body of LLM prompt requests.
## How

The following diagram illustrates how an Inference Gateway routes requests to different models based on the model name.
which diagram?
(not added by your PR, but caught in the review)
This guide assumes you have already set up the cluster for basic model serving as described in the [`Getting started`](index.md) guide. It describes the additional steps needed from that point onwards to deploy and exercise an example of routing across multiple models and multiple LoRAs, with a many-to-one relationship of LoRAs to the base model.

### Deploy Body-Based Routing Extension
Can we update the version from v1.0.0 to v0?
Ideally, we should end up like the quickstart guide, having one guide for current main and one for the latest stable release.
Until we get there, it would be good to at least be consistent with one of these options. Using v0 follows main.
This would also require updating the docker registry to the staging registry; see the example from the quickstart latest-main guide.
### Deploy the 2nd InferencePool and Endpoint Picker Extension
We also want to use an InferencePool and EndPoint Picker for this second model in addition to the Body Based Router in order to be able to schedule across multiple endpoints or LORA adapters within each base model. Hence we create these for our second model as follows.
We also want to use an InferencePool and EndPoint Picker for this second model in addition to the Body Based Router in order to be able to schedule across multiple endpoints.
same comment about the version here (the guide is still using v1.0.0)
helm install vllm-deepseek-r1 \
  --set inferencePool.modelServers.matchLabels.app=vllm-deepseek-r1 \
  --set provider.name=$GATEWAY_PROVIDER \
  --version $IGW_CHART_VERSION \
where is IGW_CHART_VERSION defined?
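For completeness, variables like this would typically be exported before running the helm command quoted above. The sketch below is hypothetical; the values are assumptions (per the earlier review comments the guide may target v0 from main or a release tag such as v1.0.0), not something defined in the PR.

```bash
# Hypothetical: values are assumptions, not taken from the PR.
export GATEWAY_PROVIDER=istio     # or whichever gateway provider the guide targets
export IGW_CHART_VERSION=v0       # v0 (main) or a release tag such as v1.0.0
```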
    timeouts:
      request: 300s
---
apiVersion: gateway.networking.k8s.io/v1
These resources should be aligned (a perfect match) with whatever we have in GitHub.
So if you're making changes in the YAML files (e.g., merging the HTTP routes into a single route for the llama pool), that should be reflected here as well.
nirrozenbaum left a comment
@davidbreitgand thanks for the PR.
I left several comments.
Additionally, I think the guide ended up much more complicated than I would expect. For example, if a newcomer wants to follow this guide, the instructions are to deploy a first pool, then a second pool with GPUs, then a third pool based on the simulator.
I like the fact that you used a simulator-based pool, which is more user friendly (not all newcomers have GPUs handy). I think the simulator-based pool should REPLACE the GPU phi pool, so that the guide has two pools.
Closing this PR because it contains unintended commits. A new clean PR will follow and will reference this one.

kind/documentation
Closes #1858