
Conversation

@davidbreitgand
Contributor

kind/documentation
Closes #1858

Extends documentation

…serve multiple LoRAs (many LoRAs per one model while having multiple models)
@netlify

netlify bot commented Nov 13, 2025

Deploy Preview for gateway-api-inference-extension ready!

Latest commit: dbb93bc
Latest deploy log: https://app.netlify.com/projects/gateway-api-inference-extension/deploys/692de3442273260008bce37a
Deploy Preview: https://deploy-preview-1859--gateway-api-inference-extension.netlify.app

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: davidbreitgand
Once this PR has been reviewed and has the lgtm label, please assign ahg-g for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the cncf-cla: yes and needs-ok-to-test labels on Nov 13, 2025
@k8s-ci-robot
Contributor

Hi @davidbreitgand. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the size/L label on Nov 13, 2025
@elevran
Contributor

elevran commented Nov 17, 2025

/ok-to-test

@k8s-ci-robot added the ok-to-test label and removed the needs-ok-to-test label on Nov 17, 2025
Comment on lines +69 to +93
```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: vllm-llama3-8b-instruct-lora-food-review-1 #give this HTTPRoute any name that helps you to group and track the routes
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway
  rules:
  - backendRefs:
    - group: inference.networking.k8s.io
      kind: InferencePool
      name: vllm-llama3-8b-instruct
    matches:
    - path:
        type: PathPrefix
        value: /
      headers:
      - type: Exact
        #Body-Based routing(https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md) is being used to copy the model name from the request body to the header.
        name: X-Gateway-Model-Name
        value: 'food-review-1' #this is the name of LoRA as defined in vLLM deployment
    timeouts:
      request: 300s
```
Contributor

why is this not part of the first HttpRoute llm-llama-route that maps to InferencePool vllm-llama3-8b-instruct?

Contributor Author

This could have been part of it, but, as explained in the guide, I want to show two different ways of defining routes: using matchers on a route and using independently defined routes.

Contributor

I understand the motivation here for trying to show that this functionality can be configured in more than one way.
having said that, IMO this is more confusing than helpful. from a newcomer's point of view, I think it's not clear why, for the deepseek pool, base+LoRAs are configured in one HttpRoute while for llama it's configured differently.

we need to keep in mind that these quickstart guides serve as guiding principles for newcomers on how they should deploy, and this should implicitly reflect our "recommendation".
I think we want to recommend using a single HttpRoute for a single pool; this is aligned with the new proposal aiming to simplify the LoRA mapping with the ConfigMap, so that we can have the route per pool automated through the helm chart deployment.

Contributor Author

@nirrozenbaum , my understanding is that one can use multiple HTTPRoutes to deal with the limitation on the number of matchers per single HTTPRoute.

Note that the manifest file uses two different ways of defining the routes to LoRAs: (1) adding match clauses on the same base AI model HTTPRoute, or (2) defining separate HTTPRoutes. There is no functional difference between the two methods, except for the limitation on the number of matchers per route imposed by the API Gateway.

Admittedly, the intention here can and should be made clearer. I can elaborate on this issue and suggest that, as long as the number of LoRAs for the base model is below the limit on matchers per HTTPRoute, the preferred routing configuration is a single HTTPRoute per EPP. If the number of LoRAs exceeds this limit, then one can define additional HTTPRoutes. I do not want to leave a user with the impression that she is more restricted than she actually is.
If you still think that simplicity is more important for this guide, and we only want a user to be exposed to routing configuration with a single HTTPRoute per EPP, I'll change the example accordingly.

Hope this makes sense. Please let me know.
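
For reference, method (1) above would look roughly like the following: a single HTTPRoute for the pool with one match clause per LoRA. This is a sketch only; the second LoRA name food-review-2 is an illustrative assumption and is not part of the PR's manifests.

```yaml
# Sketch: one HTTPRoute per pool, one match clause per LoRA served by that pool.
# "food-review-2" is a hypothetical second adapter name used only for illustration.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-llama-route
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway
  rules:
  - backendRefs:
    - group: inference.networking.k8s.io
      kind: InferencePool
      name: vllm-llama3-8b-instruct
    matches:
    # Each match below is ORed with the others; all of them route to the same InferencePool.
    - path:
        type: PathPrefix
        value: /
      headers:
      - type: Exact
        name: X-Gateway-Model-Name
        value: 'food-review-1'
    - path:
        type: PathPrefix
        value: /
      headers:
      - type: Exact
        name: X-Gateway-Model-Name
        value: 'food-review-2'
    timeouts:
      request: 300s
```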

- --max-loras
- "2"
- --lora-modules
- '{"name": "food-review"}'
Contributor

maybe we can give the lora adapters completely different names to avoid confusion?
(the original deployment has food-review-1).

Contributor Author

Done

Contributor

we still have food-review-1 in llama pool and food-review in deepseek pool..
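
For illustration, a distinctly named adapter in the deepseek deployment args might look like this (a sketch only; "movie-critique" is a made-up adapter name, not one used in the PR):

```yaml
# Hypothetical vLLM args for the deepseek deployment; the adapter name is illustrative only.
- --max-loras
- "2"
- --lora-modules
- '{"name": "movie-critique"}'
```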

@elevran
Contributor

elevran commented Nov 19, 2025

@davidbreitgand minor addition (letting @nirrozenbaum drive the review)
please consider changing the documentation comment on the PR's description to better reflect the change from a user's perspective.

### Serving multiple LoRAs per base AI model

<div style="border: 1px solid red; padding: 10px; border-radius: 5px;">
⚠️ Known Limitation : LoRA names must be unique across the base AI models (i.e., across the backend inference server deployments)
Collaborator

"Known limitation" almost implies it's wrong in some way... can we just drop the limitation part?

Contributor Author

If rephrased as "requirement", would this be Ok? I'd rather be explicit about this than let a user try and fail. Thoughts?

Contributor Author

Rephrased to avoid a negative connotation.

Contributor

This still appears as "known limitation".

<div style="border: 1px solid red; padding: 10px; border-radius: 5px;">
⚠️ Known Limitation :
[Kubernetes API Gateway limits the total number of matchers per HTTPRoute to be less than 128](https://github.com/kubernetes-sigs/gateway-api/blob/df8c96c254e1ac6d5f5e0d70617f36143723d479/apis/v1/httproute_types.go#L128).
Collaborator

Contributor Author

Fixed

Contributor

this looks really strange in the UI:
image

additionally, I would avoid documenting the Gateway API limitation so specifically.
I would phrase this as something like:

in case the number of rules matchers reached its maximum in the HttpRoute CR, one could use multiple/separate HttpRoute objects to map from model name to a pool.

not necessarily saying this exact text, but trying to keep the spirit of it, where we don't specify the exact number of max rules, because these things may change over time (in Gateway API) and we don't want to update our docs whenever there's a change in Gateway API.
we should specify that there is some max number, and if that number is reached one could use more than one HttpRoute.

```
2. Send a few requests to the LoRA of the Llama model as follows:
```bash
Collaborator

formatting is strange here in the preview also.

Contributor Author

fixed

Contributor

this still looks bad formatting in the UI:
image

}'
```
2. Send a few requests to the LoRA of the Llama model as follows:
Collaborator

suggest just using 1. for all ordered list entries

Contributor Author

Done

Contributor

still showing as 2,3, ..
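
For context, "sending a few requests to the LoRA of the Llama model" amounts to something like the following (a minimal sketch; the IP and PORT variables, the endpoint, and the prompt text follow the getting-started guide's conventions and are assumptions here, while food-review-1 is the LoRA name used for the Llama pool in this PR):

```bash
# IP and PORT are assumed to point at the inference gateway, as in the getting-started guide.
curl -i ${IP}:${PORT}/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "food-review-1",
    "prompt": "Write as if you were a critic: San Francisco",
    "max_tokens": 100,
    "temperature": 0
  }'
```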

@davidbreitgand
Contributor Author

@nirrozenbaum, @kfswain Thanks for your feedback. I pushed a commit to address all the issues. Please review.

value: /
headers:
- type: Exact
#Body-Based routing(https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md) is being used to copy the model name from the request body to the header.
Contributor

@nirrozenbaum Dec 2, 2025

can we remove these documentation comments?
looking at the HttpRoute, we don't really care where the header comes from (e.g., for testing purposes, a user may add this header manually).

Suggested change
#Body-Based routing(https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md) is being used to copy the model name from the request body to the header.

value: /
headers:
- type: Exact
#Body-Based routing(https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md) is being used to copy the model name from the request body to the header.
Contributor

ditto

Suggested change
#Body-Based routing(https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md) is being used to copy the model name from the request body to the header.

value: /
headers:
- type: Exact
#Body-Based routing(https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md) is being used to copy the model name from the request body to the header.
Contributor

ditto

Suggested change
#Body-Based routing(https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md) is being used to copy the model name from the request body to the header.

Comment on lines +7 to +8
In addition, for each base AI model multiple [Low Rank Adaptaions (LoRAs)](https://www.ibm.com/think/topics/lora) can be defined. LoRAs defined for the same base AI model are served from the same backend inference server that serves the base model. A LoRA name is specified as the Model name in the body of LLM prompt requests. LoRA naming is not standardised. Therefore, it cannot be expected that the base model name can be inferred from the LoRA name.

Contributor

@nirrozenbaum Dec 2, 2025

I think it would be better to focus on the "yes" rather than the "no".
in this guide we present one way of mapping from Base/LoRA name to pool. there are other ways (like the one you originally proposed) where the base model can be inferred.

I think we should remove the part that says LoRA naming is not standardized.
This could be more confusing than helpful (for a newcomer).

Suggested change
In addition, for each base AI model multiple [Low Rank Adaptaions (LoRAs)](https://www.ibm.com/think/topics/lora) can be defined. LoRAs defined for the same base AI model are served from the same backend inference server that serves the base model. A LoRA name is specified as the Model name in the body of LLM prompt requests. LoRA naming is not standardised. Therefore, it cannot be expected that the base model name can be inferred from the LoRA name.
In addition, for each base AI model multiple [Low Rank Adaptaions (LoRAs)](https://www.ibm.com/think/topics/lora) can be defined. LoRAs defined for the same base AI model are served from the same backend inference server that serves the base model. A LoRA name is specified as the Model name in the body of LLM prompt requests.


## How

The following diagram illustrates how an Inference Gateway routes requests to different models based on the model name.
Contributor

which diagram?
(not added by your PR, but caught in the review)

This guide assumes you have already setup the cluster for basic model serving as described in the [`Getting started`](index.md) guide and this guide describes the additional steps needed from that point onwards in order to deploy and exercise an example of routing across multiple models and multiple LoRAs with many to one relationship of LoRAs to the base model.


### Deploy Body-Based Routing Extension
Contributor

@nirrozenbaum Dec 2, 2025

can we update the version from v1.0.0 to v0?
ideally, we should end up like the quickstart guide, having a guide for current main and a guide for the latest stable release.
until we get there, it would be good to at least be consistent with one of these options. using v0 means following main.
this would also require updating the docker registry to the staging registry; see the example in the quickstart latest-main guide.


### Deploy the 2nd InferencePool and Endpoint Picker Extension
We also want to use an InferencePool and EndPoint Picker for this second model in addition to the Body Based Router in order to be able to schedule across multiple endpoints or LORA adapters within each base model. Hence we create these for our second model as follows.
We also want to use an InferencePool and EndPoint Picker for this second model in addition to the Body Based Router in order to be able to schedule across multiple endpoints.
Contributor

same comment about the version here (the guide is still using v1.0.0)

helm install vllm-deepseek-r1 \
--set inferencePool.modelServers.matchLabels.app=vllm-deepseek-r1 \
--set provider.name=$GATEWAY_PROVIDER \
--version $IGW_CHART_VERSION \
Contributor

where is IGW_CHART_VERSION defined?
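
Presumably this would be defined before running helm install, along the lines of the following (a sketch; the exact value is an assumption and depends on whether the guide targets main or the latest stable release, per the version discussion above):

```bash
# Hypothetical definition of the chart version used by the helm install commands in the guide.
export IGW_CHART_VERSION=v0
```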

timeouts:
request: 300s
---
apiVersion: gateway.networking.k8s.io/v1
Contributor

these resources should be aligned (a perfect match) with whatever we have in GitHub.
so if you're making changes to the yaml files (e.g., merging the HTTP routes into a single route for the llama pool), that should be reflected here as well.

Contributor

@nirrozenbaum left a comment

@davidbreitgand thanks for the PR.
I left several comments.
additionally, I think the guide ended up much more complicated than I would expect. for example, if a newcomer wants to follow this guide, the instructions are to deploy a first pool, then a second pool with GPUs, then a third pool based on the simulator.

I like the fact that you used a simulator-based pool, which is more user friendly (not all newcomers have GPUs handy). I think the simulator-based pool should REPLACE the GPU phi pool, so that the guide ends up with 2 pools.

@davidbreitgand
Contributor Author

Closing this PR because it contains unintended commits. A new clean PR will follow and will reference this one.


Labels

cncf-cla: yes · ok-to-test · size/L

Development

Successfully merging this pull request may close these issues: Update the serving multiple AI models guide with multiple LoRA example

5 participants