Extending the guide with multiple LoRAs serving with BBR #1859
Conversation
…serve multiple LoRAs (many LoRAs per one model while having multiple models)
✅ Deploy Preview for gateway-api-inference-extension ready!
[APPROVALNOTIFIER] This PR is NOT APPROVED.
This pull-request has been approved by: davidbreitgand
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Hi @davidbreitgand. Thanks for your PR. I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/ok-to-test
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: vllm-llama3-8b-instruct-lora-food-review-1 # give this HTTPRoute any name that helps you to group and track the routes
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway
  rules:
  - backendRefs:
    - group: inference.networking.k8s.io
      kind: InferencePool
      name: vllm-llama3-8b-instruct
    matches:
    - path:
        type: PathPrefix
        value: /
      headers:
      - type: Exact
        # Body-Based routing (https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md) is being used to copy the model name from the request body to the header.
        name: X-Gateway-Model-Name
        value: 'food-review-1' # this is the name of LoRA as defined in vLLM deployment
    timeouts:
      request: 300s
Why is this not part of the first HTTPRoute llm-llama-route that maps to InferencePool vllm-llama3-8b-instruct?
This could have been part of it, but, as explained in the guide, I want to show two different ways of defining routes: using matchers on a route and using independently defined routes.
I understand the motivation here for trying to show that this functionality can be configured in more than one way.
Having said that, IMO this is more confusing than helpful. From a newcomer's point of view, I think it's not clear why base+LoRAs are configured in one HTTPRoute for the deepseek pool, while for llama it's configured differently.
We need to keep in mind that these quickstart guides serve as guiding principles for newcomers on how they should deploy, and this should implicitly reflect our "recommendation".
I think we want to recommend using a single HTTPRoute for a single pool. This is aligned with the new proposal, which aims to simplify the LoRA mapping with the ConfigMap so that the route per pool can be automated through the Helm chart deployment.
@nirrozenbaum, my understanding is that one can use multiple HTTPRoutes to deal with the limitation on the number of matchers per single HTTPRoute.
Note that the manifest file uses two different ways of defining the routes to LoRAs: (1) by adding match clauses on the base AI model's HTTPRoute, or (2) by defining separate HTTPRoutes. There is no functional difference between the two methods, except for the limitation on the number of matchers per route imposed by the API Gateway.
Admittedly, the intention here can and should be made clearer. I can elaborate on this and suggest that, as long as the number of LoRAs for the base model is below the limit on matchers per HTTPRoute, the preferred routing configuration is a single HTTPRoute per EPP. If the number of LoRAs exceeds this limit, one can define additional HTTPRoutes. I do not want to leave a user with the impression that she is more restricted than she actually is.
If you still think that simplicity is more important for this guide, and we only want a user to be exposed to routing configuration with a single HTTPRoute per EPP, I'll change the example accordingly.
Hope this makes sense. Please let me know.
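For illustration, the single-HTTPRoute-per-pool alternative discussed above might look roughly like the sketch below. This is not taken from the PR: the base model header value and the second LoRA name (food-review-2) are assumptions added purely to show several matchers on one route.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-llama-route
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway
  rules:
  - backendRefs:
    - group: inference.networking.k8s.io
      kind: InferencePool
      name: vllm-llama3-8b-instruct
    matches:
    # One matcher for the base model and one per LoRA served by this pool.
    # BBR copies the model name from the request body into X-Gateway-Model-Name.
    - headers:
      - type: Exact
        name: X-Gateway-Model-Name
        value: meta-llama/Llama-3.1-8B-Instruct   # assumed base model name
    - headers:
      - type: Exact
        name: X-Gateway-Model-Name
        value: food-review-1
    - headers:
      - type: Exact
        name: X-Gateway-Model-Name
        value: food-review-2                      # hypothetical second LoRA
    timeouts:
      request: 300s
```

Separate HTTPRoutes (as in the manifest quoted above) become necessary only once the number of matchers on a single route approaches the Gateway API maximum.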
- --max-loras
- "2"
- --lora-modules
- '{"name": "food-review"}'
Maybe we can give the LoRA adapters completely different names to avoid confusion?
(The original deployment has food-review-1.)
Done
We still have food-review-1 in the llama pool and food-review in the deepseek pool.
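To make the naming suggestion concrete, one option would be to give the deepseek pool's adapter a name that cannot be confused with the llama pool's adapter, along the lines of the sketch below; the name movie-critique is purely illustrative and not taken from the PR.

```yaml
# Illustrative only: keep adapter names clearly distinct across pools
# (the llama pool already serves food-review-1).
- --max-loras
- "2"
- --lora-modules
- '{"name": "movie-critique"}'   # hypothetical name, distinct from food-review-1
```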
@davidbreitgand minor addition (letting @nirrozenbaum drive the review)
### Serving multiple LoRAs per base AI model

<div style="border: 1px solid red; padding: 10px; border-radius: 5px;">
⚠️ Known Limitation : LoRA names must be unique across the base AI models (i.e., across the backend inference server deployments)
"Known Limitation" almost implies it's wrong in some way... can we just drop the limitation part?
If rephrased as "requirement", would this be Ok? I'd rather be explicit about this than let a user try and fail. Thoughts?
Rephrased to avoid a negative connotation.
This still appears as "Known Limitation".
<div style="border: 1px solid red; padding: 10px; border-radius: 5px;">
⚠️ Known Limitation :
[Kubernetes API Gateway limits the total number of matchers per HTTPRoute to be less than 128](https://github.com/kubernetes-sigs/gateway-api/blob/df8c96c254e1ac6d5f5e0d70617f36143723d479/apis/v1/httproute_types.go#L128).
This link isn't working in the preview:
https://deploy-preview-1859--gateway-api-inference-extension.netlify.app/guides/serve-multiple-genai-models/
Fixed
This looks really strange in the UI (screenshot omitted).
Additionally, I would avoid documenting the Gateway API limitation so specifically. I would phrase it as something like:
"In case the number of rule matchers has reached its maximum in the HTTPRoute CR, one could use multiple/separate HTTPRoute objects to map from model name to a pool."
Not necessarily this exact text, but keeping its spirit: we shouldn't specify the exact maximum number of rules, because these things may change over time (in Gateway API) and we don't want to update our docs whenever there's a change in Gateway API.
We should state that there is some maximum, and that if it is reached one can use more than one HTTPRoute.
```
2. Send a few requests to the LoRA of the Llama model as follows:
```bash
formatting is strange here in the preview also.
fixed
}'
```
2. Send a few requests to the LoRA of the Llama model as follows:
suggest just using 1. for all ordered list entries
Done
Still showing as 2, 3, ...
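For reference, the step under discussion sends completion requests to the LoRA by name through the gateway. A minimal sketch of such a request is shown below; it is illustrative only, assuming IP and PORT point at the Inference Gateway and food-review-1 is the llama pool's LoRA name used elsewhere in this PR.

```bash
# Hypothetical example request; adjust IP/PORT to your gateway address.
curl -i http://${IP}:${PORT}/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "food-review-1",
        "prompt": "Write as if you were a critic: San Francisco",
        "max_tokens": 100
      }'
```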
Co-authored-by: Nir Rozenbaum <[email protected]>
@nirrozenbaum, @kfswain Thanks for your feedback. I pushed a commit to address all the issues. Please review.
        value: /
      headers:
      - type: Exact
        # Body-Based routing (https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md) is being used to copy the model name from the request body to the header.
Can we remove these documentation comments?
Looking at the HTTPRoute, we don't really care where the header comes from (e.g., for testing purposes, a user may add this header manually).
# Body-Based routing (https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md) is being used to copy the model name from the request body to the header.
        value: /
      headers:
      - type: Exact
        # Body-Based routing (https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md) is being used to copy the model name from the request body to the header.
ditto
# Body-Based routing (https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md) is being used to copy the model name from the request body to the header.
        value: /
      headers:
      - type: Exact
        # Body-Based routing (https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md) is being used to copy the model name from the request body to the header.
ditto
# Body-Based routing (https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md) is being used to copy the model name from the request body to the header.
In addition, for each base AI model multiple [Low Rank Adaptations (LoRAs)](https://www.ibm.com/think/topics/lora) can be defined. LoRAs defined for the same base AI model are served from the same backend inference server that serves the base model. A LoRA name is specified as the Model name in the body of LLM prompt requests. LoRA naming is not standardised. Therefore, it cannot be expected that the base model name can be inferred from the LoRA name.
I think it would be better to focus on the "yes" rather than the "no".
In this guide we present one way of mapping from a base/LoRA name to a pool. There are other ways (like the one you originally proposed) where the base model can be inferred.
I think we should remove the part that says LoRA naming is not standardized.
This could be more confusing than helpful (for a newcomer).
Suggested change (dropping the last two sentences):
In addition, for each base AI model multiple [Low Rank Adaptations (LoRAs)](https://www.ibm.com/think/topics/lora) can be defined. LoRAs defined for the same base AI model are served from the same backend inference server that serves the base model. A LoRA name is specified as the Model name in the body of LLM prompt requests.
## How

The following diagram illustrates how an Inference Gateway routes requests to different models based on the model name.
which diagram?
(not added by your PR, but caught in the review)
This guide assumes you have already set up the cluster for basic model serving as described in the [`Getting started`](index.md) guide. It describes the additional steps needed from that point onwards to deploy and exercise an example of routing across multiple models and multiple LoRAs, with a many-to-one relationship of LoRAs to the base model.

### Deploy Body-Based Routing Extension
Can we update the version from v1.0.0 to v0?
Ideally, we should end up like the quickstart guide, having one guide for current main and one for the latest stable release.
Until we get there, it would be good to at least be consistent with one of these options. Using v0 follows main.
This would also require updating the docker registry to the staging registry; see the example from the quickstart latest-main guide.
### Deploy the 2nd InferencePool and Endpoint Picker Extension
We also want to use an InferencePool and EndPoint Picker for this second model in addition to the Body Based Router in order to be able to schedule across multiple endpoints or LORA adapters within each base model. Hence we create these for our second model as follows.
We also want to use an InferencePool and EndPoint Picker for this second model in addition to the Body Based Router in order to be able to schedule across multiple endpoints.
same comment about the version here (the guide is still using v1.0.0)
helm install vllm-deepseek-r1 \
  --set inferencePool.modelServers.matchLabels.app=vllm-deepseek-r1 \
  --set provider.name=$GATEWAY_PROVIDER \
  --version $IGW_CHART_VERSION \
where is IGW_CHART_VERSION defined?
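For completeness, variables like this would typically be exported before running the helm command quoted above. The sketch below is hypothetical; the values are assumptions (per the earlier review comments the guide may target v0 from main or a release tag such as v1.0.0), not something defined in the PR.

```bash
# Hypothetical: values are assumptions, not taken from the PR.
export GATEWAY_PROVIDER=istio     # or whichever gateway provider the guide targets
export IGW_CHART_VERSION=v0       # v0 (main) or a release tag such as v1.0.0
```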
    timeouts:
      request: 300s
---
apiVersion: gateway.networking.k8s.io/v1
These resources should be aligned (a perfect match) with whatever we have in GitHub.
So if you're making changes in the YAML files (e.g., merging the HTTP routes into a single route for the llama pool), that should be reflected here as well.
nirrozenbaum left a comment
@davidbreitgand thanks for the PR.
I left several comments.
Additionally, I think the guide ended up much more complicated than I would expect. For example, if a newcomer wants to follow this guide, the instructions are to deploy a first pool, then a second pool with GPUs, then a third pool based on the simulator.
I like the fact that you used a simulator-based pool, which is more user friendly (not all newcomers have GPUs handy). I think the simulator-based pool should REPLACE the GPU phi pool, so that the guide has two pools.
Closing this PR because it contains unintended commits. A new clean PR will follow and will reference this one.

kind/documentation
Closes #1858