Add vLLM guide #2167
base: master
Conversation
Summary of Changes
Hello @Bond099, I'm Gemini Code Assist[^1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a new guide focused on serving Gemma models from KerasHub using vLLM. The guide provides a step-by-step process for exporting Gemma models to the Hugging Face format and then deploying them with vLLM for high-throughput inference. It covers essential setup, model conversion, and practical considerations for running the workflow within a Google Colab environment, demonstrating how to leverage vLLM's performance benefits for large language models.
Highlights
- New Guide for vLLM Integration: This pull request introduces a new, comprehensive guide that demonstrates how to export Gemma models from KerasHub to the Hugging Face format and subsequently serve them using vLLM for optimized inference.
- KerasHub to Hugging Face Export: The guide provides detailed steps on how to convert KerasHub Gemma models into the Hugging Face `safetensors` format, which is a prerequisite for vLLM serving.
- vLLM Serving Workflow: It outlines the complete workflow for setting up and utilizing vLLM, including downloading necessary metadata, updating model index files, and performing text generation with various sampling parameters.
- Colab Environment Specifics: The guide specifically addresses practical considerations for users working in Google Colab, such as the necessity of switching between TPU and GPU runtimes for model export and serving, and the use of Google Drive for persistent storage.
- Gemma 2 Model Support: The guide focuses on the `gemma2_instruct_2b_en` preset, showcasing how to load, export, and serve this specific Gemma 2 variant, including examples for both simple and chat-templated prompts.
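For orientation, here is a minimal sketch of the end-to-end flow the summary describes. It assumes the `gemma2_instruct_2b_en` preset, a local `./gemma_exported` directory, and vLLM's offline Python API; the guide itself adds metadata downloads and Colab runtime switches, so treat this as an outline rather than the guide's exact code.

```python
import keras_hub
from vllm import LLM, SamplingParams

# Load the instruction-tuned Gemma 2 preset from KerasHub.
gemma_lm = keras_hub.models.GemmaCausalLM.from_preset("gemma2_instruct_2b_en")

# Export the weights and tokenizer to the Hugging Face safetensors layout.
export_path = "./gemma_exported"  # assumed local directory
gemma_lm.export_to_transformers(export_path)

# Serve the exported checkpoint with vLLM and run a quick generation.
llm = LLM(model=export_path)
outputs = llm.generate(
    ["What is the capital of France?"],
    SamplingParams(temperature=0.7, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```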
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.
You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes

[^1]: Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Code Review
This pull request adds a new guide for serving Gemma models with vLLM. The guide is well-structured and covers the process from exporting a KerasHub model to running inference with vLLM. My feedback includes suggestions to improve code maintainability and clarity by removing redundant code and variables, and by refactoring repeated logic into a helper function.
```python
# Set export path
export_path = "./gemma_exported"

# Export to Hugging Face format
gemma_lm.export_to_transformers(export_path)
print(f"Model exported successfully to {export_path}")

"""
## Downloading Additional Metadata

vLLM requires complete Hugging Face model configuration files. Download these from the original Gemma repository on Hugging Face.
"""

SERVABLE_CKPT_DIR = "./gemma_exported"
```
The variable `export_path` is defined and used, but then `SERVABLE_CKPT_DIR` is defined with the same value for the same purpose. To improve clarity and avoid redundancy, you can define `SERVABLE_CKPT_DIR` once at the beginning and use it throughout the script for the local export path.
Suggested change:

```python
# Set export path
SERVABLE_CKPT_DIR = "./gemma_exported"

# Export to Hugging Face format
gemma_lm.export_to_transformers(SERVABLE_CKPT_DIR)
print(f"Model exported successfully to {SERVABLE_CKPT_DIR}")

"""
## Downloading Additional Metadata

vLLM requires complete Hugging Face model configuration files. Download these from the original Gemma repository on Hugging Face.
"""
```
+1
```python
# Verify the directory contents
print("Directory contents:")
for file in os.listdir(SERVABLE_CKPT_DIR):
    size = os.path.getsize(os.path.join(SERVABLE_CKPT_DIR, file)) / (1024 * 1024)
    print(f"{file}: {size:.2f} MB")
```
This block of code for verifying directory contents is repeated three times in the script (here, at lines 121-124, and 136-139). To follow the DRY (Don't Repeat Yourself) principle and improve maintainability, you could refactor this logic into a helper function. You can define a function like `print_directory_contents` and call it where needed.

For example:

```python
def print_directory_contents(directory):
    print(f"Directory contents for {directory}:")
    for file in os.listdir(directory):
        size = os.path.getsize(os.path.join(directory, file)) / (1024 * 1024)
        print(f"{file}: {size:.2f} MB")
```

Then you can replace this block and the other occurrences with a call to this function, e.g., `print_directory_contents(SERVABLE_CKPT_DIR)`.
```python
Disconnect TPU runtime (if applicable) and re-connect with a T4 GPU runtime before proceeding.
"""

from google.colab import drive
```
Thanks for the PR! Left some comments, the big ones being 2x mem, switching from TPU --> GPU midway, and mounting gDrive (can we do without it?)
@@ -0,0 +1,216 @@

```python
"""
Title: Serving Gemma with vLLM
Author: Dhiraj
```
Good to add your GitHub handle here!
```python
Date created: 2025/08/16
Last modified: 2025/08/18
Description: Export Gemma models from KerasHub to Hugging Face and serve with vLLM for fast inference.
Accelerator: TPU and GPU
```
Does vLLM work with TPUs as of now? I thought it was only for GPUs. In any case, let's just put one of the two here. Not both.
""" | ||
## Introduction | ||
|
||
This guide demonstrates how to export Gemma models from KerasHub to the Hugging Face format and serve them using vLLM for efficient, high-throughput inference. We'll walk through the process step-by-step, from loading a pre-trained Gemma model in KerasHub to running inferences with vLLM in a Google Colab environment. |
You can break long lines into multiple lines. I think the code formatter will fail if you write just one long line.
inferences --> inference
We can also use generation instead of inference in some places
```python
This guide demonstrates how to export Gemma models from KerasHub to the Hugging Face format and serve them using vLLM for efficient, high-throughput inference. We'll walk through the process step-by-step, from loading a pre-trained Gemma model in KerasHub to running inferences with vLLM in a Google Colab environment.

vLLM is an optimized serving engine for large language models that leverages techniques like PagedAttention to enable continuous batching and high GPU utilization. By exporting KerasHub models to a compatible format, you can take advantage of vLLM's performance benefits while starting from the Keras ecosystem
```
Let's add the reference paper and GitHub repo somewhere at the bottom of the introduction.
I wouldn't mind if the description of vLLM is a bit more descriptive here. What do you think?
```python
vLLM is an optimized serving engine for large language models that leverages techniques like PagedAttention to enable continuous batching and high GPU utilization. By exporting KerasHub models to a compatible format, you can take advantage of vLLM's performance benefits while starting from the Keras ecosystem

At present, this is supported only for Gemma 2 and its presets. In the future, there will be more coverage of the models in KerasHub.
```
I think we can remove this line, because we will be adding support soon.
""" | ||
|
||
"""shell | ||
!pip install -q --upgrade keras-hub huggingface-hub |
`keras-hub` is not needed here, because before running the notebook generation script, `autogen.py` runs `requirements.txt`.
""" | ||
## Loading and Exporting the Model | ||
|
||
Load a pre-trained Gemma 2 model from KerasHub using the 'gemma2_instruct_2b_en' preset. This is an instruction-tuned variant suitable for conversational tasks. |
'gemma2_instruct_2b_en' --> 'gemma2_instruct_2b_en'
```python
Load a pre-trained Gemma 2 model from KerasHub using the 'gemma2_instruct_2b_en' preset. This is an instruction-tuned variant suitable for conversational tasks.

**Note:** The export method needs to map the weights from Keras to safetensors, hence requiring double the RAM needed to load a preset. This is also the reason why we are running on a TPU instance in Colab as it offers more VRAM instead of GPU.
```
requiring double the RAM needed to load a preset --> Oh, I thought we found a way to avoid this in the export PR?
```python
# Set export path
export_path = "./gemma_exported"

# Export to Hugging Face format
gemma_lm.export_to_transformers(export_path)
print(f"Model exported successfully to {export_path}")

"""
## Downloading Additional Metadata

vLLM requires complete Hugging Face model configuration files. Download these from the original Gemma repository on Hugging Face.
"""

SERVABLE_CKPT_DIR = "./gemma_exported"
```
+1
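On the "Downloading Additional Metadata" step quoted above, one possible way to fetch those configuration files is via `huggingface_hub`. This is only a sketch: the repo id `google/gemma-2-2b-it` and the file list are assumptions, and the guide's actual download commands may differ.

```python
from huggingface_hub import hf_hub_download

SERVABLE_CKPT_DIR = "./gemma_exported"

# Assumed set of config files vLLM expects alongside the exported weights.
for fname in ["config.json", "generation_config.json", "tokenizer_config.json"]:
    hf_hub_download(
        repo_id="google/gemma-2-2b-it",  # assumed source repository
        filename=fname,
        local_dir=SERVABLE_CKPT_DIR,
    )
```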
""" | ||
## Saving to Google Drive | ||
|
||
Save the files to Google Drive. This is needed because vLLM currently [does not support TPU v2 on Colab](https://docs.vllm.ai/en/v0.5.5/getting_started/tpu-installation.html) and cannot dynamically switch the backend to CPU. Switch to a different Colab GPU instance for serving after saving. If you are using Cloud TPU or GPU from the start, you may skip this step. |
I think if we can resolve the 2x mem issue, we can potentially go ahead with just GPUs. It's a bit awkward to switch backends here, and also won't work with `autogen.py`.

Also, are you using full precision? Might want to go with `bfloat16`
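If the preset is loaded in bfloat16, the loading step could look roughly like this (a sketch only, assuming KerasHub's `dtype` argument on `from_preset`; not tested against this guide):

```python
import keras_hub

# Load the preset in bfloat16 to roughly halve the memory needed before export.
gemma_lm = keras_hub.models.GemmaCausalLM.from_preset(
    "gemma2_instruct_2b_en", dtype="bfloat16"
)
```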
We might not want to use gDrive here, best to save this inside `/content` on Colab
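For illustration, a minimal sketch of that alternative, keeping everything on Colab's local disk instead of mounting Drive (the path is an assumption):

```python
# Write the exported checkpoint to Colab's local /content directory
# so no Google Drive mount is required.
SERVABLE_CKPT_DIR = "/content/gemma_exported"
gemma_lm.export_to_transformers(SERVABLE_CKPT_DIR)
```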
No description provided.