Add vLLM guide #2167
base: master
Conversation
Summary of Changes
Hello @Bond099, I'm Gemini Code Assist[^1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a new guide focused on serving Gemma models from KerasHub using vLLM. The guide provides a step-by-step process for exporting Gemma models to the Hugging Face format and then deploying them with vLLM for high-throughput inference. It covers essential setup, model conversion, and practical considerations for running the workflow within a Google Colab environment, demonstrating how to leverage vLLM's performance benefits for large language models.
Highlights
- New Guide for vLLM Integration: This pull request introduces a new, comprehensive guide that demonstrates how to export Gemma models from KerasHub to the Hugging Face format and subsequently serve them using vLLM for optimized inference.
- KerasHub to Hugging Face Export: The guide provides detailed steps on how to convert KerasHub Gemma models into the Hugging Face `safetensors` format, which is a prerequisite for vLLM serving.
- vLLM Serving Workflow: It outlines the complete workflow for setting up and utilizing vLLM, including downloading necessary metadata, updating model index files, and performing text generation with various sampling parameters.
- Colab Environment Specifics: The guide specifically addresses practical considerations for users working in Google Colab, such as the necessity of switching between TPU and GPU runtimes for model export and serving, and the use of Google Drive for persistent storage.
- Gemma 2 Model Support: The guide focuses on the `gemma2_instruct_2b_en` preset, showcasing how to load, export, and serve this specific Gemma 2 variant, including examples for both simple and chat-templated prompts.
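For orientation, here is a minimal sketch of the end-to-end flow the summary describes. It assumes the `gemma2_instruct_2b_en` preset, a local `./gemma_exported` directory, and vLLM's offline Python API; the guide itself adds metadata downloads and Colab runtime switches, so treat this as an outline rather than the guide's exact code.

```python
import keras_hub
from vllm import LLM, SamplingParams

# Load the instruction-tuned Gemma 2 preset from KerasHub.
gemma_lm = keras_hub.models.GemmaCausalLM.from_preset("gemma2_instruct_2b_en")

# Export the weights and tokenizer to the Hugging Face safetensors layout.
export_path = "./gemma_exported"  # assumed local directory
gemma_lm.export_to_transformers(export_path)

# Serve the exported checkpoint with vLLM and run a quick generation.
llm = LLM(model=export_path)
outputs = llm.generate(
    ["What is the capital of France?"],
    SamplingParams(temperature=0.7, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```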
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.
You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes

[^1]: Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Code Review
This pull request adds a new guide for serving Gemma models with vLLM. The guide is well-structured and covers the process from exporting a KerasHub model to running inference with vLLM. My feedback includes suggestions to improve code maintainability and clarity by removing redundant code and variables, and by refactoring repeated logic into a helper function.
```python
# Set export path
export_path = "./gemma_exported"

# Export to Hugging Face format
gemma_lm.export_to_transformers(export_path)
print(f"Model exported successfully to {export_path}")

"""
## Downloading Additional Metadata

vLLM requires complete Hugging Face model configuration files. Download these from the original Gemma repository on Hugging Face.
"""

SERVABLE_CKPT_DIR = "./gemma_exported"
```
The variable `export_path` is defined and used, but then `SERVABLE_CKPT_DIR` is defined with the same value for the same purpose. To improve clarity and avoid redundancy, you can define `SERVABLE_CKPT_DIR` once at the beginning and use it throughout the script for the local export path.
Suggested change:

```python
# Set export path
SERVABLE_CKPT_DIR = "./gemma_exported"

# Export to Hugging Face format
gemma_lm.export_to_transformers(SERVABLE_CKPT_DIR)
print(f"Model exported successfully to {SERVABLE_CKPT_DIR}")

"""
## Downloading Additional Metadata

vLLM requires complete Hugging Face model configuration files. Download these from the original Gemma repository on Hugging Face.
"""
```
+1
```python
# Verify the directory contents
print("Directory contents:")
for file in os.listdir(SERVABLE_CKPT_DIR):
    size = os.path.getsize(os.path.join(SERVABLE_CKPT_DIR, file)) / (1024 * 1024)
    print(f"{file}: {size:.2f} MB")
```
This block of code for verifying directory contents is repeated three times in the script (here, at lines 121-124, and 136-139). To follow the DRY (Don't Repeat Yourself) principle and improve maintainability, you could refactor this logic into a helper function. You can define a function like `print_directory_contents` and call it where needed.

For example:

```python
def print_directory_contents(directory):
    print(f"Directory contents for {directory}:")
    for file in os.listdir(directory):
        size = os.path.getsize(os.path.join(directory, file)) / (1024 * 1024)
        print(f"{file}: {size:.2f} MB")
```

Then you can replace this block and the other occurrences with a call to this function, e.g., `print_directory_contents(SERVABLE_CKPT_DIR)`.
```python
Disconnect TPU runtime (if applicable) and re-connect with a T4 GPU runtime before proceeding.
"""

from google.colab import drive
```
Thanks for the PR! Left some comments, the big ones being 2x mem, switching from TPU --> GPU midway, and mounting gDrive (can we do without it?)
@@ -0,0 +1,216 @@

```python
"""
Title: Serving Gemma with vLLM
Author: Dhiraj
```
Good to add your GitHub handle here!
```python
Date created: 2025/08/16
Last modified: 2025/08/18
Description: Export Gemma models from KerasHub to Hugging Face and serve with vLLM for fast inference.
Accelerator: TPU and GPU
```
Does vLLM work with TPUs as of now? I thought it was only for GPUs. In any case, let's just put one of the two here. Not both.
""" | ||
## Introduction | ||
|
||
This guide demonstrates how to export Gemma models from KerasHub to the Hugging Face format and serve them using vLLM for efficient, high-throughput inference. We'll walk through the process step-by-step, from loading a pre-trained Gemma model in KerasHub to running inferences with vLLM in a Google Colab environment. |
You can break long lines into multiple lines. I think the code formatter will fail if you write just one long line.
inferences --> inference
We can also use generation instead of inference in some places
```python
This guide demonstrates how to export Gemma models from KerasHub to the Hugging Face format and serve them using vLLM for efficient, high-throughput inference. We'll walk through the process step-by-step, from loading a pre-trained Gemma model in KerasHub to running inferences with vLLM in a Google Colab environment.

vLLM is an optimized serving engine for large language models that leverages techniques like PagedAttention to enable continuous batching and high GPU utilization. By exporting KerasHub models to a compatible format, you can take advantage of vLLM's performance benefits while starting from the Keras ecosystem
```
Let's add the reference paper and GitHub repo somewhere at the bottom of the introduction.
I wouldn't mind if the description of vLLM is a bit more descriptive here. What do you think?
```python
vLLM is an optimized serving engine for large language models that leverages techniques like PagedAttention to enable continuous batching and high GPU utilization. By exporting KerasHub models to a compatible format, you can take advantage of vLLM's performance benefits while starting from the Keras ecosystem

At present, this is supported only for Gemma 2 and its presets. In the future, there will be more coverage of the models in KerasHub.
```
I think we can remove this line, because we will be adding support soon.
""" | ||
|
||
"""shell | ||
!pip install -q --upgrade keras-hub huggingface-hub |
`keras-hub` is not needed here, because before running the notebook generation script, `autogen.py` runs `requirements.txt`.
""" | ||
## Loading and Exporting the Model | ||
|
||
Load a pre-trained Gemma 2 model from KerasHub using the 'gemma2_instruct_2b_en' preset. This is an instruction-tuned variant suitable for conversational tasks. |
'gemma2_instruct_2b_en' --> 'gemma2_instruct_2b_en'
```python
Load a pre-trained Gemma 2 model from KerasHub using the 'gemma2_instruct_2b_en' preset. This is an instruction-tuned variant suitable for conversational tasks.

**Note:** The export method needs to map the weights from Keras to safetensors, hence requiring double the RAM needed to load a preset. This is also the reason why we are running on a TPU instance in Colab as it offers more VRAM instead of GPU.
```
requiring double the RAM needed to load a preset --> Oh, I thought we found a way to avoid this in the export PR?
```python
# Set export path
export_path = "./gemma_exported"

# Export to Hugging Face format
gemma_lm.export_to_transformers(export_path)
print(f"Model exported successfully to {export_path}")

"""
## Downloading Additional Metadata

vLLM requires complete Hugging Face model configuration files. Download these from the original Gemma repository on Hugging Face.
"""

SERVABLE_CKPT_DIR = "./gemma_exported"
```
+1
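On the "Downloading Additional Metadata" step quoted above, one possible way to fetch those configuration files is via `huggingface_hub`. This is only a sketch: the repo id `google/gemma-2-2b-it` and the file list are assumptions, and the guide's actual download commands may differ.

```python
from huggingface_hub import hf_hub_download

SERVABLE_CKPT_DIR = "./gemma_exported"

# Assumed set of config files vLLM expects alongside the exported weights.
for fname in ["config.json", "generation_config.json", "tokenizer_config.json"]:
    hf_hub_download(
        repo_id="google/gemma-2-2b-it",  # assumed source repository
        filename=fname,
        local_dir=SERVABLE_CKPT_DIR,
    )
```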
""" | ||
## Saving to Google Drive | ||
|
||
Save the files to Google Drive. This is needed because vLLM currently [does not support TPU v2 on Colab](https://docs.vllm.ai/en/v0.5.5/getting_started/tpu-installation.html) and cannot dynamically switch the backend to CPU. Switch to a different Colab GPU instance for serving after saving. If you are using Cloud TPU or GPU from the start, you may skip this step. |
I think if we can resolve the 2x mem issue, we can potentially go ahead with just GPUs. It's a bit awkward to switch backends here, and also won't work with `autogen.py`.

Also, are you using full precision? Might want to go with `bfloat16`
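If the preset is loaded in bfloat16, the loading step could look roughly like this (a sketch only, assuming KerasHub's `dtype` argument on `from_preset`; not tested against this guide):

```python
import keras_hub

# Load the preset in bfloat16 to roughly halve the memory needed before export.
gemma_lm = keras_hub.models.GemmaCausalLM.from_preset(
    "gemma2_instruct_2b_en", dtype="bfloat16"
)
```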
We might not want to use gDrive here, best to save this inside `/content` on Colab
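For illustration, a minimal sketch of that alternative, keeping everything on Colab's local disk instead of mounting Drive (the path is an assumption):

```python
# Write the exported checkpoint to Colab's local /content directory
# so no Google Drive mount is required.
SERVABLE_CKPT_DIR = "/content/gemma_exported"
gemma_lm.export_to_transformers(SERVABLE_CKPT_DIR)
```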
No description provided.