
Add vLLM guide #2167


Draft · Bond099 wants to merge 1 commit into master

Conversation

@Bond099 commented Aug 18, 2025

No description provided.

@gemini-code-assist bot (Contributor) left a comment

Summary of Changes

Hello @Bond099, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new guide focused on serving Gemma models from KerasHub using vLLM. The guide provides a step-by-step process for exporting Gemma models to the Hugging Face format and then deploying them with vLLM for high-throughput inference. It covers essential setup, model conversion, and practical considerations for running the workflow within a Google Colab environment, demonstrating how to leverage vLLM's performance benefits for large language models.
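To make the serving side of that workflow concrete, here is a minimal sketch using vLLM's Python API. The ./gemma_exported path and the sampling values are assumptions for illustration, not taken from the guide under review.

# Minimal sketch, assuming the model has already been exported to
# ./gemma_exported in Hugging Face format (path is an assumption).
from vllm import LLM, SamplingParams

llm = LLM(model="./gemma_exported")
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=128)

outputs = llm.generate(["What is the capital of France?"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)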

Highlights

  • New Guide for vLLM Integration: This pull request introduces a new, comprehensive guide that demonstrates how to export Gemma models from KerasHub to the Hugging Face format and subsequently serve them using vLLM for optimized inference.
  • KerasHub to Hugging Face Export: The guide provides detailed steps on how to convert KerasHub Gemma models into the Hugging Face safetensors format, which is a prerequisite for vLLM serving.
  • vLLM Serving Workflow: It outlines the complete workflow for setting up and utilizing vLLM, including downloading necessary metadata, updating model index files, and performing text generation with various sampling parameters.
  • Colab Environment Specifics: The guide specifically addresses practical considerations for users working in Google Colab, such as the necessity of switching between TPU and GPU runtimes for model export and serving, and the use of Google Drive for persistent storage.
  • Gemma 2 Model Support: The guide focuses on the gemma2_instruct_2b_en preset, showcasing how to load, export, and serve this specific Gemma 2 variant, including examples for both simple and chat-templated prompts.
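For the chat-templated prompts mentioned in the last highlight, a rough sketch using Gemma's turn markers follows; the llm and sampling_params objects are the ones from the sketch above, and the prompt text is illustrative only.

# Sketch: wrap the user message in Gemma's chat turn markers before
# generation (llm and sampling_params as constructed in the earlier sketch).
prompt = (
    "<start_of_turn>user\n"
    "Write a haiku about the ocean.<end_of_turn>\n"
    "<start_of_turn>model\n"
)
chat_outputs = llm.generate([prompt], sampling_params)
print(chat_outputs[0].outputs[0].text)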
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in issue comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request adds a new guide for serving Gemma models with vLLM. The guide is well-structured and covers the process from exporting a KerasHub model to running inference with vLLM. My feedback includes suggestions to improve code maintainability and clarity by removing redundant code and variables, and by refactoring repeated logic into a helper function.

Comment on lines +51 to +64
# Set export path
export_path = "./gemma_exported"

# Export to Hugging Face format
gemma_lm.export_to_transformers(export_path)
print(f"Model exported successfully to {export_path}")

"""
## Downloading Additional Metadata

vLLM requires complete Hugging Face model configuration files. Download these from the original Gemma repository on Hugging Face.
"""

SERVABLE_CKPT_DIR = "./gemma_exported"
Contributor:

medium

The variable export_path is defined and used, but then SERVABLE_CKPT_DIR is defined with the same value for the same purpose. To improve clarity and avoid redundancy, you can define SERVABLE_CKPT_DIR once at the beginning and use it throughout the script for the local export path.

Suggested change

Before:

# Set export path
export_path = "./gemma_exported"

# Export to Hugging Face format
gemma_lm.export_to_transformers(export_path)
print(f"Model exported successfully to {export_path}")

"""
## Downloading Additional Metadata

vLLM requires complete Hugging Face model configuration files. Download these from the original Gemma repository on Hugging Face.
"""

SERVABLE_CKPT_DIR = "./gemma_exported"

After:

# Set export path
SERVABLE_CKPT_DIR = "./gemma_exported"

# Export to Hugging Face format
gemma_lm.export_to_transformers(SERVABLE_CKPT_DIR)
print(f"Model exported successfully to {SERVABLE_CKPT_DIR}")

"""
## Downloading Additional Metadata

vLLM requires complete Hugging Face model configuration files. Download these from the original Gemma repository on Hugging Face.
"""

Collaborator:

+1
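The "Downloading Additional Metadata" step quoted above could be sketched as follows. The repo id google/gemma-2-2b-it and the file list are assumptions for illustration (the guide's actual download logic is not shown in this excerpt), and the Gemma repository requires accepting the model license on Hugging Face.

# Hypothetical sketch: pull only the config/tokenizer metadata vLLM needs
# into the exported checkpoint directory (repo id and file list assumed).
from huggingface_hub import snapshot_download

SERVABLE_CKPT_DIR = "./gemma_exported"

snapshot_download(
    repo_id="google/gemma-2-2b-it",
    local_dir=SERVABLE_CKPT_DIR,
    allow_patterns=[
        "config.json",
        "generation_config.json",
        "tokenizer_config.json",
        "tokenizer.json",
        "special_tokens_map.json",
    ],
)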

Comment on lines +89 to +93
# Verify the directory contents
print("Directory contents:")
for file in os.listdir(SERVABLE_CKPT_DIR):
    size = os.path.getsize(os.path.join(SERVABLE_CKPT_DIR, file)) / (1024 * 1024)
    print(f"{file}: {size:.2f} MB")
Contributor:

medium

This block of code for verifying directory contents is repeated three times in the script (here, at lines 121-124, and 136-139). To follow the DRY (Don't Repeat Yourself) principle and improve maintainability, you could refactor this logic into a helper function. You can define a function like print_directory_contents and call it where needed.

For example:

def print_directory_contents(directory):
    print(f"Directory contents for {directory}:")
    for file in os.listdir(directory):
        size = os.path.getsize(os.path.join(directory, file)) / (1024 * 1024)
        print(f"{file}: {size:.2f} MB")

Then you can replace this block and the other occurrences with a call to this function, e.g., print_directory_contents(SERVABLE_CKPT_DIR).

Disconnect TPU runtime (if applicable) and re-connect with a T4 GPU runtime before proceeding.
"""

from google.colab import drive
Contributor:

medium

The google.colab.drive module is already imported on line 103. This import is redundant and can be removed.

@abheesht17 self-requested a review August 19, 2025 07:28
@abheesht17 (Collaborator) left a comment

Thanks for the PR! Left some comments, the big ones being 2x mem, switching from TPU --> GPU midway, and mounting gDrive (can we do without it?)

@@ -0,0 +1,216 @@
"""
Title: Serving Gemma with vLLM
Author: Dhiraj
Collaborator:

Good to add your GitHub handle here!

Date created: 2025/08/16
Last modified: 2025/08/18
Description: Export Gemma models from KerasHub to Hugging Face and serve with vLLM for fast inference.
Accelerator: TPU and GPU
Collaborator:

Does vLLM work with TPUs as of now? I thought it was only for GPUs. In any case, let's just put one of the two here. Not both.

"""
## Introduction

This guide demonstrates how to export Gemma models from KerasHub to the Hugging Face format and serve them using vLLM for efficient, high-throughput inference. We'll walk through the process step-by-step, from loading a pre-trained Gemma model in KerasHub to running inferences with vLLM in a Google Colab environment.
Collaborator:

You can break long lines into multiple lines. I think the code formatter will fail if you write just one long line.

Collaborator:

inferences --> inference

Collaborator:

We can also use generation instead of inference in some places


This guide demonstrates how to export Gemma models from KerasHub to the Hugging Face format and serve them using vLLM for efficient, high-throughput inference. We'll walk through the process step-by-step, from loading a pre-trained Gemma model in KerasHub to running inferences with vLLM in a Google Colab environment.

vLLM is an optimized serving engine for large language models that leverages techniques like PagedAttention to enable continuous batching and high GPU utilization. By exporting KerasHub models to a compatible format, you can take advantage of vLLM's performance benefits while starting from the Keras ecosystem
Collaborator:

Let's add the reference paper and GitHub repo somewhere at the bottom of the introduction.

Collaborator:

I wouldn't mind if the description of vLLM is a bit more descriptive here. What do you think?


vLLM is an optimized serving engine for large language models that leverages techniques like PagedAttention to enable continuous batching and high GPU utilization. By exporting KerasHub models to a compatible format, you can take advantage of vLLM's performance benefits while starting from the Keras ecosystem

At present, this is supported only for Gemma 2 and its presets. In the future, there will be more coverage of the models in KerasHub.
Collaborator:

I think we can remove this line, because we will be adding support soon.
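As a rough illustration of the continuous-batching point in the vLLM description quoted above, several prompts can be submitted in one call and vLLM schedules them together; the prompts and parameters below are illustrative only, with llm as constructed in the earlier sketch.

# Sketch: vLLM batches these requests internally, which is where much of
# its throughput advantage comes from (llm as in the earlier sketch).
prompts = [
    "Explain PagedAttention in one sentence.",
    "List three uses of large language models.",
    "Summarize what KerasHub is.",
]
batch_outputs = llm.generate(prompts, SamplingParams(max_tokens=64))
for output in batch_outputs:
    print(output.prompt, "->", output.outputs[0].text)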

"""

"""shell
!pip install -q --upgrade keras-hub huggingface-hub
Collaborator:

keras-hub is not needed here, because before running the notebook generation script, autogen.py runs requirements.txt

"""
## Loading and Exporting the Model

Load a pre-trained Gemma 2 model from KerasHub using the 'gemma2_instruct_2b_en' preset. This is an instruction-tuned variant suitable for conversational tasks.
Collaborator:

'gemma2_instruct_2b_en' --> `gemma2_instruct_2b_en`


Load a pre-trained Gemma 2 model from KerasHub using the 'gemma2_instruct_2b_en' preset. This is an instruction-tuned variant suitable for conversational tasks.

**Note:** The export method needs to map the weights from Keras to safetensors, hence requiring double the RAM needed to load a preset. This is also the reason why we are running on a TPU instance in Colab as it offers more VRAM instead of GPU.
Collaborator:

requiring double the RAM needed to load a preset --> Oh, I thought we found a way to avoid this in the export PR?
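Putting the loading and export steps together with the reviewer's bfloat16 suggestion, a sketch might look like the following; the dtype argument and the exact print message are illustrative and not taken from the diff.

# Sketch: load the preset in bfloat16 (per the review comments) and export
# it to Hugging Face format so vLLM can serve it.
import keras_hub

SERVABLE_CKPT_DIR = "./gemma_exported"

gemma_lm = keras_hub.models.GemmaCausalLM.from_preset(
    "gemma2_instruct_2b_en", dtype="bfloat16"
)
gemma_lm.export_to_transformers(SERVABLE_CKPT_DIR)
print(f"Model exported to {SERVABLE_CKPT_DIR}")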

Comment on lines +51 to +64 (same excerpt as quoted above)

Collaborator:

+1

"""
## Saving to Google Drive

Save the files to Google Drive. This is needed because vLLM currently [does not support TPU v2 on Colab](https://docs.vllm.ai/en/v0.5.5/getting_started/tpu-installation.html) and cannot dynamically switch the backend to CPU. Switch to a different Colab GPU instance for serving after saving. If you are using Cloud TPU or GPU from the start, you may skip this step.
Collaborator:

I think if we can resolve the 2x mem issue, we can potentially go ahead with just GPUs. It's a bit awkward to switch backends here, and also won't work with autogen.py.

Also, are you using full precision? Might want to go with bfloat16

Collaborator:

We might not want to use gDrive here, best to save this inside /content on Colab
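For reference, the Drive round trip the current draft relies on would look roughly like the sketch below (the reviewers above suggest keeping files under /content instead); the destination path is illustrative.

# Sketch of the current draft's approach: mount Drive and copy the exported
# checkpoint so it survives switching from the TPU to the GPU runtime.
import shutil
from google.colab import drive

drive.mount("/content/drive")
shutil.copytree("./gemma_exported", "/content/drive/MyDrive/gemma_exported")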
