
feat: Support Gemini client for Gemini API and Vertex AI #5524

Draft · wants to merge 9 commits into main
Conversation

@yu-iskw commented Feb 13, 2025

Why are these changes needed?

This pull request introduces support for Google's Gemini API and Vertex AI into the autogen-ext package. The changes include two new client implementations, GeminiChatCompletionClient and VertexAIChatCompletionClient, which let users interact with Gemini models for advanced chat completions (a minimal usage sketch follows the feature list below). The new clients support:

  • Long Context Handling: Efficiently manage extended conversations with context caching.
  • Vision/Multimodal Inputs: Process image inputs and other multimedia data.
  • Function Calling: Integrate function/tool calling capabilities within chat interactions.
  • Structured Output: Handle responses in JSON format for easy post-processing.
  • Robust Error Handling & Streaming Responses: Improve the reliability and interactivity of chat completions.
  • Token Management: Accurate token counting and remaining token calculations.
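
For illustration, a minimal usage sketch of the proposed GeminiChatCompletionClient; the import path and constructor arguments (model, api_key) are assumptions based on this description and may differ from the final implementation.

import asyncio
import os

from autogen_core.models import UserMessage
from autogen_ext.models.gemini import GeminiChatCompletionClient  # hypothetical import path

async def main() -> None:
    # Assumed constructor arguments; the actual signature may differ.
    client = GeminiChatCompletionClient(
        model="gemini-2.0-flash",
        api_key=os.getenv("GEMINI_API_KEY"),
    )
    result = await client.create([UserMessage(content="Explain how AI works", source="user")])
    print(result.content)

asyncio.run(main())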

Related issue number

#3741
Closes #5528

Signed-off-by: Yu Ishikawa <[email protected]>
@yu-iskw (Author) commented Feb 13, 2025

@microsoft-github-policy-service agree

@ekzhu (Collaborator) commented Feb 14, 2025

Vision/Multimodal Inputs: Process image inputs and other multimedia data.

@yu-iskw does this client support multimodal output as well?

@yu-iskw (Author) commented Feb 14, 2025

@ekzhu Good point. I am looking for a better approach to support both text generation and image generation with Gemini, since the method and configuration for each are different. I would appreciate any ideas on how to handle this.

As far as I know, OpenAI and Azure OpenAI expose text and image generation through the same API, so there is no need to switch APIs depending on whether we want to generate text or images. I suppose it would be good to add a field to the ModelInfo class indicating whether a model supports image generation. That way, we can select the appropriate API for a user's request based on that information.

class ModelInfo(TypedDict, total=False):
    vision: Required[bool]
    """True if the model supports vision, aka image input, otherwise False."""
    function_calling: Required[bool]
    """True if the model supports function calling, otherwise False."""
    json_output: Required[bool]
    """True if the model supports json output, otherwise False. Note: this is different to structured json."""
    family: Required[ModelFamily.ANY | str]
    """Model family should be one of the constants from :py:class:`ModelFamily` or a string representing an unknown model family."""

If there is no effective way to do this at the moment and it would require changing a core component such as ModelInfo, I think it might be good to start with text generation only and support image generation later.

[UPDATE]
I have come up with a tentative solution to handle this. We can add a model family such as IMAGEN_3_0 and use that information to determine whether a Gemini model supports image generation.
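
A rough sketch of what that could look like; the IMAGEN_3_0 constant and the supports_image_generation helper are hypothetical and only illustrate the idea:

# Hypothetical additions to illustrate the idea; names and placement are not final.
class ModelFamily:
    GEMINI_2_0_FLASH = "gemini-2.0-flash"
    IMAGEN_3_0 = "imagen-3.0"  # proposed family for image generation models

_IMAGE_GENERATION_FAMILIES = {ModelFamily.IMAGEN_3_0}

def supports_image_generation(family: str) -> bool:
    """Return True if the model family is known to support image generation."""
    return family in _IMAGE_GENERATION_FAMILIES

# The client could then branch on this, e.g. call generate_images for
# image generation families and generate_content otherwise.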

Sample Code

Text Generation

import os

from google import genai
from google.genai import types

client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"))
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="Explain how AI works",
    config=types.GenerateContentConfig(
        temperature=0.5,
    ),
)

Image Generation

from io import BytesIO

from google import genai  # type: ignore[import]
from google.genai import types  # type: ignore[import]
from PIL import Image

client = genai.Client(vertexai=True, location="us-central1")

response = client.models.generate_images(
    model="imagen-3.0-generate-002",
    prompt="Fuzzy bunnies in my kitchen",
    config=types.GenerateImagesConfig(
        number_of_images=4,
    ),
)
for generated_image in response.generated_images:
    image = Image.open(BytesIO(generated_image.image.image_bytes))
    image.show()

@yu-iskw (Author) commented Feb 14, 2025

NOTE: We can retrieve model information programmatically. With the Gemini API, the response includes supported_actions such as predict and generateContent. However, the Vertex AI API returns less information.

import json
import os
from pprint import pprint

from google import genai

models = [
    "gemini-1.5-flash",
    "gemini-1.5-pro",
    "gemini-2.0-flash",
    "imagen-3.0-generate-002",
    "text-embedding-004",
]

# Gemini API
print("==================== Gemini API ====================")
client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"))
for model_name in models:
    model = client.models.get(model=model_name)
    pprint(f"{model_name}: {json.dumps(model.to_json_dict(), indent=2)}")

# Vertex AI
print("==================== Vertex AI ====================")
client = genai.Client(vertexai=True, location="us-central1")
for model_name in models:
    model = client.models.get(model=model_name)
    pprint(f"{model_name}: {json.dumps(model.to_json_dict(), indent=2)}")
Model Information (Gemini API)
==================== Gemini API ====================
('gemini-1.5-flash: {\n'
 '  "name": "models/gemini-1.5-flash",\n'
 '  "display_name": "Gemini 1.5 Flash",\n'
 '  "description": "Alias that points to the most recent stable version of '
 'Gemini 1.5 Flash, our fast and versatile multimodal model for scaling across '
 'diverse tasks.",\n'
 '  "version": "001",\n'
 '  "tuned_model_info": {},\n'
 '  "input_token_limit": 1000000,\n'
 '  "output_token_limit": 8192,\n'
 '  "supported_actions": [\n'
 '    "generateContent",\n'
 '    "countTokens"\n'
 '  ]\n'
 '}')
('gemini-1.5-pro: {\n'
 '  "name": "models/gemini-1.5-pro",\n'
 '  "display_name": "Gemini 1.5 Pro",\n'
 '  "description": "Stable version of Gemini 1.5 Pro, our mid-size multimodal '
 'model that supports up to 2 million tokens, released in May of 2024.",\n'
 '  "version": "001",\n'
 '  "tuned_model_info": {},\n'
 '  "input_token_limit": 2000000,\n'
 '  "output_token_limit": 8192,\n'
 '  "supported_actions": [\n'
 '    "generateContent",\n'
 '    "countTokens"\n'
 '  ]\n'
 '}')
('gemini-2.0-flash: {\n'
 '  "name": "models/gemini-2.0-flash",\n'
 '  "display_name": "Gemini 2.0 Flash",\n'
 '  "description": "Gemini 2.0 Flash",\n'
 '  "version": "2.0",\n'
 '  "tuned_model_info": {},\n'
 '  "input_token_limit": 1048576,\n'
 '  "output_token_limit": 8192,\n'
 '  "supported_actions": [\n'
 '    "generateContent",\n'
 '    "countTokens",\n'
 '    "bidiGenerateContent"\n'
 '  ]\n'
 '}')
('imagen-3.0-generate-002: {\n'
 '  "name": "models/imagen-3.0-generate-002",\n'
 '  "display_name": "Imagen 3.0 002 model",\n'
 '  "description": "Vertex served Imagen 3.0 002 model",\n'
 '  "version": "002",\n'
 '  "tuned_model_info": {},\n'
 '  "input_token_limit": 480,\n'
 '  "output_token_limit": 8192,\n'
 '  "supported_actions": [\n'
 '    "predict"\n'
 '  ]\n'
 '}')
('text-embedding-004: {\n'
 '  "name": "models/text-embedding-004",\n'
 '  "display_name": "Text Embedding 004",\n'
 '  "description": "Obtain a distributed representation of a text.",\n'
 '  "version": "004",\n'
 '  "tuned_model_info": {},\n'
 '  "input_token_limit": 2048,\n'
 '  "output_token_limit": 1,\n'
 '  "supported_actions": [\n'
 '    "embedContent"\n'
 '  ]\n'
 '}')
Model Information (Vertex AI)
==================== Vertex AI ====================
('gemini-1.5-flash: {\n'
 '  "name": "publishers/google/models/gemini-1.5-flash",\n'
 '  "version": "default",\n'
 '  "tuned_model_info": {}\n'
 '}')
('gemini-1.5-pro: {\n'
 '  "name": "publishers/google/models/gemini-1.5-pro",\n'
 '  "version": "default",\n'
 '  "tuned_model_info": {}\n'
 '}')
('gemini-2.0-flash: {\n'
 '  "name": "publishers/google/models/gemini-2.0-flash",\n'
 '  "version": "default",\n'
 '  "tuned_model_info": {}\n'
 '}')
('imagen-3.0-generate-002: {\n'
 '  "name": "publishers/google/models/imagen-3.0-generate-002",\n'
 '  "version": "default",\n'
 '  "tuned_model_info": {}\n'
 '}')
('text-embedding-004: {\n'
 '  "name": "publishers/google/models/text-embedding-004",\n'
 '  "version": "default",\n'
 '  "tuned_model_info": {}\n'
 '}')
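
One way the client could use this is to inspect supported_actions (when it is available, i.e. with the Gemini API) to decide whether to call generate_images or generate_content. A minimal sketch, with the generate helper being a hypothetical routing function:

from google import genai

def generate(client: genai.Client, model_name: str, prompt: str):
    """Hypothetical helper: route to the image or text API based on model metadata."""
    model = client.models.get(model=model_name)
    actions = model.supported_actions or []
    if "predict" in actions and "generateContent" not in actions:
        # Imagen-style models only expose predict, so use the image generation API.
        return client.models.generate_images(model=model_name, prompt=prompt)
    # Default to text/multimodal content generation.
    return client.models.generate_content(model=model_name, contents=prompt)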

@ekzhu (Collaborator) commented Feb 14, 2025

@siscanu please provide your feedback on this PR here.

@chengyu-liu-cs commented
Another thing to take into account is how to use the tools provided by Google in AutoGen together with other FunctionTools.

Just out of curiosity, do other clients (like OpenAIChatCompletionClient) support the use of tools provided by Google or others? Or do users need to create a wrapper function?

@yu-iskw (Author) commented Feb 20, 2025

Another thing to take into account is how to use the tools provided by Google in AutoGen together with other FunctionTools.

Just out of curiosity, do other clients (like OpenAIChatCompletionClient) support the use of tools provided by Google or others? Or do users need to create a wrapper function?

First of all, we should enable users to use AutoGen tools with the client; a rough sketch of how an AutoGen tool schema could be mapped to a Gemini function declaration is shown after the list below. On top of that, it would be good to discuss how to support google-genai tools with this AutoGen Gemini client. There are two possible directions:

  1. Implement a function/class to convert a Gemini tool into an AutoGen tool.
  2. Allow Gemini tools to be bound directly to this Gemini client.
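
Regarding the first sentence above (using AutoGen tools with this client), a minimal sketch of how an AutoGen tool schema could be translated into a google-genai function declaration; the autogen_tool_to_gemini helper is hypothetical, and the parameters field may need an extra mapping from JSON Schema to Gemini's Schema format:

from autogen_core.tools import FunctionTool
from google.genai import types

def autogen_tool_to_gemini(tool: FunctionTool) -> types.Tool:
    """Hypothetical helper: build a Gemini Tool from an AutoGen tool schema."""
    schema = tool.schema  # ToolSchema: name, description, parameters (JSON Schema)
    declaration = types.FunctionDeclaration(
        name=schema["name"],
        description=schema.get("description", ""),
        parameters=schema.get("parameters"),  # may need JSON Schema -> Gemini Schema conversion
    )
    return types.Tool(function_declarations=[declaration])

# Usage example with a simple Python function wrapped as an AutoGen tool.
async def get_weather(city: str) -> str:
    return f"The weather in {city} is sunny."

weather_tool = FunctionTool(get_weather, description="Get the weather for a city.")
gemini_tool = autogen_tool_to_gemini(weather_tool)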

@yu-iskw (Author) commented Feb 20, 2025

I am still working on it, but the direction for how to implement the client is becoming clear.
