From 73cd17723ca119de9b632809156a9c4e5ae98a0a Mon Sep 17 00:00:00 2001 From: Eric Dong Date: Thu, 12 Dec 2024 23:38:49 +0000 Subject: [PATCH 1/7] feat: Add intro to multimodal live api notebook --- .../intro_multimodal_live_api.ipynb | 621 ++++++++++++++++++ 1 file changed, 621 insertions(+) create mode 100644 gemini/multimodal-live-api/intro_multimodal_live_api.ipynb diff --git a/gemini/multimodal-live-api/intro_multimodal_live_api.ipynb b/gemini/multimodal-live-api/intro_multimodal_live_api.ipynb new file mode 100644 index 00000000000..a10dc893b03 --- /dev/null +++ b/gemini/multimodal-live-api/intro_multimodal_live_api.ipynb @@ -0,0 +1,621 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "oXnEutuDQa9c" + }, + "outputs": [], + "source": [ + "# Copyright 2024 Google LLC\n", + "#\n", + "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", + "# you may not use this file except in compliance with the License.\n", + "# You may obtain a copy of the License at\n", + "#\n", + "# https://www.apache.org/licenses/LICENSE-2.0\n", + "#\n", + "# Unless required by applicable law or agreed to in writing, software\n", + "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", + "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", + "# See the License for the specific language governing permissions and\n", + "# limitations under the License." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JAPoU8Sm5E6e" + }, + "source": [ + "# Getting Started with the Multimodal Live API in Vertex AI\n", + "\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + "
\n",
+    "Open in Colab | Open in Colab Enterprise | Open in Vertex AI Workbench | View on GitHub\n",
+    "
\n", + "\n", + "Share to:\n", + "\n", + "\n", + " \"LinkedIn\n", + "\n", + "\n", + "\n", + " \"Bluesky\n", + "\n", + "\n", + "\n", + " \"X\n", + "\n", + "\n", + "\n", + " \"Reddit\n", + "\n", + "\n", + "\n", + " \"Facebook\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "84f0f73a0f76" + }, + "source": [ + "| | |\n", + "|-|-|\n", + "| Author(s) | [Eric Dong](https://github.com/gericdong), [Holt Skinner](https://github.com/holtskinner) |" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tvgnzT1CKxrO" + }, + "source": [ + "## Overview\n", + "\n", + "The Multimodal Live API enables low-latency bidirectional voice and video interactions with Gemini. The API can process text, audio, and video input, and it can provide text and audio output. This tutorial demonstrates the following simple examples to help you get started with the Multimodal Live API in Vertex AI.\n", + "\n", + "- Text-to-text generation\n", + "- Text-to-audio generation\n", + "- Text-to-audio conversation\n", + "\n", + "See the [Multimodal Live API](https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/multimodal-live) page for more details." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gPiTOAHURvTM" + }, + "source": [ + "## Getting Started" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CHRZUpfWSEpp" + }, + "source": [ + "### Install libraries" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "sG3_LKsWSD3A" + }, + "outputs": [], + "source": [ + "%pip install --quiet websockets" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HlMVjiAWSMNX" + }, + "source": [ + "### Authenticate your notebook environment (Colab only)\n", + "\n", + "If you are running this notebook on Google Colab, run the cell below to authenticate your environment." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "12fnq4V0SNV3" + }, + "outputs": [], + "source": [ + "import sys\n", + "\n", + "if \"google.colab\" in sys.modules:\n", + " from google.colab import auth\n", + "\n", + " auth.authenticate_user()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "41oBMp0YraPr" + }, + "source": [ + "### Set Google Cloud project information\n", + "\n", + "To get started using Vertex AI, you must have an existing Google Cloud project and [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com). Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "mRv0DzDjrigq" + }, + "outputs": [], + "source": [ + "PROJECT_ID = \"[your-project-id]\" # @param {type: \"string\"}\n", + "LOCATION = \"us-central1\" # @param {type: \"string\"}\n", + "\n", + "HOST = \"us-central1-aiplatform.googleapis.com\"\n", + "SERVICE_URL = f\"wss://{HOST}/ws/google.cloud.aiplatform.v1beta1.LlmBidiService/BidiGenerateContent\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "N1mXI5aDB8vl" + }, + "source": [ + "## Generate an access token\n", + "\n", + "`gcloud auth application-default print-access-token` generates and prints an access token for the current Application Default Credential. The default access token lifetime is 3600 seconds. This access token will be used to connect to the WebSocket server." 
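,
+    "\n",
+    "As an alternative to the `gcloud` shell magic below, here is a minimal sketch that fetches the same token with the `google-auth` library (assuming Application Default Credentials are already configured):\n",
+    "\n",
+    "```python\n",
+    "# Sketch: obtain a bearer token via google-auth instead of the gcloud CLI.\n",
+    "# Assumes ADC is set up, e.g. via `gcloud auth application-default login`.\n",
+    "import google.auth\n",
+    "import google.auth.transport.requests\n",
+    "\n",
+    "creds, _ = google.auth.default(\n",
+    "    scopes=[\"https://www.googleapis.com/auth/cloud-platform\"]\n",
+    ")\n",
+    "creds.refresh(google.auth.transport.requests.Request())\n",
+    "access_token = creds.token  # use as the Bearer value in request headers\n",
+    "```"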
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Nb_bHsEhe-37" + }, + "outputs": [], + "source": [ + "bearer_token = !gcloud auth application-default print-access-token" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5M7EKckIYVFy" + }, + "source": [ + "### Use the Gemini 2.0 Flash model\n", + "\n", + "Multimodal Live API is a new capability introduced with the [Gemini 2.0 Flash model](https://cloud.google.com/vertex-ai/generative-ai/docs/gemini-v2)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "-coEslfWPrxo" + }, + "outputs": [], + "source": [ + "MODEL_ID = \"gemini-2.0-flash-exp\" # @param {type: \"string\"}\n", + "\n", + "MODEL = (\n", + " f\"projects/{PROJECT_ID}/locations/{LOCATION}/publishers/google/models/{MODEL_ID}\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0Ef0zVX-X9Bg" + }, + "source": [ + "### Import libraries\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "QNxC25Pg4Hfr" + }, + "outputs": [], + "source": [ + "import base64\n", + "import json\n", + "\n", + "from IPython.display import Audio, Markdown, display\n", + "import numpy as np\n", + "from websockets.asyncio.client import connect" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "k9jAArxzClXz" + }, + "source": [ + "## Use the Multimodal Live API\n", + "\n", + "Multimodal Live API is a stateful API that uses WebSockets. This section shows some basic examples of how to use Multimodal Live API for text-to-text and text-to-audio generation." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Q1DE3s_LIUuE" + }, + "source": [ + "### **Example 1**: Text-to-text generation\n", + "\n", + "You send a text prompt and receive a text message.\n", + "\n", + "**Notes**\n", + "- A session `ws` represents a single WebSocket connection between the client and the server.\n", + "- After a new connection is initiated, the session can exchange messages with the server.\n", + "- Messages are JSON-formatted strings exchanged over the WebSocket connection.\n", + "- The first message to be sent should be a [`setup`](https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/multimodal-live#bidigeneratecontentsetup) message that contains the model, generation parameters, system instructions, and tools.\n", + " - `response_modalities` accepts `TEXT` or `AUDIO`.\n", + "- To receive messages from the server, listen for the WebSocket `message` event, and then parse the result according to the definition of supported server messages.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "sAFvSv2IecqA" + }, + "outputs": [], + "source": [ + "# Set model generation_config\n", + "CONFIG = {\"response_modalities\": [\"TEXT\"]}\n", + "\n", + "headers = {\n", + " \"Content-Type\": \"application/json\",\n", + " \"Authorization\": \"Bearer {}\".format(bearer_token[0]),\n", + "}\n", + "\n", + "# Connect to the server\n", + "async with connect(SERVICE_URL, additional_headers=headers) as ws:\n", + " # Setup the session\n", + " await ws.send(\n", + " json.dumps(\n", + " {\n", + " \"setup\": {\n", + " \"model\": MODEL,\n", + " \"generation_config\": CONFIG,\n", + " }\n", + " }\n", + " )\n", + " )\n", + "\n", + " # Receive setup response\n", + " raw_response = await ws.recv(decode=False)\n", + " setup_response = json.loads(raw_response.decode(\"ascii\"))\n", + "\n", + " # Send text message\n", + " text_input = \"Hello? 
Gemini are you there?\"\n", + " display(Markdown(f\"**Input:** {text_input}\"))\n", + "\n", + " msg = {\n", + " \"client_content\": {\n", + " \"turns\": [{\"role\": \"user\", \"parts\": [{\"text\": text_input}]}],\n", + " \"turn_complete\": True,\n", + " }\n", + " }\n", + "\n", + " await ws.send(json.dumps(msg))\n", + "\n", + " responses = []\n", + "\n", + " # Receive chucks of server response\n", + " async for raw_response in ws:\n", + " response = json.loads(raw_response.decode())\n", + " server_content = response.pop(\"serverContent\", None)\n", + " if server_content is None:\n", + " break\n", + "\n", + " model_turn = server_content.pop(\"modelTurn\", None)\n", + " if model_turn is not None:\n", + " parts = model_turn.pop(\"parts\", None)\n", + " if parts is not None:\n", + " responses.append(parts[0][\"text\"])\n", + "\n", + " # End of turn\n", + " turn_complete = server_content.pop(\"turnComplete\", None)\n", + " if turn_complete:\n", + " break\n", + "\n", + " # Print the server response\n", + " display(Markdown(f\"**Response >** {''.join(responses)}\"))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cG3346aA9sRR" + }, + "source": [ + "## **Example 2**: Text-to-audio generataion\n", + "\n", + "You send a text prompt and receive a model response in audio.\n", + "\n", + "**Notes**\n", + "- Multimodal Live API supports the following voices:\n", + " - Puck\n", + " - Charon\n", + " - Kore\n", + " - Fenrir\n", + " - Aoede\n", + "- To specify a voice, set the voice_name within the speech_config object, as part of your session configuration.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Iz3OkQ-a51QM" + }, + "outputs": [], + "source": [ + "# Set model generation_config\n", + "CONFIG = {\n", + " \"response_modalities\": [\"AUDIO\"],\n", + " \"speech_config\": {\n", + " \"voice_config\": {\"prebuilt_voice_config\": {\"voice_name\": \"Aoede\"}}\n", + " },\n", + "}\n", + "\n", + "headers = {\n", + " \"Content-Type\": \"application/json\",\n", + " \"Authorization\": \"Bearer {}\".format(bearer_token[0]),\n", + "}\n", + "\n", + "# Connect to the server\n", + "async with connect(SERVICE_URL, additional_headers=headers) as ws:\n", + " # Setup the session\n", + " await ws.send(\n", + " json.dumps(\n", + " {\n", + " \"setup\": {\n", + " \"model\": MODEL,\n", + " \"generation_config\": CONFIG,\n", + " }\n", + " }\n", + " )\n", + " )\n", + "\n", + " # Receive setup response\n", + " raw_response = await ws.recv(decode=False)\n", + " setup_response = json.loads(raw_response.decode(\"ascii\"))\n", + "\n", + " # Send text message\n", + " text_input = \"Hello? 
Gemini are you there?\"\n", + " display(Markdown(f\"**Input:** {text_input}\"))\n", + "\n", + " msg = {\n", + " \"client_content\": {\n", + " \"turns\": [{\"role\": \"user\", \"parts\": [{\"text\": text_input}]}],\n", + " \"turn_complete\": True,\n", + " }\n", + " }\n", + "\n", + " await ws.send(json.dumps(msg))\n", + "\n", + " responses = []\n", + "\n", + " # Receive chucks of server response\n", + " async for raw_response in ws:\n", + " response = json.loads(raw_response.decode())\n", + " server_content = response.pop(\"serverContent\", None)\n", + " if server_content is None:\n", + " break\n", + "\n", + " model_turn = server_content.pop(\"modelTurn\", None)\n", + " if model_turn is not None:\n", + " parts = model_turn.pop(\"parts\", None)\n", + " if parts is not None:\n", + " for part in parts:\n", + " pcm_data = base64.b64decode(part[\"inlineData\"][\"data\"])\n", + " responses.append(np.frombuffer(pcm_data, dtype=np.int16))\n", + "\n", + " # End of turn\n", + " turn_complete = server_content.pop(\"turnComplete\", None)\n", + " if turn_complete:\n", + " break\n", + "\n", + " # Play the returned audio message\n", + " display(Audio(np.concatenate(responses), rate=24000, autoplay=True))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JOBlWf566HOx" + }, + "source": [ + "## **Example 3**: Text-to-audio conversation\n", + "\n", + "**Step 1**: You set up a conversation with the API that allows you to send text prompts and receive audio responses.\n", + "\n", + "**Notes**\n", + "\n", + "- While the model keeps track of in-session interactions, explicit session history accessible through the API isn't available yet. When a session is terminated the corresponding context is erased.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "-eu4QL0L3HZi" + }, + "outputs": [], + "source": [ + "# Set model generation_config\n", + "CONFIG = {\"response_modalities\": [\"AUDIO\"]}\n", + "\n", + "headers = {\n", + " \"Content-Type\": \"application/json\",\n", + " \"Authorization\": \"Bearer {}\".format(bearer_token[0]),\n", + "}\n", + "\n", + "\n", + "async def main() -> None:\n", + " # Connect to the server\n", + " async with connect(SERVICE_URL, additional_headers=headers) as ws:\n", + "\n", + " # Setup the session\n", + " async def setup() -> None:\n", + " await ws.send(\n", + " json.dumps(\n", + " {\n", + " \"setup\": {\n", + " \"model\": MODEL,\n", + " \"generation_config\": CONFIG,\n", + " }\n", + " }\n", + " )\n", + " )\n", + "\n", + " # Receive setup response\n", + " raw_response = await ws.recv(decode=False)\n", + " setup_response = json.loads(raw_response.decode(\"ascii\"))\n", + " print(f\"Connected: {setup_response}\")\n", + " return\n", + "\n", + " # Send text message\n", + " async def send() -> bool:\n", + " text_input = input(\"Input > \")\n", + " if text_input.lower() in (\"q\", \"quit\", \"exit\"):\n", + " return False\n", + "\n", + " msg = {\n", + " \"client_content\": {\n", + " \"turns\": [{\"role\": \"user\", \"parts\": [{\"text\": text_input}]}],\n", + " \"turn_complete\": True,\n", + " }\n", + " }\n", + "\n", + " await ws.send(json.dumps(msg))\n", + " return True\n", + "\n", + " # Receive server response\n", + " async def receive() -> None:\n", + " responses = []\n", + "\n", + " # Receive chucks of server response\n", + " async for raw_response in ws:\n", + " response = json.loads(raw_response.decode())\n", + " server_content = response.pop(\"serverContent\", None)\n", + " if server_content is None:\n", + " break\n", + "\n", + " 
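# Each server message may carry a modelTurn whose parts hold\n",
+    "            # base64-encoded 16-bit PCM audio under inlineData.\n",
+    "            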
model_turn = server_content.pop(\"modelTurn\", None)\n", + " if model_turn is not None:\n", + " parts = model_turn.pop(\"parts\", None)\n", + " if parts is not None:\n", + " for part in parts:\n", + " pcm_data = base64.b64decode(part[\"inlineData\"][\"data\"])\n", + " responses.append(np.frombuffer(pcm_data, dtype=np.int16))\n", + "\n", + " # End of turn\n", + " turn_complete = server_content.pop(\"turnComplete\", None)\n", + " if turn_complete:\n", + " break\n", + "\n", + " # Play the returned audio message\n", + " display(Markdown(\"**Response >**\"))\n", + " display(Audio(np.concatenate(responses), rate=24000, autoplay=True))\n", + " return\n", + "\n", + " await setup()\n", + "\n", + " while True:\n", + " if not await send():\n", + " break\n", + " await receive()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "94IeUUb3e90M" + }, + "source": [ + "**Step 2** Start the conversation, input your prompts, or type `q`, `quit` or `exit` to exit.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "2UvgUDIYJqfw" + }, + "outputs": [], + "source": [ + "await main()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "usjiqTDXfk_6" + }, + "source": [ + "## What's next\n", + "\n", + "\n", + "- Try [getting started with the Multimodal Live API with the Gen AI SDK](https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/getting-started/intro_genai_sdk.ipynb)\n", + "- Learn how to [build a web application that enables you to use your voice and camera to talk to Gemini 2.0 through the Multimodal Live API.](https://github.com/GoogleCloudPlatform/generative-ai/tree/main/gemini/multimodal-live-api/websocket-demo-app)\n", + "- See the [Multimodal Live API reference docs](https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/multimodal-live).\n", + "- Explore other notebooks in the [Google Cloud Generative AI GitHub repository](https://github.com/GoogleCloudPlatform/generative-ai)." 
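,
+    "\n",
+    "The audio examples above play the model's 16-bit PCM output directly at 24 kHz. If you also want to keep the audio, a small helper like the following (a sketch assuming the same 24 kHz mono PCM format) writes a turn's concatenated chunks to a WAV file:\n",
+    "\n",
+    "```python\n",
+    "# Sketch: persist one model turn's PCM chunks as a WAV file.\n",
+    "# Assumes 24 kHz, mono, 16-bit samples, matching the playback rate above.\n",
+    "import wave\n",
+    "\n",
+    "import numpy as np\n",
+    "\n",
+    "\n",
+    "def save_pcm_to_wav(chunks: list[np.ndarray], path: str, rate: int = 24000) -> None:\n",
+    "    pcm = np.concatenate(chunks)\n",
+    "    with wave.open(path, \"wb\") as wf:\n",
+    "        wf.setnchannels(1)  # mono\n",
+    "        wf.setsampwidth(2)  # 16-bit\n",
+    "        wf.setframerate(rate)\n",
+    "        wf.writeframes(pcm.astype(np.int16).tobytes())\n",
+    "\n",
+    "\n",
+    "# Example: save_pcm_to_wav(responses, \"gemini_reply.wav\")\n",
+    "```"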
+ ] + } + ], + "metadata": { + "colab": { + "name": "intro_multimodal_live_api.ipynb", + "toc_visible": true + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} From 36d7e8343f8327a05eeeec9dd12e15dd8725ab68 Mon Sep 17 00:00:00 2001 From: Eric Dong Date: Fri, 13 Dec 2024 15:40:27 +0000 Subject: [PATCH 2/7] Add more content --- .../intro_multimodal_live_api.ipynb | 4 +- .../intro_multimodal_live_api_genai_sdk.ipynb | 67 ++++++++++++++----- 2 files changed, 53 insertions(+), 18 deletions(-) diff --git a/gemini/multimodal-live-api/intro_multimodal_live_api.ipynb b/gemini/multimodal-live-api/intro_multimodal_live_api.ipynb index a10dc893b03..08f3e69c32d 100644 --- a/gemini/multimodal-live-api/intro_multimodal_live_api.ipynb +++ b/gemini/multimodal-live-api/intro_multimodal_live_api.ipynb @@ -370,7 +370,7 @@ "id": "cG3346aA9sRR" }, "source": [ - "## **Example 2**: Text-to-audio generataion\n", + "### **Example 2**: Text-to-audio generation\n", "\n", "You send a text prompt and receive a model response in audio.\n", "\n", @@ -468,7 +468,7 @@ "id": "JOBlWf566HOx" }, "source": [ - "## **Example 3**: Text-to-audio conversation\n", + "### **Example 3**: Text-to-audio conversation\n", "\n", "**Step 1**: You set up a conversation with the API that allows you to send text prompts and receive audio responses.\n", "\n", diff --git a/gemini/multimodal-live-api/intro_multimodal_live_api_genai_sdk.ipynb b/gemini/multimodal-live-api/intro_multimodal_live_api_genai_sdk.ipynb index 41f2f719b3b..60e081733ae 100644 --- a/gemini/multimodal-live-api/intro_multimodal_live_api_genai_sdk.ipynb +++ b/gemini/multimodal-live-api/intro_multimodal_live_api_genai_sdk.ipynb @@ -29,7 +29,7 @@ "id": "JAPoU8Sm5E6e" }, "source": [ - "# Multimodal Live API with Gen AI SDK\n", + "# Getting Started with the Multimodal Live API using Gen AI SDK\n", "\n", "\n", "\n", @@ -99,11 +99,11 @@ "source": [ "## Overview\n", "\n", - "The Multimodal Live API enables low-latency bidirectional voice and video interactions with Gemini. Multimodal Live API is designed for server-to-server communication. This notebook demonstrates the following simple examples to help you get started with the Multimodal Live API using the Google Gen AI SDK in Vertex AI.\n", + "The Multimodal Live API enables low-latency bidirectional voice and video interactions with Gemini. The API can process text, audio, and video input, and it can provide text and audio output. This tutorial demonstrates the following simple examples to help you get started with the Multimodal Live API using the Google Gen AI SDK in Vertex AI.\n", "\n", - "- Text to text\n", - "- Text to audio\n", - "- Text to audio in a chat\n", + "- Text-to-text generation\n", + "- Text-to-audio generation\n", + "- Text-to-audio conversation\n", "\n", "See the [Multimodal Live API](https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/multimodal-live) page for more details." ] @@ -175,7 +175,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 36, "metadata": { "id": "xBCH3hnAX9Bh" }, @@ -234,7 +234,7 @@ "id": "5M7EKckIYVFy" }, "source": [ - "### Load the Gemini 2.0 Flash model\n", + "### Use the Gemini 2.0 Flash model\n", "\n", "Multimodal Live API is a new capability introduced with the [Gemini 2.0 Flash model](https://cloud.google.com/vertex-ai/generative-ai/docs/gemini-v2)." 
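,
+    "\n",
+    "A minimal client-setup sketch for this notebook (assuming the `google-genai` package is installed and you authenticate with Application Default Credentials):\n",
+    "\n",
+    "```python\n",
+    "# Sketch: create a Gen AI SDK client in Vertex AI mode.\n",
+    "from google import genai\n",
+    "\n",
+    "client = genai.Client(vertexai=True, project=PROJECT_ID, location=LOCATION)\n",
+    "MODEL_ID = \"gemini-2.0-flash-exp\"  # the model used throughout this notebook\n",
+    "```"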
] @@ -253,12 +253,32 @@ { "cell_type": "markdown", "metadata": { - "id": "Q1DE3s_LIUuE" + "id": "b51c5ced31f7" }, "source": [ - "## **Example 1**: Text to text\n", + "## Use the Multimodal Live API\n", "\n", - "You send one text prompt and receive text response." + "Multimodal Live API is a stateful API that uses WebSockets. This section shows some basic examples of how to use Multimodal Live API for text-to-text and text-to-audio generation." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Q1DE3s_LIUuE" + }, + "source": [ + "### **Example 1**: Text-to-text generation\n", + "\n", + "You send a text prompt and receive a text message.\n", + "\n", + "**Notes**\n", + "- A session `session` represents a single WebSocket connection between the client and the server.\n", + "- A session configuration includes the model, generation parameters, system instructions, and tools.\n", + " - `response_modalities` accepts `TEXT` or `AUDIO`.\n", + "- After a new session is initiated, the session can exchange messages with the server to\n", + " - Send text, audio, or video to the server.\n", + " - Receive audio, text, or function call responses from the server.\n", + "- When sending messages to the server, set `end_of_turn` to `True` to indicate that the server content generation should start with the currently accumulated prompt. Otherwise, the server awaits additional messages before starting generation." ] }, { @@ -295,9 +315,18 @@ "id": "cG3346aA9sRR" }, "source": [ - "## **Example 2**: Text to audio\n", - "\n", - "You send text prompts and receive responses in audio.\n" + "### **Example 2**: Text-to-audio generation\n", + "\n", + "You send a text prompt and receive a model response in audio.\n", + "\n", + "**Notes**\n", + "- Multimodal Live API supports the following voices:\n", + " - Puck\n", + " - Charon\n", + " - Kore\n", + " - Fenrir\n", + " - Aoede\n", + "- To specify a voice, set the voice_name within the speech_config object, as part of your session configuration.\n" ] }, { @@ -308,7 +337,9 @@ }, "outputs": [], "source": [ - "config = LiveConnectConfig(response_modalities=[\"AUDIO\"])\n", + "config = LiveConnectConfig(\n", + " response_modalities=[\"AUDIO\"],\n", + ")\n", "\n", "async with client.aio.live.connect(\n", " model=MODEL_ID,\n", @@ -338,9 +369,13 @@ "id": "JOBlWf566HOx" }, "source": [ - "## **Example 3**: Text to audio in a chat\n", + "### **Example 3**: Text-to-audio conversation\n", + "\n", + "**Step 1**: You set up a conversation with the API that allows you to send text prompts and receive audio responses.\n", + "\n", + "**Notes**\n", "\n", - "**Step 1**: You set up a chat with the API to answer your text prompts and return responses in audio." + "- While the model keeps track of in-session interactions, explicit session history accessible through the API isn't available yet. When a session is terminated the corresponding context is erased." 
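,
+    "\n",
+    "For orientation, here is a compact sketch of one text-in / audio-out turn with the SDK's live session (method names follow this notebook's notes; treat exact signatures as assumptions against your installed `google-genai` version):\n",
+    "\n",
+    "```python\n",
+    "# Sketch: one conversational turn over a live session.\n",
+    "from google.genai.types import LiveConnectConfig\n",
+    "\n",
+    "config = LiveConnectConfig(response_modalities=[\"AUDIO\"])\n",
+    "\n",
+    "async with client.aio.live.connect(model=MODEL_ID, config=config) as session:\n",
+    "    await session.send(input=\"Hello, Gemini!\", end_of_turn=True)\n",
+    "    audio_chunks = []\n",
+    "    async for message in session.receive():\n",
+    "        if message.data:  # inline audio bytes from the server\n",
+    "            audio_chunks.append(message.data)\n",
+    "```"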
] }, { From 80189e732840ea12013daaa08edebbc7e9a488a0 Mon Sep 17 00:00:00 2001 From: Holt Skinner Date: Fri, 13 Dec 2024 11:20:48 -0600 Subject: [PATCH 3/7] Formatting --- gemini/multimodal-live-api/intro_multimodal_live_api.ipynb | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/gemini/multimodal-live-api/intro_multimodal_live_api.ipynb b/gemini/multimodal-live-api/intro_multimodal_live_api.ipynb index 08f3e69c32d..104378aca55 100644 --- a/gemini/multimodal-live-api/intro_multimodal_live_api.ipynb +++ b/gemini/multimodal-live-api/intro_multimodal_live_api.ipynb @@ -306,7 +306,7 @@ "\n", "headers = {\n", " \"Content-Type\": \"application/json\",\n", - " \"Authorization\": \"Bearer {}\".format(bearer_token[0]),\n", + " \"Authorization\": f\"Bearer {bearer_token[0]}\",\n", "}\n", "\n", "# Connect to the server\n", @@ -402,7 +402,7 @@ "\n", "headers = {\n", " \"Content-Type\": \"application/json\",\n", - " \"Authorization\": \"Bearer {}\".format(bearer_token[0]),\n", + " \"Authorization\": f\"Bearer {bearer_token[0]}\",\n", "}\n", "\n", "# Connect to the server\n", @@ -490,7 +490,7 @@ "\n", "headers = {\n", " \"Content-Type\": \"application/json\",\n", - " \"Authorization\": \"Bearer {}\".format(bearer_token[0]),\n", + " \"Authorization\": f\"Bearer {bearer_token[0]}\",\n", "}\n", "\n", "\n", From 3609047243e2b1a67c9dd0d29f333d97524ad403 Mon Sep 17 00:00:00 2001 From: Eric Dong Date: Fri, 13 Dec 2024 12:25:29 -0500 Subject: [PATCH 4/7] Update gemini/multimodal-live-api/intro_multimodal_live_api.ipynb Co-authored-by: Holt Skinner <13262395+holtskinner@users.noreply.github.com> --- gemini/multimodal-live-api/intro_multimodal_live_api.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/gemini/multimodal-live-api/intro_multimodal_live_api.ipynb b/gemini/multimodal-live-api/intro_multimodal_live_api.ipynb index 104378aca55..81553b2230f 100644 --- a/gemini/multimodal-live-api/intro_multimodal_live_api.ipynb +++ b/gemini/multimodal-live-api/intro_multimodal_live_api.ipynb @@ -381,7 +381,7 @@ " - Kore\n", " - Fenrir\n", " - Aoede\n", - "- To specify a voice, set the voice_name within the speech_config object, as part of your session configuration.\n" + "- To specify a voice, set the `voice_name` within the `speech_config` object, as part of your session configuration.\n" ] }, { From ee7ccbc91b9f532c83111b69506a7ed97c14c22b Mon Sep 17 00:00:00 2001 From: Eric Dong Date: Fri, 13 Dec 2024 12:25:36 -0500 Subject: [PATCH 5/7] Update gemini/multimodal-live-api/intro_multimodal_live_api_genai_sdk.ipynb Co-authored-by: Holt Skinner <13262395+holtskinner@users.noreply.github.com> --- .../intro_multimodal_live_api_genai_sdk.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/gemini/multimodal-live-api/intro_multimodal_live_api_genai_sdk.ipynb b/gemini/multimodal-live-api/intro_multimodal_live_api_genai_sdk.ipynb index 60e081733ae..4016ca3ae6f 100644 --- a/gemini/multimodal-live-api/intro_multimodal_live_api_genai_sdk.ipynb +++ b/gemini/multimodal-live-api/intro_multimodal_live_api_genai_sdk.ipynb @@ -258,7 +258,7 @@ "source": [ "## Use the Multimodal Live API\n", "\n", - "Multimodal Live API is a stateful API that uses WebSockets. This section shows some basic examples of how to use Multimodal Live API for text-to-text and text-to-audio generation." + "Multimodal Live API is a stateful API that uses [WebSockets](https://en.wikipedia.org/wiki/WebSocket). 
This section shows some basic examples of how to use Multimodal Live API for text-to-text and text-to-audio generation." ] }, { From d4f1a546356a3f409e37cf6a5e72643ce290244b Mon Sep 17 00:00:00 2001 From: Eric Dong Date: Fri, 13 Dec 2024 12:25:42 -0500 Subject: [PATCH 6/7] Update gemini/multimodal-live-api/intro_multimodal_live_api_genai_sdk.ipynb Co-authored-by: Holt Skinner <13262395+holtskinner@users.noreply.github.com> --- .../intro_multimodal_live_api_genai_sdk.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/gemini/multimodal-live-api/intro_multimodal_live_api_genai_sdk.ipynb b/gemini/multimodal-live-api/intro_multimodal_live_api_genai_sdk.ipynb index 4016ca3ae6f..ab898eb2dcb 100644 --- a/gemini/multimodal-live-api/intro_multimodal_live_api_genai_sdk.ipynb +++ b/gemini/multimodal-live-api/intro_multimodal_live_api_genai_sdk.ipynb @@ -326,7 +326,7 @@ " - Kore\n", " - Fenrir\n", " - Aoede\n", - "- To specify a voice, set the voice_name within the speech_config object, as part of your session configuration.\n" + "- To specify a voice, set the `voice_name` within the `speech_config` object, as part of your session configuration.\n" ] }, { From 6d856853dfbd994fd4559e80350358e4dcb71df7 Mon Sep 17 00:00:00 2001 From: Eric Dong Date: Mon, 16 Dec 2024 09:01:09 -0500 Subject: [PATCH 7/7] Update gemini/multimodal-live-api/intro_multimodal_live_api.ipynb Co-authored-by: Holt Skinner <13262395+holtskinner@users.noreply.github.com> --- gemini/multimodal-live-api/intro_multimodal_live_api.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/gemini/multimodal-live-api/intro_multimodal_live_api.ipynb b/gemini/multimodal-live-api/intro_multimodal_live_api.ipynb index 81553b2230f..b205d9daf0f 100644 --- a/gemini/multimodal-live-api/intro_multimodal_live_api.ipynb +++ b/gemini/multimodal-live-api/intro_multimodal_live_api.ipynb @@ -271,7 +271,7 @@ "source": [ "## Use the Multimodal Live API\n", "\n", - "Multimodal Live API is a stateful API that uses WebSockets. This section shows some basic examples of how to use Multimodal Live API for text-to-text and text-to-audio generation." + "Multimodal Live API is a stateful API that uses [WebSockets](https://en.wikipedia.org/wiki/WebSocket). This section shows some basic examples of how to use Multimodal Live API for text-to-text and text-to-audio generation." ] }, {