
Python: draft initial implementation of Realtime API #10127

Draft · wants to merge 23 commits into main
Conversation

eavanvalkenburg (Member)

Motivation and Context

Implements the OpenAI Realtime API with Semantic Kernel

Description

Implements a separate service client class with its own ExecutionSettings, still based on ChatCompletionClientBase.
Only supports streaming operations, with additional public methods for sending data to the conversation.
TBD whether that is the way to move forward with it.

TODO:

  • lots of comments
  • tests
  • cleanup

Contribution Checklist

@eavanvalkenburg eavanvalkenburg requested a review from a team as a code owner January 8, 2025 16:04
@eavanvalkenburg eavanvalkenburg marked this pull request as draft January 8, 2025 16:04
@markwallace-microsoft markwallace-microsoft added the python Pull requests for the Python Semantic Kernel label Jan 8, 2025
markwallace-microsoft (Member) commented Jan 9, 2025

Python Test Coverage

Python Test Coverage Report

| File | Stmts | Miss | Cover | Missing |
|------|------:|-----:|------:|---------|
| **semantic_kernel/connectors/ai** | | | | |
| chat_completion_client_base.py | 127 | 2 | 98% | 408, 418 |
| function_calling_utils.py | 51 | 10 | 80% | 156–181 |
| realtime_client_base.py | 31 | 9 | 71% | 12, 41, 60–62, 134, 141–142, 146 |
| **semantic_kernel/connectors/ai/open_ai/services** | | | | |
| open_ai_realtime.py | 30 | 10 | 67% | 28–30, 72–84 |
| **semantic_kernel/connectors/ai/open_ai/services/realtime** | | | | |
| open_ai_realtime_base.py | 97 | 49 | 49% | 65–108, 118–134, 138–142, 148, 155–181, 187, 191, 197, 201 |
| open_ai_realtime_webrtc.py | 170 | 131 | 23% | 67–68, 72–156, 167–215, 220–227, 232–263, 271–279, 283–302 |
| open_ai_realtime_websocket.py | 114 | 80 | 30% | 57–86, 90–179, 189–192, 197–200 |
| utils.py | 7 | 4 | 43% | 20–26, 36 |
| **semantic_kernel/connectors/ai/utils** | | | | |
| \_\_init\_\_.py | 2 | 2 | 0% | 3–5 |
| realtime_helpers.py | 129 | 129 | 0% | 3–218 |
| **semantic_kernel/contents** | | | | |
| audio_content.py | 25 | 2 | 92% | 81, 86 |
| binary_content.py | 106 | 9 | 92% | 80, 119, 137–138, 179–183 |
| function_call_content.py | 106 | 2 | 98% | 197, 225 |
| streaming_chat_message_content.py | 71 | 1 | 99% | 227 |
| **semantic_kernel/contents/utils** | | | | |
| data_uri.py | 101 | 4 | 96% | 44–45, 63, 128 |
| **TOTAL** | 17434 | 2213 | 87% | |

Python Unit Test Overview

| Tests | Skipped | Failures | Errors | Time |
|------:|--------:|---------:|-------:|-----:|
| 3010 | 4 💤 | 0 ❌ | 0 🔥 | 1m 10s ⏱️ |

```python
content = data["item"]
for item in content.items:
    match item:
        case TextContent():
```
Contributor:

There looks to be similar logic for handling the `.CONVERSATION_ITEM_CREATE` SK types in both this class and the webrtc class. Would it be worth creating a shared helper (maybe in OpenAIRealtimeBase, or in a utils module) to remove some duplication?

Member Author:

I thought about that, but since the websocket implementation calls methods in the OpenAI package on the next line, while webrtc sends dicts to the data channel, a shared helper would mostly complicate typing etc.
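To make the divergence concrete, here is a rough, hypothetical sketch (none of these helper names or the SDK attribute chain come from the PR): building the payload is easy to share, but the websocket path hands a typed call to the OpenAI SDK connection while the webrtc path serializes a plain dict onto the data channel.

```python
# Illustrative only: why a shared CONVERSATION_ITEM_CREATE helper is awkward.
# build_conversation_item, send_websocket_event, and send_webrtc_event are
# hypothetical names, not from the PR.
import json
from typing import Any


def build_conversation_item(text: str) -> dict[str, Any]:
    # The payload construction itself is trivially shareable.
    return {
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "role": "user",
            "content": [{"type": "input_text", "text": text}],
        },
    }


async def send_websocket_event(connection: Any, text: str) -> None:
    # Websocket path: the SDK connection exposes typed send methods
    # (attribute chain shown here is an assumption for illustration).
    await connection.conversation.item.create(
        item=build_conversation_item(text)["item"]
    )


def send_webrtc_event(data_channel: Any, text: str) -> str:
    # WebRTC path: the data channel only accepts serialized strings/bytes.
    payload = json.dumps(build_conversation_item(text))
    data_channel.send(payload)
    return payload
```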

```python
pass

@override
async def start_sending(self, **kwargs: Any) -> None:
```
Contributor:

In the start_sending child classes: is there anything we'd need to add to better handle shutdowns, or the case where there is no more data to handle?

Member Author:

Both sending and listening are loops that run until they are stopped, and both just react to things coming in. We could look at somewhat nicer propagation of a session close, but for sending that would just mean the queue is empty, and the queue can also be empty mid-session while the service keeps sending until we call close, so we don't want to stop in that case either (I think).
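The point about "empty queue" not meaning "done" can be sketched as a minimal send loop (this is an assumption of how such a loop could look, not the PR's code): the loop blocks on the queue while it is empty and only exits on an explicit close sentinel.

```python
# Minimal sketch (not the PR's implementation) of a sending loop that runs
# until explicitly closed, rather than stopping when the queue is merely empty.
import asyncio

_CLOSE = object()  # sentinel enqueued by a hypothetical close() to end the loop


async def start_sending(queue: asyncio.Queue, send) -> None:
    while True:
        event = await queue.get()  # blocks while the queue is empty mid-session
        if event is _CLOSE:
            break
        await send(event)


async def demo() -> list:
    sent = []

    async def send(e):
        sent.append(e)

    q: asyncio.Queue = asyncio.Queue()
    for e in ("a", "b"):
        q.put_nowait(e)
    q.put_nowait(_CLOSE)  # without this, the loop would wait forever
    await start_sending(q, send)
    return sent
```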

- developer judgement needs to be made (or exposed with parameters) on what is returned through the async generator and what is passed to the event handlers

### 2. Event buffers/queues that are exposed to the developer, start sending and start receiving methods, that just initiate the sending and receiving of events and thereby the filling of the buffers
This would mean that the there are two queues, one for sending and one for receiving, and the developer can listen to the receiving queue and send to the sending queue. Internal things like parsing events to content types and auto-function calling are processed first, and the result is put in the queue, the content type should use inner_content to capture the full event and these might add a message to the send queue as well.
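The two-queue option above could be sketched roughly as follows (class and field names are illustrative assumptions, not the PR's API): internal processing parses each raw service event into a content object first, preserving the full event in inner_content, and only then exposes it on the receive queue.

```python
# Rough sketch of option 2: two developer-facing queues, with internal
# event-to-content parsing before anything reaches the receive queue.
# RealtimeContent and TwoQueueClient are hypothetical names.
import asyncio
from dataclasses import dataclass
from typing import Any


@dataclass
class RealtimeContent:
    text: str
    inner_content: Any  # full raw event, preserved for the developer


class TwoQueueClient:
    def __init__(self) -> None:
        self.send_queue: asyncio.Queue = asyncio.Queue()     # developer -> service
        self.receive_queue: asyncio.Queue = asyncio.Queue()  # service -> developer

    async def _on_service_event(self, event: dict) -> None:
        # Internal step: parse the raw event into a content type first;
        # the developer listens on receive_queue and never sees raw events
        # except through inner_content.
        content = RealtimeContent(text=event.get("text", ""), inner_content=event)
        await self.receive_queue.put(content)
```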
Contributor:

Suggested change

```diff
- This would mean that the there are two queues, one for sending and one for receiving, and the developer can listen to the receiving queue and send to the sending queue. Internal things like parsing events to content types and auto-function calling are processed first, and the result is put in the queue, the content type should use inner_content to capture the full event and these might add a message to the send queue as well.
+ This would mean that there are two queues, one for sending and one for receiving, and the developer can listen to the receiving queue and send to the sending queue. Internal things like parsing events to content types and auto-function calling are processed first, and the result is put in the queue, the content type should use inner_content to capture the full event and these might add a message to the send queue as well.
```

docs/decisions/00XX-realtime-api-clients.md (resolved review comments)

# Content and Events

## Considered Options - Content and Events
Contributor:

Should we call out whether the “control” versus “content” distinction is a fundamental part of real-time interaction or just an implementation detail? For example, OpenAI distinguishes control events (input_audio_buffer.committed) from content events (conversation.item.create), while Google appears to treat everything as part of a unified content stream (BidiGenerateContent*).

This distinction might influence our decision in a few ways:

  • If the distinction is inherent to real-time systems, separating control from content may result in a cleaner, more flexible design.
  • However, if it’s just a specific quirk of OpenAI’s API, enforcing it could complicate support for providers like Google that don’t make the same distinction.
  • On the other hand, ignoring OpenAI’s finer-grained controls might limit the ability to fully utilize other features in the future.

I think it would make sense to call this out explicitly in the doc; it could provide additional context for why we're choosing one approach over the other.
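One way the comment's distinction could be made explicit without forcing it on every provider is a small classifier (a sketch, not a proposal from the doc): OpenAI event types that are clearly control-plane are tagged, and providers that treat everything as one content stream simply never produce control events. The membership of the set below is an assumption for illustration.

```python
# Illustrative only: tag events as "control" vs "content". Event type strings
# follow OpenAI's realtime event naming; which types count as control is an
# assumed mapping, not taken from the ADR.
CONTROL_EVENT_TYPES = {
    "input_audio_buffer.committed",
    "session.updated",
}


def classify_event(event_type: str) -> str:
    # A provider like Google that treats everything as a unified content
    # stream (BidiGenerateContent*) would simply never yield "control" here.
    return "control" if event_type in CONTROL_EVENT_TYPES else "content"
```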

Labels: documentation, python (Pull requests for the Python Semantic Kernel)

4 participants