Python: draft initial implementation of Realtime API #10127
base: main
Conversation
python/semantic_kernel/connectors/ai/open_ai/services/realtime/open_ai_realtime_base.py
```python
content = data["item"]
for item in content.items:
    match item:
        case TextContent():
```
There looks to be similar logic for handling the .CONVERSATION_ITEM_CREATE SK types in both this class and the webrtc class. Would it be worth it to create a shared helper (maybe in OpenAIRealtimeBase, or in a utils module) to remove some duplication?
Thought about that, but since websockets call methods in the OpenAI package in the next line, while webrtc sends dicts to the data channel, a shared helper would mostly complicate the typing etc.
```python
        pass

    @override
    async def start_sending(self, **kwargs: Any) -> None:
```
In the start_sending child classes: is there anything we'd need to add to better handle shutdowns? Or when there is no more data to handle?
Both sending and listening are loops that run until they are stopped, and both just react to incoming items. We could look at a nicer propagation of a session close, but for sending that would just mean the queue is empty, and the queue can also be empty during a live session while the service keeps sending until we call close. So we don't want to stop on an empty queue either (I think).
- developer judgement needs to be made (or exposed with parameters) on what is returned through the async generator and what is passed to the event handlers

### 2. Event buffers/queues that are exposed to the developer, start sending and start receiving methods, that just initiate the sending and receiving of events and thereby the filling of the buffers

This would mean that the there are two queues, one for sending and one for receiving, and the developer can listen to the receiving queue and send to the sending queue. Internal things like parsing events to content types and auto-function calling are processed first, and the result is put in the queue, the content type should use inner_content to capture the full event and these might add a message to the send queue as well.
Suggested change:
This would mean that there are two queues, one for sending and one for receiving, and the developer can listen to the receiving queue and send to the sending queue. Internal things like parsing events to content types and auto-function calling are processed first, and the result is put in the queue, the content type should use inner_content to capture the full event and these might add a message to the send queue as well.
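The two-queue option described above could be sketched roughly as follows. All names and event shapes here are hypothetical, chosen only to illustrate the flow of "parse first, enqueue the result, optionally enqueue a reply":

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any


@dataclass
class RealtimeSession:
    """Minimal sketch of the two-queue option (illustrative names only).

    Raw service events are parsed first; the parsed result goes on
    receive_queue with the full raw event kept as inner_content, and
    internal handling (e.g. auto function calling) may enqueue a reply
    on send_queue as well.
    """

    receive_queue: asyncio.Queue = field(default_factory=asyncio.Queue)
    send_queue: asyncio.Queue = field(default_factory=asyncio.Queue)

    async def _on_raw_event(self, raw: dict[str, Any]) -> None:
        # Parse to a content-like shape, keeping the raw event attached.
        parsed = {"content": raw.get("text", ""), "inner_content": raw}
        await self.receive_queue.put(parsed)
        if raw.get("type") == "response.function_call_arguments.done":
            # Auto function calling: push the tool result back to the service.
            await self.send_queue.put({"type": "conversation.item.create"})
```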
# Content and Events

## Considered Options - Content and Events
Should we call out whether the “control” versus “content” distinction is a fundamental part of real-time interaction or just an implementation detail? For example, OpenAI distinguishes control events (input_audio_buffer.committed) from content events (conversation.item.create), while Google appears to treat everything as part of a unified content stream (BidiGenerateContent*).
This distinction might influence our decision in a few ways:
- If the distinction is inherent to real-time systems, separating control from content may result in a cleaner, more flexible design.
- However, if it’s just a specific quirk of OpenAI’s API, enforcing it could complicate support for providers like Google that don’t make the same distinction.
- On the other hand, ignoring OpenAI’s finer-grained controls might limit the ability to fully utilize other features in the future.
I think it would make sense to call this out explicitly in the doc; it could provide additional context for why we’re choosing one approach over the other.
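One way to keep the distinction available without enforcing it on every provider could be an event envelope that carries the provider's raw event name plus an advisory control flag. A hypothetical sketch (the two event names come from OpenAI's Realtime API; everything else is invented for illustration):

```python
from dataclasses import dataclass
from typing import Any

# Advisory set of OpenAI event types treated as control rather than content.
CONTROL_TYPES = {"input_audio_buffer.committed", "session.updated"}


@dataclass
class RealtimeEvent:
    """One envelope for all providers: OpenAI's control/content split can be
    expressed via is_control, while providers with a unified content stream
    (e.g. Google's BidiGenerateContent*) can simply leave it False."""

    service_type: str          # provider-specific event name, kept verbatim
    content: Any = None        # parsed content payload, if any
    is_control: bool = False   # advisory flag; consumers may ignore it


def classify_openai_event(raw: dict[str, Any]) -> RealtimeEvent:
    """Wrap a raw OpenAI event, tagging known control events."""
    return RealtimeEvent(
        service_type=raw["type"],
        content=raw.get("item"),
        is_control=raw["type"] in CONTROL_TYPES,
    )
```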
Motivation and Context
Implements the OpenAI Realtime API with Semantic Kernel
Description
Implements a separate Service Client class with its own ExecutionSettings, but still based on ChatCompletionClientBase.
Only supports streaming operations, with additional public methods for sending data to the conversation.
TBD if that is the way to move forward with it.
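To make the streaming-only shape concrete, here is a toy stand-in: a receive stream consumed with `async for`, plus a public send method used alongside it. The class and method names are invented for illustration and do not match the PR's actual API.

```python
import asyncio
from typing import Any, AsyncIterator


class FakeRealtimeService:
    """Toy stand-in for a streaming-only realtime client (hypothetical names)."""

    def __init__(self) -> None:
        self.sent: list[Any] = []

    async def send(self, event: Any) -> None:
        # A real client would forward this to the websocket / data channel.
        self.sent.append(event)

    async def receive(self) -> AsyncIterator[str]:
        # A real client would yield parsed events from the service here.
        for event in ("audio.delta", "text.delta", "response.done"):
            yield event


async def demo() -> tuple[list[Any], list[Any]]:
    service = FakeRealtimeService()
    received = []
    async for event in service.receive():   # streaming-only consumption
        received.append(event)
        if event == "text.delta":
            await service.send("ack")       # public send alongside the stream
    return received, service.sent
```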
TODO:
Contribution Checklist