Agent SDK - ported on top of components-js primitives #1207
packages/core/src/messages/types.ts
type ReceivedMessageWithType<
  Type extends string,
  Metadata extends {} = {},
> = {
  id: string;
  timestamp: number;

  type: Type;

  from?: Participant;
  attributes?: Record<string, string>;
} & Metadata;

/** @public */
export type ReceivedChatMessage = ReceivedMessageWithType<'chatMessage', ChatMessage & {
  from?: Participant;
  attributes?: Record<string, string>;
}>;

export type ReceivedUserTranscriptionMessage = ReceivedMessageWithType<'userTranscript', {
  message: string;
}>;

export type ReceivedAgentTranscriptionMessage = ReceivedMessageWithType<'agentTranscript', {
  message: string;
}>;

/** @public */
export type ReceivedMessage =
  | ReceivedUserTranscriptionMessage
  | ReceivedAgentTranscriptionMessage
  | ReceivedChatMessage
I ported the existing `ReceivedMessage` abstraction from here on top of the pre-existing `ReceivedChatMessage` - this means that `ReceivedChatMessage` is now a `ReceivedMessage` subtype.
Note that `ReceivedChatMessage` has one new `type` field, which acts as the discriminant key in `ReceivedMessage`, but is otherwise identical. So this should be a fully backwards-compatible change, even though a lot has been updated behind the scenes.
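To illustrate why the `type` field matters, here is a minimal sketch of narrowing a `ReceivedMessage` on its discriminant. `Participant` and `ChatMessage` are simplified local stand-ins (assumptions, not the real library types), and `describeMessage` is a hypothetical helper that is not part of this PR.

```typescript
// Simplified stand-ins so the sketch is self-contained (assumptions; the real
// `Participant` and `ChatMessage` types carry more fields).
type Participant = { identity: string };
type ChatMessage = { message: string };

type ReceivedMessageWithType<Type extends string, Metadata extends {} = {}> = {
  id: string;
  timestamp: number;
  type: Type;
  from?: Participant;
  attributes?: Record<string, string>;
} & Metadata;

type ReceivedChatMessage = ReceivedMessageWithType<'chatMessage', ChatMessage>;
type ReceivedUserTranscriptionMessage = ReceivedMessageWithType<'userTranscript', { message: string }>;
type ReceivedAgentTranscriptionMessage = ReceivedMessageWithType<'agentTranscript', { message: string }>;

type ReceivedMessage =
  | ReceivedUserTranscriptionMessage
  | ReceivedAgentTranscriptionMessage
  | ReceivedChatMessage;

// Hypothetical helper: switching on `type` narrows `msg` to one union member
// per branch, and the compiler checks that every member is handled.
function describeMessage(msg: ReceivedMessage): string {
  switch (msg.type) {
    case 'chatMessage':
      return `chat: ${msg.message}`;
    case 'userTranscript':
      return `user said: ${msg.message}`;
    case 'agentTranscript':
      return `agent said: ${msg.message}`;
  }
}
```

Code that only ever handled `ReceivedChatMessage` keeps working, since the extra `type` field is purely additive.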
packages/react/src/TokenSource.ts
import { RoomConfiguration } from '@livekit/protocol';
import { decodeJwt } from 'jose';

const ONE_SECOND_IN_MILLISECONDS = 1000;
const ONE_MINUTE_IN_MILLISECONDS = 60 * ONE_SECOND_IN_MILLISECONDS;

/**
 * TokenSource handles getting credentials for connecting to a new Room, caching
 * the last result and using it until it expires. */
export abstract class TokenSource {
Note - for now I copied the token source code into here just for experimenting, because as of mid-September 2025 it still hasn't been merged. This will be removed before these changes get published.
/** State representing the current status of the agent, whether it is ready for speech, etc */
export type AgentStateNew = 'unset' | 'initializing' | 'failed' | 'idle' | 'listening' | 'thinking' | 'speaking';
Some thinking needs to be done about this type name and some of its implications. `AgentState` is already being exported from this package by `useVoiceAssistant`, so it can't be the same thing.
Previously I had been using the same `AgentState` value, but @lukasIO pushed back on that in a previous comment, so I split it out into two different values.
To disambiguate: the `AgentState` value here is different because `useVoiceAssistant`'s `AgentState` currently conflates state related to the room connection lifecycle with the agent lifecycle. So `useVoiceAssistant`'s `AgentState` has values like `connecting`, and the "new agent state" omits those but adds states that are specific to the new implementation, like `failed` and `unset` (kind of a catch-all state for when the room isn't fully initialized and the agent hasn't started initializing yet).
So either this needs to be named something completely new, or maybe I need to fold all the old room-connection-related agent state values back into the new value so it is a strict superset, and therefore wouldn't be a backwards-incompatible change?
> Previously I had been using the same AgentState value, but @lukasIO pushed back on that livekit-examples/agent-starter-react#237 (comment) so I split it out into two different values.

Maybe this was a misunderstanding; my comment was meant to distinguish a room connection state (your local connection state, which would indicate local connection problems) from an agent state.
Can you elaborate on when `failed` and `unset` would be readable?
If an agent isn't yet present, the `agent` object wouldn't be present either, right?
> If an agent isn't yet present, the agent object wouldn't be present either, right?

This is incorrect - because `useAgent` is a hook, it is always possible for a user to call it no matter the state of the conversation. I suppose `useAgent(conversationWithDisconnectedRoom)` could return `null` in this case, but I opted to always return an object so it could be destructured rather than having to deal with the optional property accesses that returning a `null` at the root would necessitate. This is what the `unset` value is used to represent.
The `useVoiceAssistant` hook doesn't have this problem, because in effect its "unset" value is instead proxying the `room.connectionState` value - i.e., conflating the two concepts. Here's how the two statuses interrelate.

> Can you elaborate on when failed and unset would be readable ?

If a room isn't yet fully connected, then right now an agent's state is `unset`. So for example:
const conversation = useConversationWith('agent name to dispatch', { tokenSource });
// Note: no room connection logic is happening here, so conversation.connectionState is "disconnected"
const agent = useAgent(conversation);
console.log(agent.state); // "unset"
If an agent never connects, then the state can go into `failed` (I left it generic so more "failures" could eventually be captured, but right now it's just agent timeout). For example:
const conversation = useConversationWith('agent name to dispatch', { tokenSource });
useEffect(() => { conversation.start() }, []); // Connect to room / dispatch agent
const agent = useAgent(conversation);
// Initially:
console.log(agent.state); // "connecting"
// After a delay, this could potentially happen:
console.log(agent.state); // "failed"
console.log(agent.failureReasons); // ["Agent did not join the room."]
> but I opted to always return an object so it could be destructured rather than having to deal with the optional property accesses that returning a null at the root would necessitate.

The main upside I see with the approach you're taking is that it would allow for "pretending" an agent is present in preconnect buffer usage.
However, this would in turn not make sense if, when the preconnect buffer is enabled, the value would be `unset` and not `listening`.
iiuc this simply shifts the problem from optional access to a state check?
Given that this also means that you'd have to check against two "non available" states of the agent, this pattern isn't super obvious to me.
```tsx
// Option 1
const agent = useAgent(conversation) as Agent | undefined;
return <>{agent ? <MyAgentComponent camera={agent.camera} /> : <p>Waiting for agent to connect</p>}</>;
```
```tsx
// Option 2
const agent = useAgent(conversation) as Agent;
return <>{agent.state !== 'failed' && agent.state !== 'unset' ? <MyAgentComponent camera={agent.camera} /> : <p>Waiting for agent to connect</p>}</>;
```
I think what I have been opting to do in that case is something closer to option 1, but checking the agent camera directly rather than using a state check, since, as you have identified, including / excluding certain states like in option 2 is awkward. So effectively I'm "pushing down" that `null` further than what you are doing in option 1.
So like:
const agent = useAgent(conversation);
// or:
// const { camera } = useAgent(conversation);
return (
  <>
    {agent.camera ? (
      <MyAgentComponent camera={agent.camera} />
    ) : (
      <p>Waiting for agent...</p>
    )}
  </>
);
Also keep in mind that before an agent is fully connected, there may be other properties available on it other than `camera` - returning `null` like you are doing above means that there's no way to access those other values. But if you don't return a top-level `null`, then an object has to be returned, which results in this `AgentState` problem as the other side of that tradeoff.
what other properties would those be?
the `camera` option is just an example. Imagine a text-only agent that's not publishing anything itself, but you'd still want to check for its presence before doing/rendering some component
If you want to check for the agent's presence, that's where the `is`-prefixed boolean properties come in, which effectively check sets of states. In particular I think `isAvailable` is what you are looking for in this case. So:
const agent = useAgent(conversation);
return (
  <>
    {agent.isAvailable ? (
      <p>Agent is ready for user interaction</p>
    ) : (
      <p>Waiting for agent...</p>
    )}
  </>
);
I largely copied this pattern from `react-query`, fwiw: https://tanstack.com/query/latest/docs/framework/react/reference/useQuery
(it's worth noting that `agent.camera` still remains nullable even when `agent.isAvailable` is asserted to be true, since an agent may not actually emit any sort of video. So what I mentioned in my last message is still probably the right thing to do for rendering a `VideoTrack` like that)
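As a rough sketch of how an `is`-prefixed boolean can be derived from a set of states (similar in spirit to react-query's status flags): the state list is the `AgentStateNew` type quoted earlier in this thread, but which states count as "available" is an assumption made purely for illustration, and the standalone `isAvailable` function is hypothetical rather than the PR's implementation.

```typescript
type AgentStateNew = 'unset' | 'initializing' | 'failed' | 'idle' | 'listening' | 'thinking' | 'speaking';

// ASSUMPTION: treating only the conversational states as "available" is a guess
// for this sketch; the PR may group the states differently.
function isAvailable(state: AgentStateNew): boolean {
  switch (state) {
    case 'listening':
    case 'thinking':
    case 'speaking':
      return true;
    default:
      return false;
  }
}
```

The point of the pattern is that callers check one boolean instead of enumerating which states to include or exclude at every call site.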
Discussed with @lukasIO on a call - we decided to migrate `unset` -> `connecting`, and then add the `failed` status into the existing `AgentState` list.
Here is what I have now:
/** @see https://github.com/livekit/agents/blob/65170238db197f62f479eb7aaef1c0e18bfad6e7/livekit-agents/livekit/agents/voice/events.py#L97 */
type AgentSdkStates = 'initializing' | 'idle' | 'listening' | 'thinking' | 'speaking';
/**
 * State representing the current status of the agent, whether it is ready for speech, etc
*
* For most agents (which have the preconnect audio buffer feature enabled), this is the lifecycle:
* connecting -> listening -> listening/thinking/speaking
*
* For agents without the preconnect audio feature enabled:
* connecting -> initializing -> idle/listening/thinking/speaking
*
* If an agent fails to connect:
* connecting -> listening/initializing -> failed
*
* Legacy useVoiceAssistant hook:
* disconnected -> connecting -> initializing -> listening/thinking/speaking
* */
export type AgentState = 'disconnected' | 'connecting' | 'failed' | AgentSdkStates;
Also, we decided that adding an `isPreConnectBufferEnabled`-type boolean for when the agent is in the `listening` state would probably be important - most use cases probably wouldn't care about this, but it seems like it would be good to be able to know in a subset of them.
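One practical consequence of folding everything into a single `AgentState` union is that consumers can handle it exhaustively. A minimal sketch, reusing the type definitions quoted above; `statusLabel` and its label strings are hypothetical and not part of the PR:

```typescript
type AgentSdkStates = 'initializing' | 'idle' | 'listening' | 'thinking' | 'speaking';
type AgentState = 'disconnected' | 'connecting' | 'failed' | AgentSdkStates;

// Hypothetical helper: maps each state to a user-facing label. The `never`
// assignment in the default branch makes the compiler flag any state that is
// added to the union later but not handled here.
function statusLabel(state: AgentState): string {
  switch (state) {
    case 'disconnected':
      return 'Not connected';
    case 'connecting':
      return 'Connecting to agent...';
    case 'failed':
      return 'Agent failed to connect';
    case 'initializing':
      return 'Agent is starting up';
    case 'idle':
      return 'Agent is idle';
    case 'listening':
      return 'Agent is listening';
    case 'thinking':
      return 'Agent is thinking';
    case 'speaking':
      return 'Agent is speaking';
    default: {
      const unreachable: never = state;
      return unreachable;
    }
  }
}
```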
this generally sounds good to me, two minor things:
- for the use case you outlined, `isPreConnectBufferEnabled` isn't ideal, as users cannot easily differentiate between different `listening` states. I think a flag that indicates that we're in a preliminary listening state would be preferable.
- > For most agents (which have the preconnect audio buffer feature enabled), this is the lifecycle: connecting -> listening -> listening/thinking/speaking

  to clarify, why is `initializing` missing here? I think the agent could still set `initializing` on itself? 🤔
> for the use case you outlined isPreConnectBufferEnabled isn't ideal as users cannot differentiate between different listening states easily. I think a flag that indicates that we're in a preliminary listening state would be preferable.

Discussed on call - what I have now actually has this behavior, sort of by accident 😬, but we couldn't come up with a good name for it at the time. With some more pondering I'm thinking maybe `isBufferingSpeech`, so `agent.state === "listening" && agent.isBufferingSpeech`. I'm going to go with that unless I hear objections to that name.

> For most agents (which have the preconnect audio buffer feature enabled), this is the lifecycle: connecting -> listening -> listening/thinking/speaking – to clarify, why is initializing missing here? I think the agent could still set initializing on itself ? 🤔

Also discussed on a call - I didn't include it because I think in (almost?) all cases the preconnect buffer listening would occur while that initialization step is running, but I suppose it technically is a race condition and could be possible, so I updated the comment to add it in.
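For reference, a small sketch of how a consumer might tell the preliminary (buffering) listening phase apart from regular listening using the proposed flag. `AgentSnapshot` is a simplified stand-in for whatever `useAgent` returns (an assumption), and both helper functions are hypothetical.

```typescript
type AgentState =
  | 'disconnected' | 'connecting' | 'failed'
  | 'initializing' | 'idle' | 'listening' | 'thinking' | 'speaking';

// Simplified stand-in for the object returned by `useAgent` (assumption: only
// the two fields discussed in this thread are modelled).
type AgentSnapshot = { state: AgentState; isBufferingSpeech: boolean };

// True only in the preliminary listening phase, i.e. while speech is being
// buffered before the agent is actually processing it.
function isPreliminaryListening(agent: AgentSnapshot): boolean {
  return agent.state === 'listening' && agent.isBufferingSpeech;
}

// True when the agent is listening and not buffering.
function isRegularListening(agent: AgentSnapshot): boolean {
  return agent.state === 'listening' && !agent.isBufferingSpeech;
}
```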
how about proxying some of these on the return value of |
I opted to leave that out for now because of the hesitancy from ben/dz around new track abstractions. That being said, I mentioned in the comment above that I had been proposing a new hook. It sounds like you are pushing for that to exist now vs deferring it? If so I can add that new hook to this branch or possibly figure out how to fit it into |
nice, i like the direction where this is going.
};

type ConversationStateCommon = {
  subtle: {
not sure about this naming. what is the intent for exposing it this way?
This was largely my ask, Ryan also felt it was a bit awkward.
The idea behind it:
- expose high level APIs directly on the return type, but hide "advanced" use cases behind a common name
- while I agree the name doesn't feel super obvious, `subtle` as a name stems from the prior use of it in that context e.g.
are we actually simplifying it by putting these things in another sub namespace? I could go either way on this, but initially it seems to be higher in cognitive load.. wondering why things are done one way vs another.
yeah, it's definitely not simplifying things if we expect users to regularly be using the things under the sub namespace.
It only makes sense to introduce it if we treat the things under the sub namespace as a mere escape hatch for users who know what they're doing.
IMO we either find a more meaningful grouping, or just keep them flat. the "advanced" vs "not advanced" line is pretty hard to draw.
for some of them (e.g. the emitter) I think it would be good to be hidden under a name after discussing this more with @1egoman .
some alternative suggestions:
- `advanced`
- `raw`
- `internal`
Did two things to try to address this:
- Renamed `subtle` everywhere to `internal` - this key contains values that need to be exported so other hooks / logic can tap into it (so they can't just be internal implementation details of the hook / not exposed in the return value), but are not expected for users to actually use (though they could in theory use them, but with similar types of pitfalls as accessing `_`-prefixed properties in python)
- Moved `room` from `useSession().subtle.room` to `useSession().room`, since this is expected for users to access / is not internal.
This allows functions to limit whether they just want to take in TrackReferences from a given source - i.e., the VideoTrack could be made to only accept TrackReference<Track.Source.Camera | Track.Source.ScreenShare | Track.Source.Unknown>.
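A minimal sketch of the idea, using local stand-ins: `Source` mimics `Track.Source`, and the `TrackReference` shape here is heavily simplified (both are assumptions for illustration, as are the `describeVideoTrack` helper and the `VideoSource` alias).

```typescript
// Stand-in for livekit-client's `Track.Source` enum (assumption: only the
// members needed for this sketch are listed).
enum Source {
  Camera = 'camera',
  Microphone = 'microphone',
  ScreenShare = 'screen_share',
  Unknown = 'unknown',
}

// Simplified stand-in for TrackReference, parameterized by its source. The
// default keeps existing un-parameterized usages compiling.
type TrackReference<S extends Source = Source> = {
  participantIdentity: string;
  source: S;
};

type VideoSource = Source.Camera | Source.ScreenShare | Source.Unknown;

// Hypothetical consumer that only accepts video-capable track references.
function describeVideoTrack(ref: TrackReference<VideoSource>): string {
  return `video from ${ref.participantIdentity} (${ref.source})`;
}

const cameraRef: TrackReference<Source.Camera> = {
  participantIdentity: 'agent',
  source: Source.Camera,
};
const cameraDescription = describeVideoTrack(cameraRef); // accepted

const micRef: TrackReference<Source.Microphone> = {
  participantIdentity: 'agent',
  source: Source.Microphone,
};
// describeVideoTrack(micRef); // rejected at compile time: Microphone is not a VideoSource
```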
Note that just the return values are changing, not the argument definitions in other spots, so this shouldn't be a backwards compatibility issue.
…viously didn't accept it
The pre-existing state was broken.
…in useParticipantTracks
…s going to be important for future multi-agent type use cases
…enSourceOptions are always being checked for equality
This change comprises the new client agents sdk, a set of react hooks that are being built to make interaction with the livekit agents framework less complex.
This is version 3 - version 1 can be found here, and version 2 can be found here. At each step it has evolved significantly based on comments and perspectives from people who have taken a look!
Single file example
New API surface area
- `useConversationWith(agentName: string, options: UseConversationWithOptions): Conversation`
  A thin wrapper around a `Room` which handles connecting to a room and dispatching a given agent into that room (or in the future, maybe multiple agents?). In the future it will probably become thicker as more global agent state is required.
- `useAgent(conversation: Conversation): Agent`
  A much more advanced version of the previously existing `useVoiceAssistant` hook - tracks the agent's state within the conversation, manages agent connection timeouts / other failures, and largely maintains backwards compatibility with existing interfaces.
- `useConversationMessages`
  A mechanism for interacting with `ReceivedMessage`s across the whole conversation. A `ReceivedMessage` can be a `ReceivedChatMessage` (already exists today), or a `ReceivedUserTranscriptionMessage` / `ReceivedAgentTranscriptionMessage` (both brand new). This is exposed at the conversation level so that in a future world where multiple agents are within a conversation, this hook will return messages from all of them.

Additional refactoring / cleanup
- `ParticipantAgentAttributes` constant and ported all usages of `lk.`-prefixed attributes (which previously were just magic strings in the code) to refer to this enum.
- `handleMediaDeviceError` callback function in `useLiveKitRoom`
- `room` parameter to a few hooks and components that didn't support it previously, to make single file example type scenarios easier:
  - `RoomAudioRenderer`
  - `StartAudio`
  - `useChat`
  - `useTextStream`
  - `useTrackToggle`
  - `useTranscriptions`