-
Notifications
You must be signed in to change notification settings - Fork 205
Agent SDK Prototype - React layer #268
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…ced than I thought...
… like the receive end
…on the root AgentSession
…owing up in agent starter app
…gentLocalParticipant
The latest updates on your projects. Learn more about Vercel for GitHub.
|
export const createUseAgentSession = (options: AgentSessionOptions) => create<AgentSessionInstance>((set, get) => { | ||
return createAgentSession(options, get, set); | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of providing a large list of smaller hooks, I opted to provide one single main hook useAgentSession
which can be used to access the agent state and all user facing actions.
General notes
- Behind the scenes, this is a zustand store, so
const connect = useAgentSession(session => session.connect);
can be used to get a smaller slice of the inner state. - To implement "component level" hooks for the future agents components, one would effectively just write a mapping function:
const useMyFancyComponent = () => useAgentSession(session => [session.connect, session.disconnect]);
. - While the interface here is primarily targeted towards react, it is possible to make a zustand store outside of a react context - see here. So if supporting non react platforms is important in the future (vanilla or other frameworks), that becomes the means to do it!
Backwards compatibility
At the moment this abstraction is optimized for use with the new agents specific components currently being discussed, or fully LLM generated components. However, if you wanted to use the connection management logic from this with existing components in components-js
, there's a pretty straightforward escape hatch:
import { RoomContext } from '@livekit/components-react';
// later, in a component:
const agentSession = useAgentSession();
return (
<RoomContext.Provider value={agentSession.subtle.room}>
{/* rest of the app here */}
</RoomContext.Provider>
);
It is unclear if this needs to be made more official and how that might be done if that needed to happen!
type AgentSessionInstanceCommon = { | ||
[Symbol.toStringTag]: "AgentSessionInstance", | ||
|
||
credentials: ConnectionCredentialsProvider; | ||
|
||
/** Returns a promise that resolves once the room connects. */ | ||
waitUntilConnected: (signal?: AbortSignal) => void; | ||
/** Returns a promise that resolves once the room disconnects */ | ||
waitUntilDisconnected: (signal?: AbortSignal) => void; | ||
|
||
agentConnectTimeout: { | ||
delayInMilliseconds: number; | ||
timeoutId: NodeJS.Timeout | null; | ||
} | null, | ||
|
||
prepareConnection: () => Promise<void>, | ||
connect: (options?: AgentSessionConnectOptions) => Promise<void>; | ||
disconnect: () => Promise<void>; | ||
|
||
canPlayAudio: boolean; | ||
startAudio: () => Promise<void>; | ||
|
||
subtle: { | ||
emitter: TypedEventEmitter<AgentSessionCallbacks>; | ||
room: Room; | ||
}; | ||
}; | ||
|
||
type AgentSessionInstanceConnecting = AgentSessionInstanceCommon & { | ||
connectionState: "connecting"; | ||
isConnected: false; | ||
isReconnecting: false; | ||
|
||
agent: AgentInstance | null; | ||
local: LocalInstance | null; | ||
messages: MessagesInstance | null; | ||
}; | ||
|
||
type AgentSessionInstanceConnected = AgentSessionInstanceCommon & { | ||
connectionState: "connected" | "reconnecting" | "signalReconnecting"; | ||
isConnected: true; | ||
isReconnecting: boolean; | ||
|
||
agent: AgentInstance; | ||
local: LocalInstance; | ||
messages: MessagesInstance; | ||
}; | ||
|
||
type AgentSessionInstanceDisconnected = AgentSessionInstanceCommon & { | ||
connectionState: "disconnected"; | ||
isConnected: false; | ||
isReconnecting: false; | ||
|
||
agent: null; | ||
local: null; | ||
messages: null; | ||
}; | ||
|
||
export type AgentSessionInstance = AgentSessionInstanceConnecting | AgentSessionInstanceConnected | AgentSessionInstanceDisconnected; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what the top level AgentSession
state looks like - ie, what useAgentSession()
would return by default.
export function useAgentEvents< | ||
Emitter extends TypedEventEmitter<EventMap>, | ||
EmitterEventMap extends (Emitter extends TypedEventEmitter<infer EM> ? EM : never), | ||
Event extends Parameters<Emitter["on"]>[0], | ||
Callback extends EmitterEventMap[Event], | ||
>( | ||
instance: { subtle: { emitter: Emitter } }, | ||
event: Event, | ||
handlerFn: Callback | undefined, | ||
dependencies?: React.DependencyList | ||
) { | ||
const fallback = useMemo(() => () => {}, []); | ||
const wrappedCallback = useCallback(handlerFn ?? fallback, dependencies ?? []); | ||
const callback = dependencies ? wrappedCallback : handlerFn; | ||
|
||
useEffect(() => { | ||
if (!callback) { | ||
return; | ||
} | ||
instance.subtle.emitter.on(event, callback); | ||
return () => { | ||
instance.subtle.emitter.off(event, callback); | ||
}; | ||
}, [instance.subtle.emitter, event, callback]); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition to the main useAgentSession
hook, there is one other hook exposed, useAgentEvents
. This allows one to listen to events from within the agent sdk infrastructure, as these don't really fit into the existing declarative / zustand based paradigm.
For example:
const agentSession = useAgentSession();
useAgentEvents(agentSession.local.camera, LocalTrackEvent.ActiveDeviceChangeError, () => {
/* handle event here - switching camera devices failed */
});
Behind the scenes, this function accepts any object which has a .subtle.emitter
property and effectively acts as a declarative way to call .on()
/ .off()
.
export const AgentVideoTrack: React.FunctionComponent<{ | ||
className?: string, | ||
track: LocalTrackInstance<Track.Source.Camera | Track.Source.ScreenShare> | RemoteTrackInstance<Track.Source.Camera | Track.Source.ScreenShare>, | ||
} & React.HTMLAttributes<HTMLVideoElement>> = ({ track, ...rest }) => { | ||
// FIXME: imperative handle logic | ||
const mediaElementRef = useRef<HTMLVideoElement>(null); | ||
|
||
useEffect(() => { | ||
if (!mediaElementRef.current) { | ||
return; | ||
} | ||
const mediaElement = mediaElementRef.current; | ||
|
||
if (!track.enabled) { | ||
return; | ||
} | ||
|
||
let cleanup: (() => void) | null = null; | ||
(async () => { | ||
if (!track.isLocal) { | ||
// FIXME: intersection observer logic | ||
track.setSubscribed(true); | ||
await track.waitUntilSubscribed(); // FIXME: move inside of attachToMediaElement | ||
} | ||
|
||
cleanup = track.attachToMediaElement(mediaElement); | ||
})() | ||
|
||
return () => { | ||
if (!track.isLocal) { | ||
track.setSubscribed(false); | ||
} | ||
cleanup?.(); | ||
}; | ||
}, [track]); | ||
|
||
return ( | ||
<video | ||
ref={mediaElementRef} | ||
data-lk-local-participant={false} | ||
data-lk-source={track.source} | ||
data-lk-orientation={track.orientation} | ||
muted={true} | ||
// onClick={clickHandler} | ||
{...rest} | ||
/> | ||
); | ||
}; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a few components the agents sdk would need to export to do things that components-js
did previously, namely:
AgentVideoTrack
- Render a video track on screenAgentAudioTrack
- Render an audio track / auto play itAgentRoomAudioRenderer
- Render all audio tracks in the room usingAgentAudioTrack
For all three I kept the apis as close as possible to what they were before, except for now they take in AgentSession
state subslices (ie, tracks, the agent state, etc) based on what is needed.
// FIXME: Where would this kind of metadata come from for real? | ||
// const { message, hasBeenEdited, time, locale, name } = useChatMessage(entry, messageFormatter); | ||
const message = entry.content.text; | ||
const hasBeenEdited = false; | ||
const time = entry.timestamp; | ||
const locale = typeof navigator !== 'undefined' ? navigator.language : 'en-US'; | ||
const name = entry.direction === 'outbound' ? 'User' : 'Agent'; | ||
|
||
const isUser = entry.from?.isLocal ?? false; | ||
const isUser = entry.direction === 'outbound';//entry.from?.isLocal ?? false; | ||
const messageOrigin = isUser ? 'remote' : 'local'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Known issue that needs to be fixed: right now because agentSession.messages.list
is Array<ReceivedMessage>
, there's some type errors going on here due to these checks assuming it is actually Array<ReceivedMessage | SentMessage>
. I need to see what the swift implementation did in this case, given the "loopback" approach means only received messages would be in this list - I could probably check to see if it's a sent chat message directly but that wouldn't generalize well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copied over what I think are the relevant comments from the previous pull request:
|
||
// FIXME: I think there may need to be some error handling logic to ensure the below for await | ||
// properly exposes errors via `this.closeWithError` | ||
for await (const chunk of reader) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic needs some tests (in each SDK), not sure which iteration is the "correct one" at the moment.
export type ReceivedMessage = | ||
| ReceivedTranscriptionMessage | ||
| ReceivedChatLoopbackMessage; | ||
// TODO: images? attachments? rpc? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO that's a great opportunity to visit the "attachments" thingy, @lukasIO had some suggestions on ID-based attachments (e.g. text message + image).
agent-sdk/agent-session/Agent.ts
Outdated
// FIXME: maybe add some sort of schema to this? | ||
attributes: AgentAttributes; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the defined attributes are primarily intended for internal usage.
If anything, I think we should strip all internal lk attributes and only expose the ones set by users.
room.on(RoomEvent.ConnectionStateChanged, handleConnectionStateChanged); | ||
|
||
|
||
const connect = async (connectOptions: AgentSessionConnectOptions = {}) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From @pblazej:
- Let's make it symmetrical to the agents' counterpart
- Consider a way to inject
Room
async def start( self, agent: Agent, *, room: NotGivenOr[rtc.Room] = NOT_GIVEN, room_input_options: NotGivenOr[room_io.RoomInputOptions] = NOT_GIVEN, room_output_options: NotGivenOr[room_io.RoomOutputOptions] = NOT_GIVEN, ) -> None:
From @1egoman:
IMO, the difference in name is worthwhile / intentional, you "start" an agent on the server, but you "connect" to an agent that has been started somewhere else. But yea having some explicit options and not just mirroring room.connect() (and maybe those options could mirror the python ones) is worthwhile.
Just to help me understand more, what are some use cases where you'd want to pass a room in from the outside? I'd think the room should largely be an implementation detail of AgentSession and passing in a room means it could maybe not be exclusively used by AgentSession / allowing for manual manipulation that could invalidate assumptions about how things are organized in an agent connection scenario.
|
||
agent: AgentInstance; | ||
local: LocalInstance; | ||
messages: MessagesInstance; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From @pblazej:
Any strong (or JS-specific) arguments why it couldn't be flattened into one "observable" object? Agent sounds more like Participant vs an object tracking Room events.
From @1egoman:
FWIW that was the intention, I want to avoid AgentSession becoming large and having to deal with all the quirks of the agent participant / worker participant, the "on behalf of" stuff, etc.
It sounds like you are proposing this become more of a general "state container" that would have a larger responsibility than it currently does?
From @pblazej:
Yes, 2 main points here:
Agent should not know that much about the Room - looks inverted from OOP perspective to me
other ecosystems may prefer "1 smart observable publishing n stupid DTOs"
Personally I don't think AgentSession will ever be "too big" on the client side (look at the agents impl then)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still am not really a fan of getting rid of the hierarchy in AgentSession
, and actually went fairly in the other direction as I was further developing the react level abstractions. I don't think one large / "god class" type AgentSession
will provide a good developer experience or be as easy to maintain as smaller abstractions with well defined interfaces between them.
room.on(RoomEvent.ConnectionStateChanged, handleConnectionStateChanged); | ||
|
||
|
||
const connect = async (connectOptions: AgentSessionConnectOptions = {}) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
implicitly enabling the microphone on connect allows only for a unnecessarily narrow set of use cases
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lukas and I have discussed this a fair bit and we both seem to have very strong opposite opinions. I'm not exactly sure what should be done here, maybe we need to get on a call and discuss it further / find some sort of middleground.
const sendMessage: SendMessageFunction = async <Message extends SentMessage | string>( | ||
message: Message, | ||
options?: Message extends SentMessage ? SentMessageOptions<Message> : SentChatMessageOptions, | ||
) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I cannot think of another type and would prefer the simplicity of just passing a string.
If other message types come up in the future we can add overloads either specifically for those (or in a more general manner) at a later stage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can do that, but if this is the case, then then all the message sending logic is extremely overbuilt - with only one type of message to handle, I think I'd change this function into a room.localParticipant.sendText(...)
call and get rid of all the SentMessage
/ MessageSender
/ etc abstractions all together.
I'd like to make sure @pblazej sees this and gives me a 👍 before I do it - maybe something is being missed. Plus probably this means that these same abstractions should also change in the swift implementation as well.
The scenario this is used in: 1. User freshly views the app, and no permissions are granted 2. User grants microphone permission, but NOT camera 3. Camera device list computed, which returns an empty list / fails 4. User clicks on camera, enables it 5. Now that the camera is enabled, refetch device list - if this didn't occur, the device list would still be empty
This allows for stuff like useAgentEvents(agentSession?.local.camera, ...)
…nd add LiteralConnectionCredentials
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for extracting out the types, this makes it so much nicer to review :)
I focused exclusively on those for this review without looking at the actual implementation.
A lot of my comments are things you've heard from me already over some discussions, but I think there are also some new ones in there.
I think this is going in a great direction in general.
I'm being very nit picky in some of these comments, with the goal to ensure the exposed API surface is intuitive, consistent and positively opinionated for the most common use cases.
* FIXME: is this a confusing property to expose? Maybe expose one `signal` that universally | ||
* could apply across the whole agentSession.connect(...) call? | ||
*/ | ||
waitForDisconnectSignal?: AbortSignal; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on this being confusing. Why do we need this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The use case I had in mind - you want to connect, but want to time out after n
ms if the connect has not started yet due to not yet being disconnected first. So:
await session.connect();
// Will timeout after 1000ms because the session hasn't disconnected before then
await session.connect({ waitForDisconnectSignal: AbortSignal.timeout(1000) });
I think ideally this could just be signal
and room.connect
could also accept it too, but it doesn't seem to accept an abortsignal.
I'm torn, It seems like not exposing it is a bad idea (since otherwise connect could have the possibility of never resolving), but I couldn't come up with a good way to expose it that wasn't awkward.
MessageReceived = 'messageReceived', | ||
Connected = 'connected', | ||
Disconnected = 'disconnected', | ||
AgentConnectionFailure = 'agentConnectionFailure', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would we separate this from the AgentSessionConnectionState
? In which scenario should a connection be declared successful if there's no agent in the room?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the naming of this could be better in hindsight, this event gets triggered when an agent doesn't connect after the agent connect timeout. Maybe AgentConnectionTimeout
?
There's maybe an argument to be made that a better way to handle this could be to add another AgentSessionConnectionState
entry like failed
or something which a timeout could cause but maybe other types of issues could roll up into as well (what else could cause an agent failure?), and then one could use AgentSessionEvent.AgentSessionConnectionState
filtering for failed
to get a more abstract notion of the same event.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My question would still be whether a connection attempt should have transitioned to connected
already before a agent has joined?
Does the sequence connecting -> connected -> failed
exist? or would it always only go directly from connecting -> failed
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear, right now there is no "failed" state at all, so right now if the agent didn't connect, it would go from connecting
-> connected
, and then after the timeout logic runs it would emit the event and go to disconnected
.
It sounds like though that introducing a failed
state would probably be a good idea - so, my second suggestion above. Plus there could be a failedReason
field that could be used to make clear it failed due to the agent not connecting. I'll do that unless I hear otherwise.
AgentAttributesChanged = 'agentAttributesChanged', | ||
MessageReceived = 'messageReceived', | ||
Connected = 'connected', | ||
Disconnected = 'disconnected', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need events for both connectionstatechanged
and connected
/disconnected
?
I'd be in favor of simplicity/less events
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably not, updated. FWIW the Room
has both as well, which was why I mirrored that interface.
export type AgentSessionConnectionState = 'disconnected' | 'connecting' | 'connected' | 'reconnecting' | 'signalReconnecting'; | ||
|
||
export enum AgentSessionEvent { | ||
AgentConnectionStateChanged = 'agentConnectionStateChanged', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the payload of this is AgentSessionConnectionState
which is arguably conceptually different to AgentConnectionState
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, this was a typo - renamed to AgentSessionEvent.ConnectionStateChanged
. I could maybe do AgentSessionEvent.AgentSessionConnectionStateChanged
but that seems redundant.
* If true, adds an `exact` constraint to the getUserMedia request. | ||
* The request will fail if this option is true and the device specified is not actually available | ||
*/ | ||
exact?: boolean; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's easier to reason about for users if device switching is exact
always.
Do you have use cases in mind for which a user switching the device should not throw on the device being unavailable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I copied this from components-js
: https://github.com/livekit/components-js/blob/d343373934baa3e412626107bb9e500454bb51ed/packages/core/src/components/mediaDeviceSelect.ts#L14-L20
I don't have a strong preference in a vaccum, but I think unless there's a good reason to deviate from the defaults components-js
sets for proxied interfaces like this, I'd be hesitant to make that change. IMO if the default should change, it should change in both places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, just that the default cannot change in components-js without it being a breaking change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep - IMO for that reason it should stay as it is, and maybe as part of the client 3.0 project the default can change and be cascaded through.
export type MessagesInstance = { | ||
[Symbol.toStringTag]: "MessagesInstance", | ||
|
||
list: Array<ReceivedMessage>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- does this grow infinitely?
- where/how are sent messages retrieved?
<Message extends SentMessage>(message: Message, options: SentMessageOptions<Message>): Promise<void>; | ||
}; | ||
|
||
export type MessagesInstance = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this relates to probably all *Instance
concepts you defined, can users not listen to events by using *instance.on(myTargetEvent, handler)
?
|
||
export enum MessagesEvent { | ||
MessageReceived = 'messageReceived', | ||
Disconnected = 'disconnected', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why would messages
(conceptually) disconnect
?
import { Participant, RemoteAudioTrack, RemoteTrackPublication, Track } from 'livekit-client'; | ||
import { participantEvents } from './LocalTrack'; | ||
|
||
export type RemoteTrackInstance<TrackSource extends Track.Source> = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
differentiating between video and audio would make sense here, I think.
A couple of reasons:
-
things like
volume
,dimensions
,orientation
etc. are media kind dependent -
audio (which is the main focus) could be attached automatically within the SDK, reducing one further setup and reducing overall api surface
/** Given a media element, properly plumb the media stream through to it so media can be shown / heard in the app. */ | ||
attachToMediaElement: (element: TrackSource extends Track.Source.Microphone | Track.Source.ScreenShareAudio ? HTMLAudioElement : HTMLVideoElement) => () => void; | ||
|
||
setSubscribed: (subscribed: boolean) => void; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this agent wrapper needs APIs for explicit subscription. What use cases did you have in mind?
…Disconnected in favor of AgentSessionEvent.AgentConnectionStateChanged
…ntSessionEvent.ConnectionStateChanged
…on microphone / camera localtracks
(This pull request is not intended to be merged! Just here for comments.)
In #237, I had focused largely on the under the hood abstractions. However, there was some lack of clarity around the goals for that work - it seems like not having external / user facing abstractions made it difficult to really evaluate.
So, this change develops the react layer into something closer to what I had been envisioning. The internal abstractions from the previous change have changed minimally / if at all, with the biggest change being that
AgentSession
andAgent
have been converted from classes into a less mutative / reference equality checkable form to work better with the state management logic.For an easily digestible single-file example, see here:
components/single-page-demo.tsx
. If you clone down this branch locally, editing this file is a good way to try out the sdk / understand a little more what it would be like to use in practice.