Skip to content

Conversation

1egoman
Copy link
Contributor

@1egoman 1egoman commented Sep 2, 2025

(This pull request is not intended to be merged! Just here for comments.)

In #237, I had focused largely on the under the hood abstractions. However, there was some lack of clarity around the goals for that work - it seems like not having external / user facing abstractions made it difficult to really evaluate.

So, this change develops the react layer into something closer to what I had been envisioning. The internal abstractions from the previous change have changed minimally / if at all, with the biggest change being that AgentSession and Agent have been converted from classes into a less mutative / reference equality checkable form to work better with the state management logic.

For an easily digestible single-file example, see here: components/single-page-demo.tsx. If you clone down this branch locally, editing this file is a good way to try out the sdk / understand a little more what it would be like to use in practice.

1egoman added 30 commits August 11, 2025 12:03
@1egoman 1egoman changed the title Agent sdk prototype react Agent SDK Prototype - React layer Sep 2, 2025
Copy link

vercel bot commented Sep 2, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Updated (UTC)
sandbox-voice-assistant Error Error Sep 4, 2025 4:19pm

Comment on lines +188 to +190
export const createUseAgentSession = (options: AgentSessionOptions) => create<AgentSessionInstance>((set, get) => {
return createAgentSession(options, get, set);
});
Copy link
Contributor Author

@1egoman 1egoman Sep 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of providing a large list of smaller hooks, I opted to provide one single main hook useAgentSession which can be used to access the agent state and all user facing actions.

General notes

  • Behind the scenes, this is a zustand store, so const connect = useAgentSession(session => session.connect); can be used to get a smaller slice of the inner state.
  • To implement "component level" hooks for the future agents components, one would effectively just write a mapping function: const useMyFancyComponent = () => useAgentSession(session => [session.connect, session.disconnect]);.
  • While the interface here is primarily targeted towards react, it is possible to make a zustand store outside of a react context - see here. So if supporting non react platforms is important in the future (vanilla or other frameworks), that becomes the means to do it!

Backwards compatibility

At the moment this abstraction is optimized for use with the new agents specific components currently being discussed, or fully LLM generated components. However, if you wanted to use the connection management logic from this with existing components in components-js, there's a pretty straightforward escape hatch:

import { RoomContext } from '@livekit/components-react';

// later, in a component:
const agentSession = useAgentSession();
return (
  <RoomContext.Provider value={agentSession.subtle.room}>
    {/* rest of the app here */}
  </RoomContext.Provider>
);

It is unclear if this needs to be made more official and how that might be done if that needed to happen!

Comment on lines 82 to 140
type AgentSessionInstanceCommon = {
[Symbol.toStringTag]: "AgentSessionInstance",

credentials: ConnectionCredentialsProvider;

/** Returns a promise that resolves once the room connects. */
waitUntilConnected: (signal?: AbortSignal) => void;
/** Returns a promise that resolves once the room disconnects */
waitUntilDisconnected: (signal?: AbortSignal) => void;

agentConnectTimeout: {
delayInMilliseconds: number;
timeoutId: NodeJS.Timeout | null;
} | null,

prepareConnection: () => Promise<void>,
connect: (options?: AgentSessionConnectOptions) => Promise<void>;
disconnect: () => Promise<void>;

canPlayAudio: boolean;
startAudio: () => Promise<void>;

subtle: {
emitter: TypedEventEmitter<AgentSessionCallbacks>;
room: Room;
};
};

type AgentSessionInstanceConnecting = AgentSessionInstanceCommon & {
connectionState: "connecting";
isConnected: false;
isReconnecting: false;

agent: AgentInstance | null;
local: LocalInstance | null;
messages: MessagesInstance | null;
};

type AgentSessionInstanceConnected = AgentSessionInstanceCommon & {
connectionState: "connected" | "reconnecting" | "signalReconnecting";
isConnected: true;
isReconnecting: boolean;

agent: AgentInstance;
local: LocalInstance;
messages: MessagesInstance;
};

type AgentSessionInstanceDisconnected = AgentSessionInstanceCommon & {
connectionState: "disconnected";
isConnected: false;
isReconnecting: false;

agent: null;
local: null;
messages: null;
};

export type AgentSessionInstance = AgentSessionInstanceConnecting | AgentSessionInstanceConnected | AgentSessionInstanceDisconnected;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what the top level AgentSession state looks like - ie, what useAgentSession() would return by default.

Comment on lines 192 to 216
export function useAgentEvents<
Emitter extends TypedEventEmitter<EventMap>,
EmitterEventMap extends (Emitter extends TypedEventEmitter<infer EM> ? EM : never),
Event extends Parameters<Emitter["on"]>[0],
Callback extends EmitterEventMap[Event],
>(
instance: { subtle: { emitter: Emitter } },
event: Event,
handlerFn: Callback | undefined,
dependencies?: React.DependencyList
) {
const fallback = useMemo(() => () => {}, []);
const wrappedCallback = useCallback(handlerFn ?? fallback, dependencies ?? []);
const callback = dependencies ? wrappedCallback : handlerFn;

useEffect(() => {
if (!callback) {
return;
}
instance.subtle.emitter.on(event, callback);
return () => {
instance.subtle.emitter.off(event, callback);
};
}, [instance.subtle.emitter, event, callback]);
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to the main useAgentSession hook, there is one other hook exposed, useAgentEvents. This allows one to listen to events from within the agent sdk infrastructure, as these don't really fit into the existing declarative / zustand based paradigm.

For example:

const agentSession = useAgentSession();
useAgentEvents(agentSession.local.camera, LocalTrackEvent.ActiveDeviceChangeError, () => {
  /* handle event here - switching camera devices failed */
});

Behind the scenes, this function accepts any object which has a .subtle.emitter property and effectively acts as a declarative way to call .on() / .off().

Comment on lines +14 to +62
export const AgentVideoTrack: React.FunctionComponent<{
className?: string,
track: LocalTrackInstance<Track.Source.Camera | Track.Source.ScreenShare> | RemoteTrackInstance<Track.Source.Camera | Track.Source.ScreenShare>,
} & React.HTMLAttributes<HTMLVideoElement>> = ({ track, ...rest }) => {
// FIXME: imperative handle logic
const mediaElementRef = useRef<HTMLVideoElement>(null);

useEffect(() => {
if (!mediaElementRef.current) {
return;
}
const mediaElement = mediaElementRef.current;

if (!track.enabled) {
return;
}

let cleanup: (() => void) | null = null;
(async () => {
if (!track.isLocal) {
// FIXME: intersection observer logic
track.setSubscribed(true);
await track.waitUntilSubscribed(); // FIXME: move inside of attachToMediaElement
}

cleanup = track.attachToMediaElement(mediaElement);
})()

return () => {
if (!track.isLocal) {
track.setSubscribed(false);
}
cleanup?.();
};
}, [track]);

return (
<video
ref={mediaElementRef}
data-lk-local-participant={false}
data-lk-source={track.source}
data-lk-orientation={track.orientation}
muted={true}
// onClick={clickHandler}
{...rest}
/>
);
};

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a few components the agents sdk would need to export to do things that components-js did previously, namely:

  • AgentVideoTrack - Render a video track on screen
  • AgentAudioTrack - Render an audio track / auto play it
  • AgentRoomAudioRenderer - Render all audio tracks in the room using AgentAudioTrack

For all three I kept the apis as close as possible to what they were before, except for now they take in AgentSession state subslices (ie, tracks, the agent state, etc) based on what is needed.

Comment on lines +25 to 34
// FIXME: Where would this kind of metadata come from for real?
// const { message, hasBeenEdited, time, locale, name } = useChatMessage(entry, messageFormatter);
const message = entry.content.text;
const hasBeenEdited = false;
const time = entry.timestamp;
const locale = typeof navigator !== 'undefined' ? navigator.language : 'en-US';
const name = entry.direction === 'outbound' ? 'User' : 'Agent';

const isUser = entry.from?.isLocal ?? false;
const isUser = entry.direction === 'outbound';//entry.from?.isLocal ?? false;
const messageOrigin = isUser ? 'remote' : 'local';
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Known issue that needs to be fixed: right now because agentSession.messages.list is Array<ReceivedMessage>, there's some type errors going on here due to these checks assuming it is actually Array<ReceivedMessage | SentMessage>. I need to see what the swift implementation did in this case, given the "loopback" approach means only received messages would be in this list - I could probably check to see if it's a sent chat message directly but that wouldn't generalize well.

Copy link
Contributor Author

@1egoman 1egoman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copied over what I think are the relevant comments from the previous pull request:


// FIXME: I think there may need to be some error handling logic to ensure the below for await
// properly exposes errors via `this.closeWithError`
for await (const chunk of reader) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#237 (comment)

This logic needs some tests (in each SDK), not sure which iteration is the "correct one" at the moment.

export type ReceivedMessage =
| ReceivedTranscriptionMessage
| ReceivedChatLoopbackMessage;
// TODO: images? attachments? rpc?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#237 (comment)

IMO that's a great opportunity to visit the "attachments" thingy, @lukasIO had some suggestions on ID-based attachments (e.g. text message + image).

Comment on lines 43 to 44
// FIXME: maybe add some sort of schema to this?
attributes: AgentAttributes;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#237 (comment)

the defined attributes are primarily intended for internal usage.

If anything, I think we should strip all internal lk attributes and only expose the ones set by users.

room.on(RoomEvent.ConnectionStateChanged, handleConnectionStateChanged);


const connect = async (connectOptions: AgentSessionConnectOptions = {}) => {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#237 (comment)

From @pblazej:

  • Let's make it symmetrical to the agents' counterpart
  • Consider a way to inject Room
async def start(
        self,
        agent: Agent,
        *,
        room: NotGivenOr[rtc.Room] = NOT_GIVEN,
        room_input_options: NotGivenOr[room_io.RoomInputOptions] = NOT_GIVEN,
        room_output_options: NotGivenOr[room_io.RoomOutputOptions] = NOT_GIVEN,
    ) -> None:

From @1egoman:

IMO, the difference in name is worthwhile / intentional, you "start" an agent on the server, but you "connect" to an agent that has been started somewhere else. But yea having some explicit options and not just mirroring room.connect() (and maybe those options could mirror the python ones) is worthwhile.

Just to help me understand more, what are some use cases where you'd want to pass a room in from the outside? I'd think the room should largely be an implementation detail of AgentSession and passing in a room means it could maybe not be exclusively used by AgentSession / allowing for manual manipulation that could invalidate assumptions about how things are organized in an agent connection scenario.

Comment on lines +124 to +127

agent: AgentInstance;
local: LocalInstance;
messages: MessagesInstance;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#237 (comment)

From @pblazej:

Any strong (or JS-specific) arguments why it couldn't be flattened into one "observable" object? Agent sounds more like Participant vs an object tracking Room events.

From @1egoman:

FWIW that was the intention, I want to avoid AgentSession becoming large and having to deal with all the quirks of the agent participant / worker participant, the "on behalf of" stuff, etc.

It sounds like you are proposing this become more of a general "state container" that would have a larger responsibility than it currently does?

From @pblazej:

Yes, 2 main points here:

Agent should not know that much about the Room - looks inverted from OOP perspective to me
other ecosystems may prefer "1 smart observable publishing n stupid DTOs"
Personally I don't think AgentSession will ever be "too big" on the client side (look at the agents impl then)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still am not really a fan of getting rid of the hierarchy in AgentSession, and actually went fairly in the other direction as I was further developing the react level abstractions. I don't think one large / "god class" type AgentSession will provide a good developer experience or be as easy to maintain as smaller abstractions with well defined interfaces between them.

room.on(RoomEvent.ConnectionStateChanged, handleConnectionStateChanged);


const connect = async (connectOptions: AgentSessionConnectOptions = {}) => {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#237 (comment)

implicitly enabling the microphone on connect allows only for a unnecessarily narrow set of use cases

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lukas and I have discussed this a fair bit and we both seem to have very strong opposite opinions. I'm not exactly sure what should be done here, maybe we need to get on a call and discuss it further / find some sort of middleground.

Comment on lines +127 to +130
const sendMessage: SendMessageFunction = async <Message extends SentMessage | string>(
message: Message,
options?: Message extends SentMessage ? SentMessageOptions<Message> : SentChatMessageOptions,
) => {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#237 (comment)

I cannot think of another type and would prefer the simplicity of just passing a string.
If other message types come up in the future we can add overloads either specifically for those (or in a more general manner) at a later stage.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can do that, but if this is the case, then then all the message sending logic is extremely overbuilt - with only one type of message to handle, I think I'd change this function into a room.localParticipant.sendText(...) call and get rid of all the SentMessage / MessageSender / etc abstractions all together.

I'd like to make sure @pblazej sees this and gives me a 👍 before I do it - maybe something is being missed. Plus probably this means that these same abstractions should also change in the swift implementation as well.

The scenario this is used in:
1. User freshly views the app, and no permissions are granted
2. User grants microphone permission, but NOT camera
3. Camera device list computed, which returns an empty list / fails
4. User clicks on camera, enables it
5. Now that the camera is enabled, refetch device list - if this didn't
   occur, the device list would still be empty
This allows for stuff like useAgentEvents(agentSession?.local.camera, ...)
Copy link
Contributor

@lukasIO lukasIO left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for extracting out the types, this makes it so much nicer to review :)

I focused exclusively on those for this review without looking at the actual implementation.
A lot of my comments are things you've heard from me already over some discussions, but I think there are also some new ones in there.
I think this is going in a great direction in general.

I'm being very nit picky in some of these comments, with the goal to ensure the exposed API surface is intuitive, consistent and positively opinionated for the most common use cases.

* FIXME: is this a confusing property to expose? Maybe expose one `signal` that universally
* could apply across the whole agentSession.connect(...) call?
*/
waitForDisconnectSignal?: AbortSignal;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on this being confusing. Why do we need this?

Copy link
Contributor Author

@1egoman 1egoman Sep 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use case I had in mind - you want to connect, but want to time out after nms if the connect has not started yet due to not yet being disconnected first. So:

await session.connect();

// Will timeout after 1000ms because the session hasn't disconnected before then
await session.connect({ waitForDisconnectSignal: AbortSignal.timeout(1000) });

I think ideally this could just be signal and room.connect could also accept it too, but it doesn't seem to accept an abortsignal.

I'm torn, It seems like not exposing it is a bad idea (since otherwise connect could have the possibility of never resolving), but I couldn't come up with a good way to expose it that wasn't awkward.

MessageReceived = 'messageReceived',
Connected = 'connected',
Disconnected = 'disconnected',
AgentConnectionFailure = 'agentConnectionFailure',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would we separate this from the AgentSessionConnectionState? In which scenario should a connection be declared successful if there's no agent in the room?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the naming of this could be better in hindsight, this event gets triggered when an agent doesn't connect after the agent connect timeout. Maybe AgentConnectionTimeout?

There's maybe an argument to be made that a better way to handle this could be to add another AgentSessionConnectionState entry like failed or something which a timeout could cause but maybe other types of issues could roll up into as well (what else could cause an agent failure?), and then one could use AgentSessionEvent.AgentSessionConnectionState filtering for failed to get a more abstract notion of the same event.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My question would still be whether a connection attempt should have transitioned to connected already before a agent has joined?

Does the sequence connecting -> connected -> failed exist? or would it always only go directly from connecting -> failed ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, right now there is no "failed" state at all, so right now if the agent didn't connect, it would go from connecting -> connected, and then after the timeout logic runs it would emit the event and go to disconnected.

It sounds like though that introducing a failed state would probably be a good idea - so, my second suggestion above. Plus there could be a failedReason field that could be used to make clear it failed due to the agent not connecting. I'll do that unless I hear otherwise.

AgentAttributesChanged = 'agentAttributesChanged',
MessageReceived = 'messageReceived',
Connected = 'connected',
Disconnected = 'disconnected',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need events for both connectionstatechanged and connected/disconnected ?
I'd be in favor of simplicity/less events

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not, updated. FWIW the Room has both as well, which was why I mirrored that interface.

export type AgentSessionConnectionState = 'disconnected' | 'connecting' | 'connected' | 'reconnecting' | 'signalReconnecting';

export enum AgentSessionEvent {
AgentConnectionStateChanged = 'agentConnectionStateChanged',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the payload of this is AgentSessionConnectionState which is arguably conceptually different to AgentConnectionState.

Copy link
Contributor Author

@1egoman 1egoman Sep 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, this was a typo - renamed to AgentSessionEvent.ConnectionStateChanged. I could maybe do AgentSessionEvent.AgentSessionConnectionStateChanged but that seems redundant.

* If true, adds an `exact` constraint to the getUserMedia request.
* The request will fail if this option is true and the device specified is not actually available
*/
exact?: boolean;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's easier to reason about for users if device switching is exact always.
Do you have use cases in mind for which a user switching the device should not throw on the device being unavailable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I copied this from components-js: https://github.com/livekit/components-js/blob/d343373934baa3e412626107bb9e500454bb51ed/packages/core/src/components/mediaDeviceSelect.ts#L14-L20

I don't have a strong preference in a vaccum, but I think unless there's a good reason to deviate from the defaults components-js sets for proxied interfaces like this, I'd be hesitant to make that change. IMO if the default should change, it should change in both places.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, just that the default cannot change in components-js without it being a breaking change

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep - IMO for that reason it should stay as it is, and maybe as part of the client 3.0 project the default can change and be cascaded through.

export type MessagesInstance = {
[Symbol.toStringTag]: "MessagesInstance",

list: Array<ReceivedMessage>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. does this grow infinitely?
  2. where/how are sent messages retrieved?

<Message extends SentMessage>(message: Message, options: SentMessageOptions<Message>): Promise<void>;
};

export type MessagesInstance = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this relates to probably all *Instance concepts you defined, can users not listen to events by using *instance.on(myTargetEvent, handler) ?


export enum MessagesEvent {
MessageReceived = 'messageReceived',
Disconnected = 'disconnected',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why would messages (conceptually) disconnect ?

import { Participant, RemoteAudioTrack, RemoteTrackPublication, Track } from 'livekit-client';
import { participantEvents } from './LocalTrack';

export type RemoteTrackInstance<TrackSource extends Track.Source> = {
Copy link
Contributor

@lukasIO lukasIO Sep 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

differentiating between video and audio would make sense here, I think.

A couple of reasons:

  • things like volume, dimensions, orientation etc. are media kind dependent

  • audio (which is the main focus) could be attached automatically within the SDK, reducing one further setup and reducing overall api surface

/** Given a media element, properly plumb the media stream through to it so media can be shown / heard in the app. */
attachToMediaElement: (element: TrackSource extends Track.Source.Microphone | Track.Source.ScreenShareAudio ? HTMLAudioElement : HTMLVideoElement) => () => void;

setSubscribed: (subscribed: boolean) => void;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this agent wrapper needs APIs for explicit subscription. What use cases did you have in mind?

…Disconnected in favor of AgentSessionEvent.AgentConnectionStateChanged
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants