Conversation
@1egoman 1egoman commented Sep 15, 2025

This change comprises the new client Agents SDK: a set of React hooks being built to make interaction with the LiveKit Agents framework less complex.

This is version 3; version 1 can be found here, and version 2 can be found here. At each step it has evolved significantly based on comments and perspectives from people who have taken a look!

Single file example

import { useEffect, useState } from "react";
import { Track, TokenSource } from "livekit-client";
import {
  useConversationWith,
  useAgent,
  useConversationMessages,

  VideoTrack,
  StartAudio,
  RoomAudioRenderer,
  useMediaDeviceSelect,
  useTrackToggle,
} from "@livekit/components-react";
import { Button } from "./components/Button"; // NOTE: `Button` is assumed to be a local UI component, not part of the SDK

// From: https://github.com/livekit/client-sdk-js/pull/1645
const tokenSource = new TokenSource.SandboxTokenServer({ sandboxId: "xxx" });

export default function SinglePageDemo() {
  const conversation = useConversationWith('voice ai quickstart', { tokenSource });

  const agent = useAgent(conversation);

  // FIXME: still using the old local participant related hooks, so this isn't much simpler than in
  // the past. Eventually I think there needs to be something like a `useLocalTrack(conversation.local.camera)`
  // hook that abstracts over all this...
  const audioDevices = useMediaDeviceSelect({ kind: "audioinput", room: conversation.subtle.room });
  const microphoneTrack = useTrackToggle({ source: Track.Source.Microphone, room: conversation.subtle.room });
  const videoDevices = useMediaDeviceSelect({ kind: "videoinput", room: conversation.subtle.room });
  const cameraTrack = useTrackToggle({ source: Track.Source.Camera, room: conversation.subtle.room });

  const [started, setStarted] = useState(false);
  useEffect(() => {
    if (!started) {
      return;
    }
    conversation.start();
    return () => {
      conversation.end();
    };
  }, [started]);

  const { messages, send, isSending } = useConversationMessages(conversation);
  const [chatMessage, setChatMessage] = useState('');

  return (
    <div className="flex flex-col gap-4 p-4">
      <div className="flex items-center gap-4">
        <Button variant="primary" onClick={() => setStarted(s => !s)} disabled={conversation.connectionState === 'connecting'}>
          {conversation.isConnected ? 'Disconnect' : 'Connect'}
        </Button>
        <span>
          <strong className="mr-1">Statuses:</strong>
          {conversation.connectionState} / {agent.state ?? 'N/A'}
        </span>
      </div>

      {conversation.isConnected ? (
        <>
          <div className="border rounded bg-muted p-2">
            <Button onClick={() => cameraTrack.toggle()} disabled={cameraTrack.pending}>
              {cameraTrack.enabled ? 'Disable' : 'Enable'} local camera
            </Button>
            <Button onClick={() => microphoneTrack.toggle()} disabled={microphoneTrack.pending}>
              {microphoneTrack.enabled ? 'Mute' : 'Unmute'} local microphone
            </Button>
            <div>
              <p>Local camera sources:</p>
              <ul>
                {videoDevices.devices.map(item => (
                  <li
                    key={item.deviceId}
                    onClick={() => videoDevices.setActiveMediaDevice(item.deviceId)}
                    style={{ color: item.deviceId === videoDevices.activeDeviceId ? 'red' : undefined }}
                  >
                    {item.label}
                  </li>
                ))}
              </ul>
            </div>
            <div>
              <p>Local microphone sources:</p>
              <ul>
                {audioDevices.devices.map(item => (
                  <li
                    key={item.deviceId}
                    onClick={() => audioDevices.setActiveMediaDevice(item.deviceId)}
                    style={{ color: item.deviceId === audioDevices.activeDeviceId ? 'red' : undefined }}
                  >
                    {item.label}
                  </li>
                ))}
              </ul>
            </div>
          </div>

          <div>
            {conversation.local.camera.publication ? (
              <VideoTrack trackRef={conversation.local.camera} />
            ) : null}
            {agent.camera ? (
              <VideoTrack trackRef={agent.camera} />
            ) : null}
          </div>

          <ul>
            {messages.map(receivedMessage => (
              <li key={receivedMessage.id}>{receivedMessage.message}</li>
            ))}
            <li className="flex items-center gap-1">
              <input
                type="text"
                value={chatMessage}
                onChange={e => setChatMessage(e.target.value)}
                className="border border-2"
              />
              <Button
                variant="secondary"
                disabled={isSending}
                onClick={() => {
                  send(chatMessage);
                  setChatMessage('');
                }}
              >{isSending ? 'Sending' : 'Send'}</Button>
            </li>
          </ul>
        </>
      ) : null}

      <StartAudio label="Start audio" />
      <RoomAudioRenderer room={conversation.subtle.room} />
    </div>
  );
}

New API surface area

  • useConversationWith(agentName: string, options: UseConversationWithOptions): Conversation
    A thin wrapper around a Room which handles connecting to a room and dispatching a given agent into that room (or in the future, maybe multiple agents?). In the future it will probably become thicker as more global agent state is required.
const tokenSource: TokenSource = /* ... */;

const conversation = useConversationWith('agent name to dispatch', {
  // NOTE: either `room` can be a property here, or if not specified, it reads `room` from `RoomContext`
  tokenSource,
});

useEffect(() => {
  conversation.start();
  return () => {
    conversation.end();
  };
}, [conversation]);

// NOTE: what dispatching multiple agents could look like in the future:
// `useConversationWith(['agent a', 'agent b'], { tokenSource });`

// TBD: does there need to be a way to start a conversation without dispatching an agent, maybe so
// automatic dispatch can occur? If so, maybe add:
// `useConversation(options: UseConversationWithOptions): Conversation`?
  • useAgent(conversation: Conversation): Agent
    A much more advanced version of the previously existing useVoiceAssistant hook - tracks the agent's state within the conversation, manages agent connection timeouts / other failures, and largely maintains backwards compatibility with existing interfaces.
const agent = useAgent(conversation);

// Log agent connection errors
useEffect(() => {
  if (agent.state === "failed") {
    console.error(`Error connecting to agent: ${agent.failureReasons.join(", ")}`);
  }
}, [agent]);

// later on, in a component:
<VideoTrack trackRef={agent.camera} /> 
  • useConversationMessages
    A mechanism for interacting with ReceivedMessages across the whole conversation. A ReceivedMessage can be a ReceivedChatMessage (already exists today), or a ReceivedUserTranscriptionMessage / ReceivedAgentTranscriptionMessage (both brand new). This is exposed at the conversation level so in a future world where multiple agents are within a conversation, this hook will return messages from all of them.
const { messages, isSending, send } = useConversationMessages(conversation);
// NOTE: send / isSending are proxies of the existing interface returned by `useChat`

// later on, in a component:
<ul>
  {messages.map(receivedMessage => (
    <li key={receivedMessage.id}>{receivedMessage.from?.name}: {receivedMessage.message}</li>
  ))}
</ul>

Additional refactoring / cleanup

  • Added a new ParticipantAgentAttributes constant and ported all usages of lk.-prefixed attributes (which previously were just magic strings in the code) to refer to this enum.
  • Fixed type error in handleMediaDeviceError callback function in useLiveKitRoom
  • Added support for explicit room parameter to a few hooks and components that didn't support it previously, to make single file example type scenarios easier:
    • RoomAudioRenderer
    • StartAudio
    • useChat
    • useTextStream
    • useTrackToggle
    • useTranscriptions
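To illustrate the first cleanup item, here is a minimal sketch of what replacing magic `lk.`-prefixed attribute strings with a named constant can look like. The attribute keys and constant shape below are illustrative only; the actual `ParticipantAgentAttributes` in the PR may differ.

```typescript
// Illustrative only: the real ParticipantAgentAttributes may use different
// keys or be a TypeScript enum rather than a const object.
const ParticipantAgentAttributes = {
  state: 'lk.agent.state',
  publishOnBehalf: 'lk.publish_on_behalf',
} as const;

// Before: attributes['lk.agent.state'] scattered through the codebase.
// After: one named reference, so typos become compile-time errors.
function getAgentState(attributes: Record<string, string>): string | undefined {
  return attributes[ParticipantAgentAttributes.state];
}
```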


changeset-bot bot commented Sep 15, 2025

⚠️ No Changeset found

Latest commit: 7073ac6

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


@1egoman 1egoman force-pushed the agent-sdk branch 3 times, most recently from ce80b15 to ef0fed7 on September 17, 2025 at 20:44
Comment on lines 8 to 39
type ReceivedMessageWithType<
  Type extends string,
  Metadata extends {} = {},
> = {
  id: string;
  timestamp: number;

  type: Type;

  from?: Participant;
  attributes?: Record<string, string>;
} & Metadata;

/** @public */
export type ReceivedChatMessage = ReceivedMessageWithType<'chatMessage', ChatMessage & {
  from?: Participant;
  attributes?: Record<string, string>;
}>;

export type ReceivedUserTranscriptionMessage = ReceivedMessageWithType<'userTranscript', {
  message: string;
}>;

export type ReceivedAgentTranscriptionMessage = ReceivedMessageWithType<'agentTranscript', {
  message: string;
}>;

/** @public */
export type ReceivedMessage =
  | ReceivedUserTranscriptionMessage
  | ReceivedAgentTranscriptionMessage
  | ReceivedChatMessage;
Contributor Author

I ported the existing ReceivedMessage abstraction from here on top of the pre-existing ReceivedChatMessage, which means that ReceivedChatMessage is now a ReceivedMessage subtype.

Note that ReceivedChatMessage gains one new field, type, which acts as the discriminant key in ReceivedMessage, but is otherwise identical. So this should be a fully backwards compatible change even though a lot has been updated behind the scenes.
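As a sketch of how the new type discriminant can be used downstream (message shapes here are simplified to just the fields needed for the example; the real types also carry id, timestamp, from, etc.):

```typescript
// Simplified stand-in for the ReceivedMessage union described above.
type ReceivedMessage =
  | { type: 'chatMessage'; message: string }
  | { type: 'userTranscript'; message: string }
  | { type: 'agentTranscript'; message: string };

// The `type` field narrows the union, so each branch knows exactly
// which message kind it is handling.
function describe(msg: ReceivedMessage): string {
  switch (msg.type) {
    case 'chatMessage':
      return `chat: ${msg.message}`;
    case 'userTranscript':
      return `user said: ${msg.message}`;
    case 'agentTranscript':
      return `agent said: ${msg.message}`;
  }
}
```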

Comment on lines 1 to 10
import { RoomConfiguration } from '@livekit/protocol';
import { decodeJwt } from 'jose';

const ONE_SECOND_IN_MILLISECONDS = 1000;
const ONE_MINUTE_IN_MILLISECONDS = 60 * ONE_SECOND_IN_MILLISECONDS;

/**
 * TokenSource handles getting credentials for connecting to a new Room, caching
 * the last result and using it until it expires.
 */
export abstract class TokenSource {
Contributor Author

Note - for now I copied the token source code into here just for experimenting around, because as of mid september 2025, it still hasn't been merged. This will be removed before these changes get published.
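The caching behavior described in the doc comment can be sketched roughly like this. The class and method names here are illustrative, not the actual TokenSource API; the real implementation derives expiry from the JWT (via jose's decodeJwt) rather than taking it from the fetcher.

```typescript
type CachedCredentials = { token: string; expiresAtMs: number };

// Illustrative sketch of "cache the last result and use it until it expires".
class CachingTokenSource {
  private cached?: CachedCredentials;

  constructor(
    private fetchCredentials: () => Promise<CachedCredentials>,
    private now: () => number = Date.now,
  ) {}

  async getToken(): Promise<string> {
    // Refresh a minute early so a token never lapses mid-connection.
    const marginMs = 60 * 1000;
    if (this.cached && this.cached.expiresAtMs - marginMs > this.now()) {
      return this.cached.token;
    }
    this.cached = await this.fetchCredentials();
    return this.cached.token;
  }
}
```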

Comment on lines 15 to 34

/** State representing the current status of the agent, whether it is ready for speech, etc */
export type AgentStateNew = 'unset' | 'initializing' | 'failed' | 'idle' | 'listening' | 'thinking' | 'speaking';

Contributor Author

Some thinking needs to be done about this type name and some of its implications. AgentState is already being exported from this package by useVoiceAssistant, so it can't be the same thing.

Previously I had been using the same AgentState value, but @lukasIO pushed back on that in a previous comment so I split it out into two different values.

To disambiguate: the AgentState value here is different because useVoiceAssistant's AgentState currently conflates state related to the room connection lifecycle with the agent lifecycle. So useVoiceAssistant's AgentState has values like connecting, and the "new agent state" omits those, but has different new states that are specific to the new implementation like failed and unset (kind of a catch-all state for when the room isn't fully initialized and the agent hasn't started initializing yet).

So either this needs to be named something completely new or maybe I need to recombine all the old room connection related agent state values into the new value so it can be a strict superset, and therefore wouldn't be a backwards incompatible change?

Contributor

Previously I had been using the same AgentState value, but @lukasIO pushed back on that livekit-examples/agent-starter-react#237 (comment) so I split it out into two different values.

maybe this was a misunderstanding, my comment was meant to refer to a room connection state (-> your local connection state which would indicate local connection problems) vs an agent state.

Can you elaborate on when failed and unset would be readable?
If an agent isn't yet present, the agent object wouldn't be present either, right?

Contributor Author
@1egoman 1egoman Sep 18, 2025

@lukasIO

If an agent isn't yet present, the agent object wouldn't be present either, right?

This is incorrect - because useAgent is a hook, it is always possible for a user to call it no matter the state of the conversation. I suppose useAgent(conversationWithDisconnectedRoom) could return null in this case, but I opted to always return an object so it could be destructured rather than having to deal with the optional property accesses that returning a null at the root would necessitate. This is what the unset value is used to represent.

The useVoiceAssistant hook doesn't have this problem, because in effect its "unset" value is instead proxying the room.connectionState value - ie, conflating the two concepts. Here's how the two statuses interrelate.

Can you elaborate on when failed and unset would be readable?

If a room isn't yet fully connected, then right now an agent's state is unset. So for example:

const conversation = useConversationWith('agent name to dispatch', { tokenSource });
// Note: no room connection logic is happening here, so conversation.connectionState is "disconnected"
const agent = useAgent(conversation);
console.log(agent.state); // "unset"

If an agent never connects, then the state can go into failed (I left it generic so more "failures" could eventually be captured, but right now it's just agent timeout). For example:

const conversation = useConversationWith('agent name to dispatch', { tokenSource });
useEffect(() => { conversation.start() }, []); // Connect to room / dispatch agent
const agent = useAgent(conversation);

// Initially:
console.log(agent.state); // "connecting"

// After a delay, this could potentially happen:
console.log(agent.state); // "failed"
console.log(agent.failureReasons); // ["Agent did not join the room."]

Contributor

but I opted to always return an object so it could be destructured rather than having to deal with the optional property accesses that returning a null at the root would necessitate.

The main upside I see with the approach you're taking is that it would allow for "pretending" an agent is present in preconnect buffer usage.

However, this would in turn not make sense if, when the preconnect buffer is enabled, the value would be unset and not listening.

iiuc this simply shifts the problem from optional access to state check?

given that this also means you'd have to check against two "non-available" states of the agent, this pattern doesn't seem super obvious to me.

// Option 1
const agent = useAgent(conversation) as Agent | undefined;
return <>{agent ? <MyAgentComponent camera={agent.camera} /> : <p>Waiting for agent to connect</p>}</>;

// Option 2
const agent = useAgent(conversation) as Agent;
return <>{agent.state !== 'failed' && agent.state !== 'unset' ? <MyAgentComponent camera={agent.camera} /> : <p>Waiting for agent to connect</p>}</>;

Contributor Author
@1egoman 1egoman Sep 18, 2025

I think what I have been opting to do in that case is something closer to option 1, but checking agent.camera directly rather than using a state check, since, as you have identified, including / excluding certain states like in option 2 is awkward. So effectively I'm "pushing down" that null further than in your option 1.

So like:

const agent = useAgent(conversation);
// or:
// const { camera } = useAgent(conversation);

return (
  <>
    {agent.camera ? (
      <MyAgentComponent camera={agent.camera} />
    ) : (
      <p>Waiting for agent...</p>
    )}
  </>
);

Also keep in mind that before an agent is fully connected, there may be other properties available on it besides camera; returning null like you are doing above means there's no way to access those values. But if you don't return a top-level null, an object has to be returned, which results in this AgentState problem as the other side of the tradeoff.

Contributor

what other properties would those be?

the camera option is just an example. Imagine a text only agent that's not publishing anything itself, but you'd still want to check for its presence before doing/rendering some component

Contributor Author
@1egoman 1egoman Sep 19, 2025

If you want to check for the agent's presence, that's where the is-prefixed boolean properties come in, which effectively check sets of states. In particular I think isAvailable is what you are looking for in this case. So:

const agent = useAgent(conversation);

return (
  <>
    {agent.isAvailable ? (
      <p>Agent is ready for user interaction</p>
    ) : (
      <p>Waiting for agent...</p>
    )}
  </>
);

I largely copied this pattern from react-query, fwiw: https://tanstack.com/query/latest/docs/framework/react/reference/useQuery

(it's worth noting that agent.camera still remains nullable even when agent.isAvailable is asserted to be true, since an agent may not actually emit any sort of video. So what I mentioned in my last message still probably is the right thing to do for rendering a VideoTrack like that)
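A rough sketch of how an isAvailable flag can be derived from the state values discussed in this thread. The exact mapping is an assumption on my part; the PR may group states differently.

```typescript
// State values as proposed in this discussion (pre-rename).
type AgentState =
  | 'unset' | 'initializing' | 'failed'
  | 'idle' | 'listening' | 'thinking' | 'speaking';

// Assumption: "available" means the agent is past connection/initialization
// and has not failed; the real implementation may differ.
function isAvailable(state: AgentState): boolean {
  return state === 'idle' || state === 'listening'
    || state === 'thinking' || state === 'speaking';
}
```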

Contributor Author
@1egoman 1egoman Sep 19, 2025

Discussed with @lukasIO on a call - we decided to migrate unset -> connecting, and then add the failed status into the existing AgentState list.

Here is what I have now:

/** @see https://github.com/livekit/agents/blob/65170238db197f62f479eb7aaef1c0e18bfad6e7/livekit-agents/livekit/agents/voice/events.py#L97 */
type AgentSdkStates = 'initializing' | 'idle' | 'listening' | 'thinking' | 'speaking';

/**
 * State representing the current status of the agent, whether it is ready for speech, etc.
 *
 * For most agents (which have the preconnect audio buffer feature enabled), this is the lifecycle:
 *   connecting -> listening -> listening/thinking/speaking
 *
 * For agents without the preconnect audio feature enabled:
 *   connecting -> initializing -> idle/listening/thinking/speaking
 *
 * If an agent fails to connect:
 *   connecting -> listening/initializing -> failed
 *
 * Legacy useVoiceAssistant hook:
 *   disconnected -> connecting -> initializing -> listening/thinking/speaking
 */
export type AgentState = 'disconnected' | 'connecting' | 'failed' | AgentSdkStates;

Also, we decided that adding an isPreConnectBufferEnabled-style boolean for the listening state would probably be important; while in most use cases people probably wouldn't care about this, it seems like it would be good to be able to know in a subset of them.

Contributor

this generally sounds good to me, two minor things:

  • for the use case you outlined isPreConnectBufferEnabled isn't ideal as users cannot differentiate between different listening states easily. I think a flag that indicates that we're in a preliminary listening state would be preferable.
  • For most agents (which have the preconnect audio buffer feature enabled), this is the lifecycle: connecting -> listening -> listening/thinking/speaking – to clarify, why is initializing missing here? I think the agent could still set initializing on itself? 🤔

Contributor Author

for the use case you outlined isPreConnectBufferEnabled isn't ideal as users cannot differentiate between different listening states easily. I think a flag that indicates that we're in a preliminary listening state would be preferable.

Discussed on call - what I have now actually has this behavior sort of by accident 😬, but we couldn't come up with a good name for what this should be called at the time. With some more pondering I'm thinking maybe isBufferingSpeech, so agent.state === "listening" && agent.isBufferingSpeech. I'm going to go with that unless I hear objections to that name.

For most agents (which have the preconnect audio buffer feature enabled), this is the lifecycle: connecting -> listening -> listening/thinking/speaking – to clarify, why is initializing missing here? I think the agent could still set initializing on itself ? 🤔

Also discussed on a call: I didn't include it because I think in (almost?) all cases the preconnect buffer listening would occur while that initialization step is running, but I suppose it technically is a race condition and could be possible, so I updated the comment to add it in.

@1egoman 1egoman changed the title [WIP] Agent SDK - ported on top of components-js primatives Agent SDK - ported on top of components-js primatives Sep 17, 2025
@1egoman 1egoman marked this pull request as ready for review September 17, 2025 20:53
@1egoman 1egoman requested review from lukasIO and pblazej September 17, 2025 20:53
@lukasIO
Contributor

lukasIO commented Sep 18, 2025

const audioDevices = useMediaDeviceSelect({ kind: "audioinput", room: conversation.subtle.room });
const microphoneTrack = useTrackToggle({ source: Track.Source.Microphone, room: conversation.subtle.room });
const videoDevices = useMediaDeviceSelect({ kind: "videoinput", room: conversation.subtle.room });
const cameraTrack = useTrackToggle({ source: Track.Source.Camera, room: conversation.subtle.room });

how about proxying some of these on the return value of useConversation?

@1egoman
Contributor Author

1egoman commented Sep 18, 2025

how about proxying some of these on the return value of useConversation?

I opted to leave that out for now because of the hesitancy from ben/dz around new track abstractions. That being said, I mentioned in the comment above that I had been proposing a new hook, useLocalTrack, which would take in a track reference returned from other abstractions (conversation, agent, or even other non-agent-related abstractions), but it could also live underneath the conversation.

It sounds like you are pushing for that to exist now versus deferring it? If so, I can add that new hook to this branch or possibly figure out how to fit it into conversation.

Member
@davidzhao davidzhao left a comment

nice, i like the direction this is going.

};

type ConversationStateCommon = {
  subtle: {
Member

not sure about this naming. what is the intent for exposing it this way?

Contributor
@lukasIO lukasIO Sep 22, 2025

This was largely my ask; Ryan also felt it was a bit awkward.

The idea behind it:

Member

are we actually simplifying it by putting these things in another sub namespace? I could go either way on this, but initially it seems to be higher in cognitive load... wondering why things are done one way vs another.

Contributor

yeah, it's definitely not simplifying things if we expect users to regularly be using the things under the sub namespace.
It only makes sense to introduce it if we treat the things under the sub namespace as a mere escape hatch for users who know what they're doing.

Member

IMO we either find a more meaningful grouping, or just keep them flat. the "advanced" vs "not advanced" line is pretty hard to draw.

Contributor

for some of them (e.g. the emitter) I think it would be good to be hidden under a name after discussing this more with @1egoman.

some alternative suggestions:

  • advanced
  • raw
  • internal

Contributor Author
@1egoman 1egoman Oct 1, 2025

Did two things to try to address this:

  • Renamed subtle everywhere to internal - this key contains values that need to be exported so other hooks / logic can tap into them (so they can't just be internal implementation details of the hook / unexposed in the return value), but are not expected to be used by users directly (though they could in theory use them, with similar types of pitfalls as accessing _-prefixed properties in Python)
  • Moved room from useSession().subtle.room to useSession().room, since this is expected for users to access / is not internal.

1egoman added 14 commits October 1, 2025 11:14. Commit notes:

  • This allows functions to limit whether they just want to take in TrackReferences from a given source; i.e., the VideoTrack could be made to only accept TrackReference<Track.Source.Camera | Track.Source.Screenshare | Track.Source.Unknown>.
  • Note that just the return values are changing, not the argument definitions in other spots, so this shouldn't be a backwards compatibility issue.
  • The pre-existing state was broken.
  • …s going to be important for future multi-agent type use cases
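The source-limited TrackReference idea in the commit notes above could look roughly like this. This is a simplified sketch: the real TrackReference in components-js also carries participant and publication fields, and Track.Source is an enum from livekit-client rather than the string union used here.

```typescript
// Simplified stand-in for livekit-client's Track.Source enum.
type Source = 'camera' | 'microphone' | 'screen_share' | 'unknown';

// Simplified stand-in for components-js's TrackReference, parameterized
// by which sources it may refer to.
type TrackReference<S extends Source = Source> = {
  source: S;
  // ...participant, publication, etc. omitted
};

// A video component can then accept only video-capable sources; passing a
// microphone reference becomes a compile-time error.
function videoTrackLabel(ref: TrackReference<'camera' | 'screen_share' | 'unknown'>): string {
  return `video:${ref.source}`;
}
```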