
Voice Agent Kit

A modular toolkit for adding a voice-powered AI copilot to any web application. Provides a complete voice pipeline — speech-to-text, voice activity detection, LLM reasoning with tool use, and text-to-speech — packaged as drop-in React components and Express handlers.

Built for government service portals (eRegistrations), but adaptable to any domain where users need guided, conversational assistance.

Packages

| Package | Description |
| --- | --- |
| @unctad-ai/voice-agent-core | Hooks, types, and configuration for the voice pipeline (VAD, audio, state management) |
| @unctad-ai/voice-agent-ui | Glass-morphism UI components — floating panel, orb, waveform, onboarding, settings |
| @unctad-ai/voice-agent-registries | Dynamic registries for form fields, UI actions, and client-side tool handlers |
| @unctad-ai/voice-agent-server | Express route handlers for chat, STT, and TTS with pluggable providers and automatic fallback chains |

All packages are published to npm under the @unctad-ai scope and versioned together.

Quick Start

Install

# Client
npm install @unctad-ai/voice-agent-core @unctad-ai/voice-agent-ui @unctad-ai/voice-agent-registries

# Server
npm install @unctad-ai/voice-agent-core @unctad-ai/voice-agent-server

# Peer dependencies (client)
npm install react react-dom motion lucide-react simplex-noise

Wire Up the Client

// voice-config.ts
import type { SiteConfig } from '@unctad-ai/voice-agent-core';

export const siteConfig: SiteConfig = {
  copilotName: 'Pesa',
  siteTitle: "Kenya's Business Gateway",
  farewellMessage: 'Feel free to come back anytime.',
  systemPromptIntro: 'You help investors navigate government services.',

  avatarUrl: '/avatar.png',
  colors: {
    primary: '#DB2129',
    processing: '#F59E0B',
    speaking: '#14B8A6',
    glow: '#f35f3f',
  },

  services: [/* your service catalog */],
  categories: [/* grouped categories */],
  synonyms: { tax: ['pin', 'vat', 'kra'] },
  categoryMap: { investor: 'Investor services' },
  routeMap: { home: '/', dashboard: '/dashboard' },
  getServiceFormRoute: (id) => `/dashboard/${id}`,
};
// App.tsx
import { lazy, Suspense, useState, useCallback } from 'react';
import { VoiceAgentProvider, VoiceOnboarding, VoiceA11yAnnouncer } from '@unctad-ai/voice-agent-ui';
import type { OrbState } from '@unctad-ai/voice-agent-core';
import { siteConfig } from './voice-config';

const GlassCopilotPanel = lazy(() =>
  import('@unctad-ai/voice-agent-ui').then(m => ({ default: m.GlassCopilotPanel }))
);

export default function App() {
  const [isOpen, setIsOpen] = useState(false);
  const [orbState, setOrbState] = useState<OrbState>('idle');

  return (
    <VoiceAgentProvider config={siteConfig}>
      {/* Your app routes here */}

      {!isOpen && <VoiceOnboarding onTryNow={() => setIsOpen(true)} />}
      <Suspense fallback={null}>
        <GlassCopilotPanel
          isOpen={isOpen}
          onOpen={() => setIsOpen(true)}
          onClose={() => setIsOpen(false)}
          onStateChange={setOrbState}
        />
      </Suspense>
      <VoiceA11yAnnouncer isOpen={isOpen} orbState={orbState} />
    </VoiceAgentProvider>
  );
}

Wire Up the Server

// server/index.ts
import express from 'express';
import { createVoiceRoutes } from '@unctad-ai/voice-agent-server';
import { siteConfig } from '../voice-config';

const app = express();
app.use(express.json());

const voice = createVoiceRoutes(siteConfig);
app.post('/api/chat', voice.chat);
app.post('/api/stt', voice.stt);
app.post('/api/tts', voice.tts);

app.listen(3001);

Architecture

flowchart LR
    subgraph BROWSER [" Browser "]
        direction TB
        A["🎤 Mic → VAD → STT"]
        B["GlassCopilotPanel"]
        C["TTS → 🔊 Speaker"]
        A -- transcript --> B
        B -- AI response --> C
        B -.- D["Registries\n(forms · navigation)"]
    end

    subgraph SERVER [" Server · Express "]
        direction TB
        S1["/api/stt"]
        S2["/api/chat"]
        S3["/api/tts"]
    end

    subgraph PROVIDERS [" Providers "]
        direction TB
        P1["Kyutai STT\n(self-hosted)"]
        P2["Groq API\n(cloud)"]
        P3["TTS Engine\n(self-hosted or cloud)"]
    end

    A -- "audio" --> S1
    S1 -- "text" --> A
    B -- "message" --> S2
    S2 -- "stream" --> B
    C -- "request" --> S3
    S3 -- "audio" --> C

    S1 --> P1
    S1 -.->|fallback| P2
    S2 --> P2
    S3 --> P3

Voice Pipeline

  1. VAD — TenVAD runs in-browser via WebAssembly and detects when the user starts and stops speaking
  2. STT — captured audio is sent to the server for transcription (see providers below)
  3. LLM — the transcript is sent to the Groq API with tool calling for search, navigation, and form filling
  4. TTS — the LLM response is streamed back as audio, with barge-in support (see providers below)
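The four stages compose into a single request/response turn. The sketch below is illustrative only — the `TurnIO` shape and state names are assumptions for the example, not the kit's actual API:

```typescript
// Illustrative model of one conversational turn through the pipeline.
// OrbState values mirror the states the UI components react to.
type OrbState = 'idle' | 'listening' | 'processing' | 'speaking';

interface TurnIO {
  transcribe: (audio: ArrayBuffer) => Promise<string>; // POST /api/stt
  chat: (text: string) => Promise<string>;             // POST /api/chat
  speak: (text: string) => Promise<void>;              // POST /api/tts + playback
}

// Runs one turn after VAD has captured an utterance, reporting each
// state transition via onState (e.g. a setOrbState callback).
async function runTurn(
  audio: ArrayBuffer,
  io: TurnIO,
  onState: (s: OrbState) => void
): Promise<string> {
  onState('processing');
  const transcript = await io.transcribe(audio); // step 2: STT
  const reply = await io.chat(transcript);       // step 3: LLM
  onState('speaking');
  await io.speak(reply);                         // step 4: TTS playback
  onState('idle');
  return reply;
}
```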

Providers

Each stage of the pipeline uses a configurable provider, selected via environment variables.

STT (Speech-to-Text)

| Provider | Type | Set via | Notes |
| --- | --- | --- | --- |
| Kyutai | Self-hosted | STT_PROVIDER=kyutai | Default. Runs as a sidecar container. Falls back to Groq Whisper on failure |
| Groq Whisper | Cloud API | STT_PROVIDER=groq | Uses whisper-large-v3-turbo via the Groq API. Also serves as the automatic fallback for Kyutai |

LLM (Chat)

| Provider | Type | Set via | Notes |
| --- | --- | --- | --- |
| Groq | Cloud API | GROQ_API_KEY | Default model: openai/gpt-oss-120b. Override with GROQ_MODEL |

TTS (Text-to-Speech)

Set with TTS_PROVIDER. Each GPU provider falls back to Pocket TTS and then to Resemble (cloud); Pocket TTS falls back directly to Resemble.

| Provider | Type | Set via | Notes |
| --- | --- | --- | --- |
| Qwen3-TTS | Self-hosted (GPU) | TTS_PROVIDER=qwen3-tts | Token-level streaming, ~200ms TTFA. Requires a GPU server |
| Chatterbox Turbo | Self-hosted (GPU) | TTS_PROVIDER=chatterbox-turbo | Sentence-level pipelining. Requires a GPU server |
| CosyVoice | Self-hosted (GPU) | TTS_PROVIDER=cosyvoice | Alibaba's voice synthesis. Requires a GPU server |
| Pocket TTS | Self-hosted (CPU) | TTS_PROVIDER=pocket-tts | Runs on CPU at ~0.5x realtime. Deployed as a Docker sidecar. Middle fallback for the GPU providers |
| Resemble | Cloud API | TTS_PROVIDER=resemble | Resemble AI streaming API. Last-resort fallback for all other providers |

Fallback Chains

qwen3-tts → pocket-tts → resemble
chatterbox-turbo → pocket-tts → resemble
cosyvoice → pocket-tts → resemble
pocket-tts → resemble
resemble (no fallback)
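These chains amount to ordered retry lists: try each provider in turn and return the first success. A minimal sketch of that behaviour — the provider names come from the table above, but the function shapes are assumptions for illustration:

```typescript
// A synthesis provider turns text into audio bytes.
type Synthesize = (text: string) => Promise<ArrayBuffer>;

// The fallback chains listed above, keyed by the configured primary provider.
const chains: Record<string, string[]> = {
  'qwen3-tts': ['qwen3-tts', 'pocket-tts', 'resemble'],
  'chatterbox-turbo': ['chatterbox-turbo', 'pocket-tts', 'resemble'],
  cosyvoice: ['cosyvoice', 'pocket-tts', 'resemble'],
  'pocket-tts': ['pocket-tts', 'resemble'],
  resemble: ['resemble'],
};

// Try each provider in order; the first success wins, and only if every
// provider in the chain fails does the last error surface to the caller.
async function synthesizeWithFallback(
  providers: Record<string, Synthesize>,
  primary: string,
  text: string
): Promise<ArrayBuffer> {
  let lastError: unknown;
  for (const name of chains[primary] ?? [primary]) {
    try {
      return await providers[name](text);
    } catch (err) {
      lastError = err; // fall through to the next provider in the chain
    }
  }
  throw lastError;
}
```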

Registries

Registries let consuming apps dynamically expose their UI to the voice agent.

Form Fields

import { useRegisterFormField } from '@unctad-ai/voice-agent-registries';

function CompanyNameInput() {
  const [value, setValue] = useState('');

  useRegisterFormField({
    id: 'companyName',
    label: 'Company Name',
    type: 'text',
    required: true,
    setter: setValue,
    group: 'company-details',
  });

  return <input value={value} onChange={e => setValue(e.target.value)} />;
}

The voice agent can now fill this field via the fillFormFields tool.
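Conceptually, a fillFormFields tool call has to map LLM-proposed field ids onto the registered setters. A simplified sketch of that dispatch — the registry's real internals may differ, and the names here are illustrative:

```typescript
// Minimal model of a form-field registry: each registered field exposes
// a setter the voice agent can drive.
interface RegisteredField {
  id: string;
  label: string;
  setter: (value: string) => void;
}

const fieldRegistry = new Map<string, RegisteredField>();

function registerField(field: RegisteredField): void {
  fieldRegistry.set(field.id, field);
}

// Applies the values proposed by the LLM; unknown ids are returned so the
// agent can tell the user which fields it could not fill.
function fillFormFields(values: Record<string, string>): string[] {
  const unknown: string[] = [];
  for (const [id, value] of Object.entries(values)) {
    const field = fieldRegistry.get(id);
    if (field) field.setter(value);
    else unknown.push(id);
  }
  return unknown;
}
```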

UI Actions

import { useRegisterUIAction } from '@unctad-ai/voice-agent-registries';

function Dashboard() {
  useRegisterUIAction({
    id: 'openSettings',
    label: 'Open Settings',
    category: 'navigation',
    handler: () => navigate('/settings'),
  });

  return <div>...</div>;
}

UI Components

| Component | Purpose |
| --- | --- |
| VoiceAgentProvider | Context provider — wraps your app with SiteConfig |
| GlassCopilotPanel | Main floating panel (392px, glass morphism, collapsed/expanded) |
| VoiceOnboarding | First-time user prompt to try the voice agent |
| VoiceA11yAnnouncer | Screen-reader live region for state changes |
| AgentAvatar | Copilot portrait with state-based visual effects |
| VoiceOrb | Animated speaking/processing indicator |
| VoiceWaveformCanvas | Real-time audio waveform visualization |
| VoiceControls | Mic, stop, and volume controls |
| VoiceSettingsView | User preferences (volume, speed, auto-listen, timeouts) |
| VoiceToolCard | Displays tool execution results inline |
| VoiceTranscript | Conversation transcript display |
| VoiceErrorBoundary | Error boundary with recovery UI |

Configuration

SiteConfig

interface SiteConfig {
  // Identity
  copilotName: string;          // Display name ("Pesa", "Tashi")
  siteTitle: string;            // Site name shown in UI
  farewellMessage: string;      // Said when closing session
  systemPromptIntro: string;    // LLM system prompt prefix

  // Branding
  colors: {
    primary: string;            // Main accent color
    processing: string;         // Shown during STT/thinking
    speaking: string;           // Shown during TTS playback
    glow: string;               // Orb glow effect
    error?: string;             // Error state
  };

  // Domain data
  services: ServiceBase[];      // Searchable service catalog
  categories: CategoryBase[];   // Grouped for browsing
  synonyms: Record<string, string[]>;  // Fuzzy search mappings
  categoryMap: Record<string, string>; // Category aliases
  routeMap: Record<string, string>;    // Named routes
  getServiceFormRoute: (serviceId: string) => string | null;

  // Optional
  avatarUrl?: string;
  extraServerTools?: Record<string, unknown>;
  thresholdOverrides?: Partial<VoiceThresholds>;
}
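The synonyms map drives fuzzy search by widening user queries before matching against the service catalog. A minimal sketch of that expansion — the kit's actual matching logic is not documented here, so treat the semantics as an assumption:

```typescript
// Expands a raw query with canonical terms whenever the user used one of
// their aliases, e.g. "kra" or "pin" both pull in "tax".
function expandQuery(
  query: string,
  synonyms: Record<string, string[]>
): string[] {
  const terms = new Set(query.toLowerCase().split(/\s+/));
  for (const [canonical, aliases] of Object.entries(synonyms)) {
    // If the user said an alias, also search for the canonical term.
    if (aliases.some(a => terms.has(a))) terms.add(canonical);
  }
  return [...terms];
}
```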

Voice Thresholds

Fine-tune VAD sensitivity via thresholdOverrides:

{
  positiveSpeechThreshold: 0.8,   // Confidence to start recording
  negativeSpeechThreshold: 0.4,   // Confidence to stop
  minSpeechFrames: 5,             // Min frames before accepting
  redemptionFrames: 15,           // ~600ms grace period
  minAudioRms: 0.005,             // Minimum volume level
}
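For example, to make the agent less trigger-happy in a noisy environment, you might override just two keys in your voice-config.ts. This assumes the override object is shallow-merged over the defaults above (suggested by the Partial&lt;VoiceThresholds&gt; type); the values are illustrative:

```typescript
// voice-config.ts — excerpt; identity, branding, and domain data as in Quick Start
import type { SiteConfig } from '@unctad-ai/voice-agent-core';

export const siteConfig: SiteConfig = {
  // ...rest of your configuration...
  thresholdOverrides: {
    positiveSpeechThreshold: 0.85, // demand higher confidence before recording
    redemptionFrames: 20,          // tolerate longer mid-sentence pauses (~800ms)
  },
} as SiteConfig;
```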

Development

Prerequisites

  • Node.js 22+
  • pnpm 10+

Setup

git clone https://github.com/unctad-ai/voice-agent-kit.git
cd voice-agent-kit
pnpm install

Commands

pnpm dev          # Watch mode (all packages)
pnpm build        # Build all packages
pnpm typecheck    # Type-check all packages

Project Structure

voice-agent-kit/
├── packages/
│   ├── core/           # Hooks, types, config, audio utilities
│   ├── ui/             # React components (tsup build)
│   ├── registries/     # Form/UI action registries
│   └── server/         # Express handlers (chat, STT, TTS)
├── scripts/
│   └── validate-release.sh
├── .changeset/         # Version management
├── .github/workflows/
│   ├── ci.yml          # Typecheck + validate on push/PR
│   └── publish.yml     # Publish to npm on v* tags
└── .husky/
    └── pre-commit      # Typecheck gate

Release

pnpm changeset        # Describe what changed (interactive)
git add . && git commit -m "chore: add changeset"

pnpm release          # Bumps versions + validates (clean build, dist check, dry-run publish)
git add . && git commit -m "chore: release vX.Y.Z"
git tag vX.Y.Z
git push --follow-tags && git push origin vX.Y.Z
# CI publishes to npm automatically

All four packages use fixed versioning — they always share the same version number.

License

ISC

About

Pluggable voice agent toolkit for government service portals — config-driven, site-agnostic
