
Commit be38bc7

docs: add tts agent docs
1 parent 35fa184 commit be38bc7

3 files changed, +205 −6 lines changed

docs/speech_to_speech/agents/asr.md

Lines changed: 2 additions & 6 deletions
@@ -53,9 +53,9 @@ Adds a custom VAD model to a processing pipeline.

 - `pipeline` can be either `'record'` or `'stop'`

-### `__call__()`
+!!! note "`'stop'` pipeline"

-Alias to `run()` method to enable usage like a callable object.
+    The `'stop'` pipeline is present for forward compatibility. It currently doesn't affect the Agent's functioning.

 ### `_on_new_sample()`

@@ -74,10 +74,6 @@ Handles transcription for a given buffer in a background thread. Uses locks to e

 Evaluates the `should_record_pipeline` models to determine if recording should begin.

-### `_send_ros2_message(data, topic)`
-
-Sends a message to the given ROS2 topic, either a plain string or structured HRI message.
-
 ## Best Practices

 1. **Graceful Shutdown**: Always call `stop()` to ensure transcription threads complete.

docs/speech_to_speech/agents/tts.md

Lines changed: 79 additions & 0 deletions
@@ -0,0 +1,79 @@
# TextToSpeechAgent

## Overview

The `TextToSpeechAgent` in the RAI framework is a modular agent responsible for converting incoming text into audio using a text-to-speech (TTS) model and playing it through a configured audio output device. It supports real-time playback control through ROS2 messages and handles asynchronous speech processing using threads and queues.

## Class Definition

??? info "TextToSpeechAgent class definition"

    ::: rai_s2s.tts.agents.TextToSpeechAgent

## Purpose

The `TextToSpeechAgent` enables:

- Real-time conversion of text to speech
- Playback control (play/pause/stop) via ROS2 messages
- Dynamic loading of TTS models from configuration
- Robust audio handling using queues and event-driven logic
- Integration with human-robot interaction (HRI) topics

## Initialization Parameters

| Parameter            | Type                       | Description                                             |
| -------------------- | -------------------------- | ------------------------------------------------------- |
| `speaker_config`     | `SoundDeviceConfig`        | Configuration for the audio output (speaker).           |
| `ros2_name`          | `str`                      | Name of the ROS2 node.                                  |
| `tts`                | `TTSModel`                 | Text-to-speech model instance.                          |
| `logger`             | `Optional[logging.Logger]` | Logger instance, or the default logger if `None`.       |
| `max_speech_history` | `int`                      | Number of speech message IDs to remember (default: 64). |

## Key Methods

### `from_config(cfg_path: Optional[str])`

Instantiates the agent from a configuration file, dynamically selecting the TTS model and setting up audio output.
### `run()`

Initializes the agent:

- Starts a thread to handle queued text-to-speech conversion
- Launches speaker playback via `SoundDeviceConnector`

### `stop()`

Gracefully stops the agent by setting the termination flag and joining the transcription thread.
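A minimal lifecycle sketch, assuming a hypothetical `config.toml`; the import path follows the class definition above:

```python
# Minimal lifecycle sketch; the config file path is hypothetical.
from rai_s2s.tts.agents import TextToSpeechAgent

agent = TextToSpeechAgent.from_config("config.toml")  # selects TTS model and speaker from config
agent.run()   # starts the conversion thread and speaker playback
...           # agent now converts incoming /to_human messages
agent.stop()  # sets the termination flag and joins worker threads
```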
## Communication

The agent uses the `ROS2HRIConnector` to communicate over two ROS2 topics:

- `/to_human`: Incoming text messages to convert. Uses `rai_interfaces/msg/HRIMessage`.
- `/voice_commands`: Playback control with ROS2 `std_msgs/msg/String`. Valid values: `"play"`, `"pause"`, `"stop"`
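For example, playback can be paused from any other ROS2 node with a standard `std_msgs/msg/String` publisher; a sketch using `rclpy` (node name and timing are arbitrary):

```python
# Sketch: pause TextToSpeechAgent playback from another ROS2 node.
import time

import rclpy
from std_msgs.msg import String

rclpy.init()
node = rclpy.create_node("tts_control_example")  # node name is arbitrary
pub = node.create_publisher(String, "/voice_commands", 10)
time.sleep(0.5)  # give discovery a moment before publishing
pub.publish(String(data="pause"))  # "play" and "stop" work the same way
node.destroy_node()
rclpy.shutdown()
```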
## Best Practices

1. **Queue Management**: Properly track transcription IDs to avoid queue collisions or memory leaks.
2. **Playback Sync**: Ensure audio queues are flushed on `stop` to avoid replaying outdated speech.
3. **Graceful Shutdown**: Always call `stop()` to terminate threads cleanly.
4. **Model Configuration**: Ensure model-specific settings (e.g., voice selection for ElevenLabs) are defined in config files.

## Architecture

The `TextToSpeechAgent` interacts with the following core components:

- **TTSModel**: Converts text into audio (e.g., ElevenLabsTTS, OpenTTS)
- **SoundDeviceConnector**: Sends synthesized audio to output hardware
- **ROS2HRIConnector**: Handles incoming HRI and command messages
- **Queues and Threads**: Enable asynchronous and buffered audio processing

## See Also

- [BaseAgent](../agents/overview.md#baseagent): Abstract base for all agents in RAI
- [SoundDeviceConnector](../connectors/sound_device_connector.md): For details on speaker configuration and streaming
- [Text-to-Speech Models](../models/tts_models.md): Supported TTS engines and usage
- [ROS2 HRI Messaging](../connectors/ros2_connector.md): Interfacing with `/to_human` and `/voice_commands`
- [Agent Configuration](../configuration/overview.md): Configuring TTS agents using YAML
Lines changed: 124 additions & 0 deletions
@@ -0,0 +1,124 @@
# Models

## Overview

This package provides three primary types of models:

- **Voice Activity Detection (VAD)**
- **Wake Word Detection**
- **Transcription**

These models are designed with simple, consistent interfaces to allow chaining and integration into audio processing pipelines.

## Model Interfaces

### VAD and Wake Word Detection API

All VAD and wake word detection models implement a common `detect` interface:

```python
def detect(
    self, audio_data: NDArray, input_parameters: dict[str, Any]
) -> Tuple[bool, dict[str, Any]]:
    ...
```

This design supports chaining multiple models together by passing the output dictionary (`input_parameters`) from one model into the next.
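For example, a VAD model can gate a wake word model by threading the returned dictionary through each `detect` call. A minimal sketch; the import path and both constructor calls are assumptions for illustration (see the Included Models section below):

```python
# Sketch: chain two detection models; import path and constructor
# arguments are assumptions, not a documented API.
import numpy as np
from rai_asr.models import OpenWakeWord, SileroVAD

vad = SileroVAD()                        # hypothetical default construction
wake_word = OpenWakeWord("hey jarvis")   # hypothetical wake word argument

chunk = np.zeros(512, dtype=np.int16)    # placeholder audio frame

speech, params = vad.detect(chunk, {})               # start with empty parameters
detected, params = wake_word.detect(chunk, params)   # feed the output dict onward
```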
### Transcription API

Transcription models implement the `transcribe` method:

```python
def transcribe(self, data: NDArray[np.int16]) -> str:
    ...
```

This method takes raw audio data encoded as 16-bit (2-byte) integers and returns the corresponding text transcription.
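A minimal call sketch, using `LocalWhisper` from the list below; the import path and model-size argument are assumptions:

```python
# Sketch: transcribe one second of 16-bit audio; import path and
# model-size argument are assumptions.
import numpy as np
from rai_asr.models import LocalWhisper

model = LocalWhisper("tiny")               # hypothetical model-size argument
samples = np.zeros(16000, dtype=np.int16)  # 1 s of silence at 16 kHz
text = model.transcribe(samples)
```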
## Included Models

### SileroVAD

- Open source model: [GitHub](https://github.com/snakers4/silero-vad)
- No additional setup required
- Returns a confidence value indicating the presence of speech in the audio

### OpenWakeWord

- Open source project: [GitHub](https://github.com/dscripka/openWakeWord)
- Supports predefined and custom wake words
- Returns `True` when the specified wake word is detected in the audio

### OpenAIWhisper

- Cloud-based transcription model: [Documentation](https://platform.openai.com/docs/guides/speech-to-text)
- Requires setting the `OPENAI_API_KEY` environment variable
- Offers language and model customization via the API

### LocalWhisper

- Local deployment of OpenAI Whisper: [GitHub](https://github.com/openai/whisper)
- Supports GPU acceleration
- Same configuration interface as OpenAIWhisper

### FasterWhisper

- Optimized Whisper variant: [GitHub](https://github.com/SYSTRAN/faster-whisper)
- Designed for high speed and low memory usage
- Follows the same API as the other Whisper models

### ElevenLabs

- Cloud-based TTS model: [Website](https://elevenlabs.io/)
- Requires the `ELEVENLABS_API_KEY` environment variable with a valid key

### OpenTTS

- Open source TTS solution: [GitHub](https://github.com/synesthesiam/opentts)
- Easy setup via Docker:

```bash
docker run -it -p 5500:5500 synesthesiam/opentts:en --no-espeak
```

- Provides a TTS server running on port 5500
- Supports multiple voices and configurations
87+
88+
### Voice Detection Models
89+
90+
To implement a custom VAD or Wake Word model, inherit from `rai_asr.base.BaseVoiceDetectionModel` and implement the following methods:
91+
92+
```python
93+
class MyDetectionModel(BaseVoiceDetectionModel):
94+
def detect(self, audio_data: NDArray, input_parameters: dict[str, Any]) -> Tuple[bool, dict[str, Any]]:
95+
...
96+
97+
def reset(self):
98+
...
99+
```
100+
101+
### Transcription Models
102+
103+
To implement a custom transcription model, inherit from `rai_asr.base.BaseTranscriptionModel` and implement:
104+
105+
```python
106+
class MyTranscriptionModel(BaseTranscriptionModel):
107+
def transcribe(self, data: NDArray[np.int16]) -> str:
108+
...
109+
```
110+
111+
### TTS Models
112+
113+
To create a custom TTS model, inherit from `rai_tts.models.base.TTSModel` and implement the required interface:
114+
115+
```python
116+
class MyTTSModel(TTSModel):
117+
def get_speech(self, text: str) -> AudioSegment:
118+
...
119+
return AudioSegment()
120+
121+
def get_tts_params(self) -> Tuple[int, int]:
122+
...
123+
return sample_rate, channels
124+
```
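A quick way to sanity-check a custom model is to exercise the interface directly before wiring it into an agent (the input text here is arbitrary):

```python
# Sketch: exercise the custom TTS model defined above.
model = MyTTSModel()
audio = model.get_speech("Hello from RAI")      # AudioSegment with synthesized speech
sample_rate, channels = model.get_tts_params()  # audio format for the speaker
```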
