
Commit be38bc7

docs: add tts agent docs
1 parent 35fa184 commit be38bc7

3 files changed, +205 −6 lines changed

docs/speech_to_speech/agents/asr.md

Lines changed: 2 additions & 6 deletions
@@ -53,9 +53,9 @@ Adds a custom VAD model to a processing pipeline.

 - `pipeline` can be either `'record'` or `'stop'`

-### `__call__()`
+!!! note "`'stop'` pipeline"

-Alias to `run()` method to enable usage like a callable object.
+    The `'stop'` pipeline is present for forward compatibility. It currently doesn't affect the Agent's functioning.

 ### `_on_new_sample()`

@@ -74,10 +74,6 @@ Handles transcription for a given buffer in a background thread. Uses locks to e

 Evaluates the `should_record_pipeline` models to determine if recording should begin.

-### `_send_ros2_message(data, topic)`
-
-Sends a message to the given ROS2 topic, either a plain string or structured HRI message.
-
 ## Best Practices

 1. **Graceful Shutdown**: Always call `stop()` to ensure transcription threads complete.

docs/speech_to_speech/agents/tts.md

Lines changed: 79 additions & 0 deletions
@@ -0,0 +1,79 @@
# TextToSpeechAgent

## Overview

The `TextToSpeechAgent` in the RAI framework is a modular agent responsible for converting incoming text into audio using a text-to-speech (TTS) model and playing it through a configured audio output device. It supports real-time playback control through ROS2 messages and handles asynchronous speech processing using threads and queues.

## Class Definition

??? info "TextToSpeechAgent class definition"

    ::: rai_s2s.tts.agents.TextToSpeechAgent

## Purpose

The `TextToSpeechAgent` enables:

- Real-time conversion of text to speech
- Playback control (play/pause/stop) via ROS2 messages
- Dynamic loading of TTS models from configuration
- Robust audio handling using queues and event-driven logic
- Integration with human-robot interaction (HRI) topics

## Initialization Parameters

| Parameter            | Type                       | Description                                             |
| -------------------- | -------------------------- | ------------------------------------------------------- |
| `speaker_config`     | `SoundDeviceConfig`        | Configuration for the audio output (speaker).           |
| `ros2_name`          | `str`                      | Name of the ROS2 node.                                  |
| `tts`                | `TTSModel`                 | Text-to-speech model instance.                          |
| `logger`             | `Optional[logging.Logger]` | Logger instance, or the default logger if `None`.       |
| `max_speech_history` | `int`                      | Number of speech message IDs to remember (default: 64). |

## Key Methods

### `from_config(cfg_path: Optional[str])`

Instantiates the agent from a configuration file, dynamically selecting the TTS model and setting up audio output.
### `run()`

Initializes the agent:

- Starts a thread to handle queued text-to-speech conversion
- Launches speaker playback via `SoundDeviceConnector`

### `stop()`

Gracefully stops the agent by setting the termination flag and joining the transcription thread.
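A minimal lifecycle sketch, assuming a hypothetical `config.toml`; the import path follows the class definition above:

```python
# Minimal lifecycle sketch; the config file path is hypothetical.
from rai_s2s.tts.agents import TextToSpeechAgent

agent = TextToSpeechAgent.from_config("config.toml")  # selects TTS model and speaker from config
agent.run()   # starts the conversion thread and speaker playback
...           # agent now converts incoming /to_human messages
agent.stop()  # sets the termination flag and joins worker threads
```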
## Communication

The agent uses the `ROS2HRIConnector` to communicate over two ROS2 topics:

- `/to_human`: Incoming text messages to convert. Uses `rai_interfaces/msg/HRIMessage`.
- `/voice_commands`: Playback control with ROS2 `std_msgs/msg/String`. Valid values: `"play"`, `"pause"`, `"stop"`
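For example, playback can be paused from any other ROS2 node with a standard `std_msgs/msg/String` publisher; a sketch using `rclpy` (node name and timing are arbitrary):

```python
# Sketch: pause TextToSpeechAgent playback from another ROS2 node.
import time

import rclpy
from std_msgs.msg import String

rclpy.init()
node = rclpy.create_node("tts_control_example")  # node name is arbitrary
pub = node.create_publisher(String, "/voice_commands", 10)
time.sleep(0.5)  # give discovery a moment before publishing
pub.publish(String(data="pause"))  # "play" and "stop" work the same way
node.destroy_node()
rclpy.shutdown()
```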
## Best Practices

1. **Queue Management**: Properly track transcription IDs to avoid queue collisions or memory leaks.
2. **Playback Sync**: Ensure audio queues are flushed on `stop` to avoid replaying outdated speech.
3. **Graceful Shutdown**: Always call `stop()` to terminate threads cleanly.
4. **Model Configuration**: Ensure model-specific settings (e.g., voice selection for ElevenLabs) are defined in config files.

## Architecture

The `TextToSpeechAgent` interacts with the following core components:

- **TTSModel**: Converts text into audio (e.g., ElevenLabsTTS, OpenTTS)
- **SoundDeviceConnector**: Sends synthesized audio to output hardware
- **ROS2HRIConnector**: Handles incoming HRI and command messages
- **Queues and Threads**: Enable asynchronous and buffered audio processing

## See Also

- [BaseAgent](../agents/overview.md#baseagent): Abstract base for all agents in RAI
- [SoundDeviceConnector](../connectors/sound_device_connector.md): For details on speaker configuration and streaming
- [Text-to-Speech Models](../models/tts_models.md): Supported TTS engines and usage
- [ROS2 HRI Messaging](../connectors/ros2_connector.md): Interfacing with `/to_human` and `/voice_commands`
- [Agent Configuration](../configuration/overview.md): Configuring TTS agents using YAML
Lines changed: 124 additions & 0 deletions
@@ -0,0 +1,124 @@
# Models

## Overview

This package provides three primary types of models:

- **Voice Activity Detection (VAD)**
- **Wake Word Detection**
- **Transcription**

These models are designed with simple, consistent interfaces to allow chaining and integration into audio processing pipelines.

## Model Interfaces

### VAD and Wake Word Detection API

All VAD and wake word detection models implement a common `detect` interface:

```python
def detect(
    self, audio_data: NDArray, input_parameters: dict[str, Any]
) -> Tuple[bool, dict[str, Any]]:
    ...
```

This design supports chaining multiple models together by passing the output dictionary (`input_parameters`) from one model into the next.
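For example, a VAD model can gate a wake word model by threading the returned dictionary through each `detect` call. A minimal sketch; the import path and both constructor calls are assumptions for illustration (see the Included Models section below):

```python
# Sketch: chain two detection models; import path and constructor
# arguments are assumptions, not a documented API.
import numpy as np
from rai_asr.models import OpenWakeWord, SileroVAD

vad = SileroVAD()                        # hypothetical default construction
wake_word = OpenWakeWord("hey jarvis")   # hypothetical wake word argument

chunk = np.zeros(512, dtype=np.int16)    # placeholder audio frame

speech, params = vad.detect(chunk, {})               # start with empty parameters
detected, params = wake_word.detect(chunk, params)   # feed the output dict onward
```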
### Transcription API

Transcription models implement the `transcribe` method:

```python
def transcribe(self, data: NDArray[np.int16]) -> str:
    ...
```

This method takes raw audio data encoded as 16-bit (2-byte) integers and returns the corresponding text transcription.
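A minimal call sketch, using `LocalWhisper` from the list below; the import path and model-size argument are assumptions:

```python
# Sketch: transcribe one second of 16-bit audio; import path and
# model-size argument are assumptions.
import numpy as np
from rai_asr.models import LocalWhisper

model = LocalWhisper("tiny")               # hypothetical model-size argument
samples = np.zeros(16000, dtype=np.int16)  # 1 s of silence at 16 kHz
text = model.transcribe(samples)
```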
## Included Models

### SileroVAD

- Open source model: [GitHub](https://github.com/snakers4/silero-vad)
- No additional setup required
- Returns a confidence value indicating the presence of speech in the audio

### OpenWakeWord

- Open source project: [GitHub](https://github.com/dscripka/openWakeWord)
- Supports predefined and custom wake words
- Returns `True` when the specified wake word is detected in the audio

### OpenAIWhisper

- Cloud-based transcription model: [Documentation](https://platform.openai.com/docs/guides/speech-to-text)
- Requires setting the `OPENAI_API_KEY` environment variable
- Offers language and model customization via the API

### LocalWhisper

- Local deployment of OpenAI Whisper: [GitHub](https://github.com/openai/whisper)
- Supports GPU acceleration
- Same configuration interface as OpenAIWhisper

### FasterWhisper

- Optimized Whisper variant: [GitHub](https://github.com/SYSTRAN/faster-whisper)
- Designed for high speed and low memory usage
- Follows the same API as the other Whisper models

### ElevenLabs

- Cloud-based TTS model: [Website](https://elevenlabs.io/)
- Requires the `ELEVENLABS_API_KEY` environment variable with a valid key

### OpenTTS

- Open source TTS solution: [GitHub](https://github.com/synesthesiam/opentts)
- Easy setup via Docker:

```bash
docker run -it -p 5500:5500 synesthesiam/opentts:en --no-espeak
```

- Provides a TTS server running on port 5500
- Supports multiple voices and configurations
87+
88+
### Voice Detection Models
89+
90+
To implement a custom VAD or Wake Word model, inherit from `rai_asr.base.BaseVoiceDetectionModel` and implement the following methods:
91+
92+
```python
93+
class MyDetectionModel(BaseVoiceDetectionModel):
94+
def detect(self, audio_data: NDArray, input_parameters: dict[str, Any]) -> Tuple[bool, dict[str, Any]]:
95+
...
96+
97+
def reset(self):
98+
...
99+
```
100+
101+
### Transcription Models
102+
103+
To implement a custom transcription model, inherit from `rai_asr.base.BaseTranscriptionModel` and implement:
104+
105+
```python
106+
class MyTranscriptionModel(BaseTranscriptionModel):
107+
def transcribe(self, data: NDArray[np.int16]) -> str:
108+
...
109+
```
110+
111+
### TTS Models
112+
113+
To create a custom TTS model, inherit from `rai_tts.models.base.TTSModel` and implement the required interface:
114+
115+
```python
116+
class MyTTSModel(TTSModel):
117+
def get_speech(self, text: str) -> AudioSegment:
118+
...
119+
return AudioSegment()
120+
121+
def get_tts_params(self) -> Tuple[int, int]:
122+
...
123+
return sample_rate, channels
124+
```
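A quick way to sanity-check a custom model is to exercise the interface directly before wiring it into an agent (the input text here is arbitrary):

```python
# Sketch: exercise the custom TTS model defined above.
model = MyTTSModel()
audio = model.get_speech("Hello from RAI")      # AudioSegment with synthesized speech
sample_rate, channels = model.get_tts_params()  # audio format for the speaker
```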
