Commit 35fa184 (parent 8b2298b): docs: add asr agent docs
1 file changed: docs/speech_to_speech/agents/asr.md (+104 lines)

# SpeechRecognitionAgent

## Overview

The `SpeechRecognitionAgent` in the RAI framework is a specialized agent that performs voice activity detection (VAD), audio recording, and transcription. It integrates tightly with audio input sources and ROS2 messaging, allowing it to serve as a real-time voice interface for robotic systems.

This agent manages multiple pipelines for detecting when to start and stop recording, performs transcription using configurable models, and broadcasts messages to the relevant ROS2 topics.

## Class Definition

??? info "SpeechRecognitionAgent class definition"

    ::: rai_s2s.asr.agents.asr_agent.SpeechRecognitionAgent

## Purpose

The `SpeechRecognitionAgent` class enables real-time voice processing with the following responsibilities:

- Detecting speech through VAD
- Managing recording state and grace periods
- Buffering audio and running transcription in background threads
- Publishing transcriptions and control messages to ROS2 topics
- Supporting multiple VAD and transcription model types

## Initialization Parameters

| Parameter             | Type                       | Description                                                                   |
| --------------------- | -------------------------- | ----------------------------------------------------------------------------- |
| `microphone_config`   | `SoundDeviceConfig`        | Configuration for the microphone input.                                       |
| `ros2_name`           | `str`                      | Name of the ROS2 node.                                                        |
| `transcription_model` | `BaseTranscriptionModel`   | Model instance for transcribing speech.                                       |
| `vad`                 | `BaseVoiceDetectionModel`  | Model for detecting voice activity.                                           |
| `grace_period`        | `float`                    | Time (in seconds) to continue buffering after speech ends. Defaults to `1.0`. |
| `logger`              | `Optional[logging.Logger]` | Logger instance. If `None`, defaults to the module logger.                    |

## Key Methods

### `from_config()`

Creates a `SpeechRecognitionAgent` instance from a YAML config file, dynamically loading the configured transcription and VAD models.

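The dynamic loading that `from_config()` performs can be sketched as a registry lookup over a parsed config. Everything below (class names, config keys, parameters) is an illustrative stand-in, not RAI's actual API:

```python
# Sketch of config-driven model construction, as from_config() might do it.
# DummyVAD, DummyTranscriber, and MODEL_REGISTRY are hypothetical.

class DummyVAD:
    def __init__(self, threshold=0.5):
        self.threshold = threshold

class DummyTranscriber:
    def __init__(self, model_name="base"):
        self.model_name = model_name

# Registry mapping config "type" strings to model classes.
MODEL_REGISTRY = {
    "dummy_vad": DummyVAD,
    "dummy_transcriber": DummyTranscriber,
}

def build_model(spec: dict):
    """Instantiate a model from an entry like {"type": ..., "params": {...}}."""
    cls = MODEL_REGISTRY[spec["type"]]
    return cls(**spec.get("params", {}))

# A parsed YAML config would yield a dict of this shape.
config = {
    "vad": {"type": "dummy_vad", "params": {"threshold": 0.6}},
    "transcription": {"type": "dummy_transcriber"},
}

vad = build_model(config["vad"])
transcriber = build_model(config["transcription"])
```

The registry keeps the agent decoupled from concrete model classes, which is what makes swapping VAD or transcription backends a config-only change.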
### `run()`

Starts the microphone stream and handles incoming audio samples.

### `stop()`

Stops the agent gracefully, joins all running transcription threads, and shuts down ROS2 connectors.

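The shutdown sequence can be sketched with a stop event and thread joins. The class and attribute names below are hypothetical, not RAI's internals:

```python
# Sketch of the graceful-shutdown pattern stop() describes: signal worker
# threads to finish, then join them before tearing down connectors.
import threading
import time

class Agent:
    def __init__(self):
        self._stop_event = threading.Event()
        self._threads = []

    def start_worker(self):
        t = threading.Thread(target=self._work)
        t.start()
        self._threads.append(t)

    def _work(self):
        while not self._stop_event.is_set():
            time.sleep(0.01)  # stand-in for processing audio buffers

    def stop(self):
        self._stop_event.set()  # ask workers to exit their loops
        for t in self._threads:
            t.join()            # wait for in-flight work to finish
        # ...ROS2 connectors would be shut down here...

agent = Agent()
agent.start_worker()
agent.stop()
```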
### `add_detection_model(model, pipeline="record")`

Adds a custom VAD model to a processing pipeline.

- `pipeline` can be either `'record'` or `'stop'`

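The two-pipeline design can be sketched as follows: models in the `'record'` pipeline vote on starting a recording, models in the `'stop'` pipeline on ending it. The holder class and attribute names are illustrative:

```python
# Sketch of add_detection_model() dispatching models to one of two pipelines.

class PipelineHolder:
    def __init__(self):
        self.should_record_pipeline = []
        self.should_stop_pipeline = []

    def add_detection_model(self, model, pipeline="record"):
        if pipeline == "record":
            self.should_record_pipeline.append(model)
        elif pipeline == "stop":
            self.should_stop_pipeline.append(model)
        else:
            raise ValueError(f"Unknown pipeline: {pipeline!r}")

holder = PipelineHolder()
holder.add_detection_model(object(), pipeline="record")
holder.add_detection_model(object(), pipeline="stop")
```

Rejecting unknown pipeline names early surfaces configuration mistakes at setup time rather than mid-stream.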
### `__call__()`

Alias for `run()`, allowing the agent to be used as a callable object.

### `_on_new_sample()`

Callback triggered for each new audio sample. Determines:

- Whether recording should start
- Whether to continue buffering
- Whether the grace period has ended
- When to start transcription threads

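The per-sample decisions above can be modeled as a small state machine: start recording when speech is detected, keep buffering while it continues, and close the buffer once the grace period after the last detected speech has elapsed. This is a simplified model with hypothetical names, not RAI's actual implementation:

```python
# Minimal state machine for the recording/grace-period logic.

class RecordingState:
    def __init__(self, grace_period=1.0, sample_duration=0.1):
        self.grace_period = grace_period      # seconds of silence tolerated
        self.sample_duration = sample_duration  # seconds per audio sample
        self.recording = False
        self.silence_time = 0.0
        self.closed_buffers = 0

    def on_new_sample(self, voice_detected: bool):
        if voice_detected:
            self.silence_time = 0.0
            self.recording = True             # start or continue recording
        elif self.recording:
            self.silence_time += self.sample_duration
            if self.silence_time >= self.grace_period:
                self.recording = False
                self.closed_buffers += 1      # hand buffer to transcription

state = RecordingState(grace_period=0.3, sample_duration=0.1)
for voiced in [True, True, False, False, False, True, False, False, False]:
    state.on_new_sample(voiced)
```

With a 0.3 s grace period and 0.1 s samples, each run of three silent samples after speech closes one buffer, so the sequence above yields two closed buffers.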
### `_transcription_thread(identifier)`

Handles transcription of a given buffer in a background thread, using locks to ensure safe access to the shared transcription model.

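The locking pattern can be sketched as below: one transcription model shared across threads, guarded by a lock so only one buffer is transcribed at a time. The model class and names are illustrative:

```python
# Sketch of lock-protected transcription across background threads.
import threading

class DummyTranscriptionModel:
    def transcribe(self, audio):
        return f"text-for-{audio}"

model = DummyTranscriptionModel()
model_lock = threading.Lock()
results = {}

def transcription_thread(identifier, audio):
    with model_lock:  # serialize access to the shared model
        results[identifier] = model.transcribe(audio)

threads = [
    threading.Thread(target=transcription_thread, args=(i, f"buf{i}"))
    for i in range(3)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```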
### `_should_record(audio_data, input_parameters)`

Evaluates the models in the `should_record_pipeline` to determine whether recording should begin.

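One plausible combination rule is to start recording if any model in the pipeline reports speech; RAI's exact rule may differ, and the detector signatures below are assumptions:

```python
# Sketch of evaluating a pipeline of detection models over an audio chunk.

def should_record(audio_data, pipeline):
    for detector in pipeline:
        detected, _confidence = detector(audio_data)
        if detected:
            return True
    return False

# Two toy detectors returning (detected, confidence) tuples.
always_quiet = lambda audio: (False, 0.1)
loud_enough = lambda audio: (max(audio) > 0.5, 0.9)

pipeline = [always_quiet, loud_enough]
```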
### `_send_ros2_message(data, topic)`

Sends a message to the given ROS2 topic, as either a plain string or a structured HRI message.

## Best Practices

1. **Graceful Shutdown**: Always call `stop()` to ensure transcription threads complete.
2. **Model Compatibility**: Ensure all transcription and VAD models are compatible with the sample rate (typically 16 kHz).
3. **Thread Safety**: Use the provided locks for shared state, especially around the transcription model.
4. **Logging**: Use `self.logger` for debug and info logs to aid in tracing activity.
5. **Config-driven Design**: Use `from_config()` to keep deployments modular and portable.

## Architecture

The `SpeechRecognitionAgent` typically interacts with the following components:

- **SoundDeviceConnector**: Interfaces with microphone audio input.
- **BaseVoiceDetectionModel**: Determines whether speech is present.
- **BaseTranscriptionModel**: Converts speech audio into text.
- **ROS2Connector / ROS2HRIConnector**: Publishes transcription and control messages to ROS2 topics.
- **Config Loader**: Dynamically creates the agent from structured config files.

## See Also

- [BaseAgent](../agents/overview.md): Abstract agent class providing lifecycle and logging support.
- [ROS2 Connectors](../connectors/ros2_connector.md): Communication layer for ROS2 topics.
- [Models](../models/overview.md): Available voice-based models and instructions for creating new ones.
- [TextToSpeech](tts.md): The `TextToSpeechAgent`, intended for distributed deployment.
