# SpeechRecognitionAgent

## Overview

The `SpeechRecognitionAgent` in the RAI framework is a specialized agent that performs voice activity detection (VAD), audio recording, and transcription. It integrates tightly with audio input sources and ROS2 messaging, allowing it to serve as a real-time voice interface for robotic systems.

This agent manages multiple pipelines for detecting when to start and stop recording, performs transcription using configurable models, and broadcasts messages to relevant ROS2 topics.

## Class Definition

??? info "SpeechRecognitionAgent class definition"

    ::: rai_s2s.asr.agents.asr_agent.SpeechRecognitionAgent

## Purpose

The `SpeechRecognitionAgent` class enables real-time voice processing with the following responsibilities:

- Detecting speech through VAD
- Managing recording state and grace periods
- Buffering and threading transcription processes
- Publishing transcriptions and control messages to ROS2 topics
- Supporting multiple VAD and transcription model types

## Initialization Parameters

| Parameter             | Type                       | Description                                                                   |
| --------------------- | -------------------------- | ----------------------------------------------------------------------------- |
| `microphone_config`   | `SoundDeviceConfig`        | Configuration for the microphone input.                                       |
| `ros2_name`           | `str`                      | Name of the ROS2 node.                                                        |
| `transcription_model` | `BaseTranscriptionModel`   | Model instance for transcribing speech.                                       |
| `vad`                 | `BaseVoiceDetectionModel`  | Model for detecting voice activity.                                           |
| `grace_period`        | `float`                    | Time (in seconds) to continue buffering after speech ends. Defaults to `1.0`. |
| `logger`              | `Optional[logging.Logger]` | Logger instance. If `None`, defaults to the module logger.                    |

## Key Methods

### `from_config()`

Creates a `SpeechRecognitionAgent` instance from a YAML config file. Dynamically loads the required transcription and VAD models.
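
For illustration only, a config consumed by `from_config()` might look like the snippet below. The key names and model identifiers here are hypothetical; check the schema shipped with your RAI version:

```yaml
asr:
  recording_device_name: "default" # hypothetical key
  transcription_model:
    name: "LocalWhisper" # hypothetical model identifier
    model_name: "tiny"
    language: "en"
  vad:
    name: "SileroVAD"
    threshold: 0.5
```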

### `run()`

Starts the microphone stream and handles incoming audio samples.

### `stop()`

Stops the agent gracefully, joins all running transcription threads, and shuts down ROS2 connectors.

### `add_detection_model(model, pipeline="record")`

Adds a custom VAD model to a processing pipeline.

- `pipeline` can be either `'record'` or `'stop'`
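
The pipeline mechanics can be sketched as follows. This is a simplified model, not the actual RAI implementation; `PipelineRegistry` and its methods are illustrative names:

```python
from typing import Callable, Dict, List

# A detection model is anything that maps an audio chunk to a bool.
DetectionModel = Callable[[bytes], bool]


class PipelineRegistry:
    """Sketch of per-purpose VAD pipelines ('record' and 'stop')."""

    def __init__(self) -> None:
        self.pipelines: Dict[str, List[DetectionModel]] = {"record": [], "stop": []}

    def add_detection_model(self, model: DetectionModel, pipeline: str = "record") -> None:
        if pipeline not in self.pipelines:
            raise ValueError(f"Unknown pipeline: {pipeline!r}")
        self.pipelines[pipeline].append(model)

    def should_record(self, audio_chunk: bytes) -> bool:
        # Recording starts if ANY model in the 'record' pipeline fires.
        return any(model(audio_chunk) for model in self.pipelines["record"])


registry = PipelineRegistry()
registry.add_detection_model(lambda chunk: len(chunk) > 0)    # toy "energy" check
registry.add_detection_model(lambda chunk: b"wake" in chunk)  # toy wake-word check

print(registry.should_record(b"...wake..."))  # True
print(registry.should_record(b""))            # False
```

Whether multiple models are combined with OR (as here) or AND is a design choice of the concrete agent; consult the class definition above for the actual behavior.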

### `__call__()`

Alias for `run()`, enabling the agent to be used as a callable object.

### `_on_new_sample()`

Callback function triggered for each new audio sample. Determines:

- If recording should start
- Whether to continue buffering
- If grace period has ended
- When to start transcription threads
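
The decision logic above can be sketched as a small state machine. This is an illustrative simplification (the name `RecordingStateMachine` is hypothetical, and time is passed in explicitly instead of being read from a clock):

```python
from typing import Optional


class RecordingStateMachine:
    """Sketch of start/continue/stop decisions with a grace period."""

    def __init__(self, grace_period: float = 1.0) -> None:
        self.grace_period = grace_period
        self.recording = False
        self.last_voice_time: Optional[float] = None

    def on_new_sample(self, voice_detected: bool, now: float) -> str:
        if voice_detected:
            self.last_voice_time = now
            if not self.recording:
                self.recording = True
                return "start"        # begin buffering audio
            return "buffer"           # keep buffering while speech continues
        if self.recording:
            if now - self.last_voice_time <= self.grace_period:
                return "buffer"       # still inside the grace period
            self.recording = False
            return "transcribe"       # grace period over: hand buffer to a thread
        return "idle"


sm = RecordingStateMachine(grace_period=1.0)
print(sm.on_new_sample(True, now=0.0))   # start
print(sm.on_new_sample(True, now=0.5))   # buffer
print(sm.on_new_sample(False, now=1.2))  # buffer (within grace period)
print(sm.on_new_sample(False, now=2.0))  # transcribe (grace period elapsed)
```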

### `_transcription_thread(identifier)`

Handles transcription for a given buffer in a background thread. Uses locks to ensure safe access to the transcription model.
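
The threading pattern can be sketched like this (a stand-in, not RAI code; `TranscriptionWorkerPool` and the fake "transcription" are illustrative):

```python
import threading
from typing import Dict


class TranscriptionWorkerPool:
    """Sketch: one thread per utterance buffer, sharing one model under a lock."""

    def __init__(self) -> None:
        self.model_lock = threading.Lock()  # the shared model is not thread-safe
        self.threads: Dict[str, threading.Thread] = {}
        self.results: Dict[str, str] = {}

    def _transcription_thread(self, identifier: str, buffer: bytes) -> None:
        with self.model_lock:               # serialize access to the shared model
            self.results[identifier] = f"transcript of {len(buffer)} bytes"

    def start(self, identifier: str, buffer: bytes) -> None:
        t = threading.Thread(target=self._transcription_thread, args=(identifier, buffer))
        self.threads[identifier] = t
        t.start()

    def stop(self) -> None:
        # Graceful shutdown: wait for all in-flight transcriptions to finish.
        for t in self.threads.values():
            t.join()


pool = TranscriptionWorkerPool()
pool.start("utt-1", b"\x00" * 320)
pool.start("utt-2", b"\x00" * 640)
pool.stop()
print(pool.results["utt-1"])  # transcript of 320 bytes
```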

### `_should_record(audio_data, input_parameters)`

Evaluates the `should_record_pipeline` models to determine if recording should begin.

### `_send_ros2_message(data, topic)`

Sends a message to the given ROS2 topic, either as a plain string or as a structured HRI message.

## Best Practices

1. **Graceful Shutdown**: Always call `stop()` to ensure transcription threads complete.
2. **Model Compatibility**: Ensure all transcription and VAD models are compatible with the sample rate (typically 16 kHz).
3. **Thread Safety**: Use the provided locks for shared state, especially around the transcription model.
4. **Logging**: Utilize `self.logger` for debug and info logs to aid in tracing activity.
5. **Config-driven Design**: Use `from_config()` to ensure modular and portable deployment.
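
Practice 1 can be enforced with a small context-manager wrapper. This is a sketch around a hypothetical agent object (`running` is not part of RAI, and `SpeechRecognitionAgent` itself may not support the `with` statement):

```python
from contextlib import contextmanager


@contextmanager
def running(agent):
    """Run an agent and guarantee stop() is called, even on errors."""
    agent.run()
    try:
        yield agent
    finally:
        agent.stop()  # joins transcription threads, shuts down connectors


# Demonstration with a minimal stand-in agent:
class FakeAgent:
    def __init__(self):
        self.events = []

    def run(self):
        self.events.append("run")

    def stop(self):
        self.events.append("stop")


fake = FakeAgent()
try:
    with running(fake):
        raise RuntimeError("simulated failure mid-session")
except RuntimeError:
    pass
print(fake.events)  # ['run', 'stop']
```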

## Architecture

The `SpeechRecognitionAgent` typically interacts with the following components:

- **SoundDeviceConnector**: Interfaces with microphone audio input.
- **BaseVoiceDetectionModel**: Determines whether speech is present.
- **BaseTranscriptionModel**: Converts speech audio into text.
- **ROS2Connector / ROS2HRIConnector**: Publishes transcription and control messages to ROS2 topics.
- **Config Loader**: Dynamically creates the agent from structured config files.

## See Also

- [BaseAgent](../agents/overview.md): Abstract agent class providing lifecycle and logging support.
- [ROS2 Connectors](../connectors/ros2_connector.md): Communication layer for ROS2 topics.
- [Models](../models/overview.md): Available voice-based models and instructions for creating new ones.
- [TextToSpeech](tts.md): The `TextToSpeechAgent`, intended for distributed deployment.