# Models

## Overview

This package provides four primary types of models:

- **Voice Activity Detection (VAD)**
- **Wake Word Detection**
- **Transcription**
- **Text To Speech (TTS)**

These models are designed with simple and consistent interfaces to allow chaining and integration into audio processing pipelines.

## Model Interfaces

### VAD and Wake Word Detection API

All VAD and Wake Word detection models implement a common `detect` interface:

```python
    def detect(
        self, audio_data: NDArray, input_parameters: dict[str, Any]
    ) -> Tuple[bool, dict[str, Any]]:
```

This design supports chaining multiple models together by passing the output dictionary returned by one model as the `input_parameters` of the next.
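
For instance, the snippet below sketches how a single audio chunk could be passed through several detection models in sequence; the models themselves (e.g. a VAD followed by a wake word detector) are assumed to be constructed elsewhere, and the helper function is illustrative rather than part of the package:

```python
from typing import Any

import numpy as np


def run_detection_chain(models, audio_chunk: np.ndarray) -> bool:
    """Pass one audio chunk through a chain of detection models.

    Each model receives the output dictionary of the previous model as its
    `input_parameters`; the chunk is accepted only if every model fires.
    """
    parameters: dict[str, Any] = {}
    for model in models:
        detected, parameters = model.detect(audio_chunk, parameters)
        if not detected:
            return False
    return True
```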

### Transcription API

Transcription models implement the `transcribe` method:

```python
    def transcribe(self, data: NDArray[np.int16]) -> str:
```

This method takes raw audio data encoded as 16-bit signed integers (`np.int16`) and returns the corresponding text transcription.
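
For example, a small helper like the one below could record a few seconds of microphone audio and hand it to any transcription model; the use of the `sounddevice` library and the 16 kHz sample rate are illustrative assumptions, not requirements of the package:

```python
import sounddevice as sd  # any source of int16 samples works; sounddevice is just an example


def record_and_transcribe(model, seconds: int = 2, sample_rate: int = 16_000) -> str:
    """Record mono int16 audio and pass it to a model implementing `transcribe`."""
    audio = sd.rec(seconds * sample_rate, samplerate=sample_rate, channels=1, dtype="int16")
    sd.wait()  # block until the recording is finished
    return model.transcribe(audio.flatten())
```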

## Included Models

### SileroVAD

- Open source model: [GitHub](https://github.com/snakers4/silero-vad)
- No additional setup required
- Returns a confidence value indicating the presence of speech in the audio

### OpenWakeWord

- Open source project: [GitHub](https://github.com/dscripka/openWakeWord)
- Supports predefined and custom wake words
- Returns `True` when the specified wake word is detected in the audio

### OpenAIWhisper

- Cloud-based transcription model: [Documentation](https://platform.openai.com/docs/guides/speech-to-text)
- Requires setting the `OPEN_API_KEY` environment variable (see the example below)
- Offers language and model customization via the API
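
For reference, the variable can be exported in the shell before starting the application; the value shown is only a placeholder:

```bash
export OPEN_API_KEY="your-api-key"
```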

### LocalWhisper

- Local deployment of OpenAI Whisper: [GitHub](https://github.com/openai/whisper)
- Supports GPU acceleration
- Same configuration interface as OpenAIWhisper

### FasterWhisper

- Optimized Whisper variant: [GitHub](https://github.com/SYSTRAN/faster-whisper)
- Designed for high speed and low memory usage
- Follows the same API as Whisper models

### ElevenLabs

- Cloud-based TTS model: [Website](https://elevenlabs.io/)
- Requires the environment variable `ELEVENLABS_API_KEY` with a valid key

### OpenTTS

- Open source TTS solution: [GitHub](https://github.com/synesthesiam/opentts)
- Easy setup via Docker:

```bash
docker run -it -p 5500:5500 synesthesiam/opentts:en --no-espeak
```

- Provides a TTS server running on port 5500 (see the request example below)
- Supports multiple voices and configurations
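
As a quick check that the server is up, it can be queried over HTTP. The sketch below assumes the `/api/tts` endpoint described in the OpenTTS README, and the voice name is only a placeholder:

```python
import requests

# Ask the local OpenTTS server (started with the docker command above)
# to synthesize a short phrase and save the returned WAV data.
response = requests.get(
    "http://localhost:5500/api/tts",
    params={"voice": "larynx:ljspeech", "text": "Hello from OpenTTS"},  # placeholder voice name
    timeout=30,
)
response.raise_for_status()

with open("hello.wav", "wb") as wav_file:
    wav_file.write(response.content)
```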

## Custom Models

### Voice Detection Models

To implement a custom VAD or Wake Word model, inherit from `rai_asr.base.BaseVoiceDetectionModel` and implement the following methods:

```python
class MyDetectionModel(BaseVoiceDetectionModel):
    def detect(self, audio_data: NDArray, input_parameters: dict[str, Any]) -> Tuple[bool, dict[str, Any]]:
        ...

    def reset(self):
        ...
```
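
For illustration, a naive energy-threshold detector could look like the sketch below; the threshold value and the `energy` key added to the output dictionary are arbitrary choices for the example, not part of the package API:

```python
from typing import Any, Tuple

import numpy as np
from numpy.typing import NDArray

from rai_asr.base import BaseVoiceDetectionModel


class EnergyVAD(BaseVoiceDetectionModel):
    """Toy detector that reports speech whenever the mean signal energy is high enough."""

    def __init__(self, threshold: float = 0.01):
        self.threshold = threshold

    def detect(
        self, audio_data: NDArray, input_parameters: dict[str, Any]
    ) -> Tuple[bool, dict[str, Any]]:
        # normalize int16 samples to [-1, 1] and compute the mean energy of the chunk
        samples = audio_data.astype(np.float32) / 32768.0
        energy = float(np.mean(samples**2))
        output = dict(input_parameters)
        output["energy"] = energy  # extra key for downstream inspection only
        return energy > self.threshold, output

    def reset(self):
        # this toy detector keeps no state between chunks
        pass
```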

### Transcription Models

To implement a custom transcription model, inherit from `rai_asr.base.BaseTranscriptionModel` and implement:

```python
class MyTranscriptionModel(BaseTranscriptionModel):
    def transcribe(self, data: NDArray[np.int16]) -> str:
        ...
```
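
Purely as an illustration (the package already ships a FasterWhisper model), a custom model wrapping the `faster-whisper` library might look as follows; the int16-to-float32 conversion and the implicit 16 kHz sampling assumption reflect how Whisper-style models typically expect audio, not a requirement imposed by the base class:

```python
import numpy as np
from numpy.typing import NDArray
from faster_whisper import WhisperModel

from rai_asr.base import BaseTranscriptionModel


class MyFasterWhisper(BaseTranscriptionModel):
    """Illustrative transcription model backed by faster-whisper."""

    def __init__(self, model_size: str = "base", language: str = "en"):
        self.language = language
        self.model = WhisperModel(model_size, device="cpu", compute_type="int8")

    def transcribe(self, data: NDArray[np.int16]) -> str:
        # faster-whisper expects float32 audio in [-1, 1]
        audio = data.astype(np.float32) / 32768.0
        segments, _info = self.model.transcribe(audio, language=self.language)
        return " ".join(segment.text.strip() for segment in segments)
```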

### TTS Models

To create a custom TTS model, inherit from `rai_tts.models.base.TTSModel` and implement the required interface:

```python
class MyTTSModel(TTSModel):
    def get_speech(self, text: str) -> AudioSegment:
        ...
        return AudioSegment()

    def get_tts_params(self) -> Tuple[int, int]:
        ...
        return sample_rate, channels
```
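
For example, a minimal placeholder that returns silence can be used to exercise the interface end to end; it assumes `AudioSegment` here is pydub's class and that 16 kHz mono output is acceptable, both of which are assumptions made for this sketch:

```python
from typing import Tuple

from pydub import AudioSegment

from rai_tts.models.base import TTSModel


class SilentTTS(TTSModel):
    """Placeholder TTS model that 'speaks' silence, useful for wiring up a pipeline."""

    sample_rate = 16000  # assumed output sample rate for this sketch
    channels = 1

    def get_speech(self, text: str) -> AudioSegment:
        # roughly 60 ms of silence per character instead of real synthesis
        return AudioSegment.silent(
            duration=60 * max(len(text), 1), frame_rate=self.sample_rate
        )

    def get_tts_params(self) -> Tuple[int, int]:
        return self.sample_rate, self.channels
```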