Skip to content

Conversation

@MsMatias
Copy link

This PR addresses several performance bottlenecks and architectural limitations in the current implementation:

  1. The file-based audio processing was causing unnecessary I/O overhead and disk writes
  2. The single VAD option (PyAnnote) is resource-intensive and overkill for many use cases
  3. The error handling for concurrent processing was insufficient, causing potential crashes
  4. The callback structure was tightly coupled to WebSockets, making it difficult to extend

Summary

This PR implements several significant improvements to the VoiceStreamAI speech recognition system:

  1. In-Memory Audio Processing: Replaced file-based audio handling with in-memory processing to improve performance and reduce I/O operations
  2. Added Silero VAD Support: Integrated Silero Voice Activity Detection as an alternative to PyAnnote
  3. Improved Callback System: Implemented a structured callback architecture for better event handling
  4. Fixed Language Detection: Corrected the language handling in Faster Whisper ASR
  5. Code Structure Reorganization: Better organized utility functions and improved logging

Key Technical Changes

Audio Processing

  • Eliminated temporary file operations by processing audio directly in memory
  • Created utility function convert_audio_bytes_to_numpy() for efficient conversion
  • Updated ASR components to work with in-memory audio data

Voice Activity Detection

  • Added Silero VAD implementation which is lighter and more efficient
  • Made Silero the new default VAD instead of PyAnnote
  • Restructured PyAnnote to work with in-memory buffers

Architecture Improvements

  • Implemented AudioProcessingCallbacks class for event-driven architecture
  • Improved error handling for concurrent audio processing
  • Fixed task cancellation for incomplete processing
  • Enhanced logging with customized binary filter

Configuration

  • Added new dependencies to requirements.txt (soundfile, silero-vad)
  • Optimized default chunk length for better responsiveness

Testing

These changes have been tested with both VAD types and verify that:

  • Transcription works correctly with in-memory audio
  • The callback system properly handles audio processing events
  • Language detection functions as expected
  • Processing is more efficient without temporary files
  • Task cancellation works as expected

Performance Improvements

  • Reduced Latency: Eliminating temporary file operations reduces processing time by ~200-300ms per audio chunk
  • Lower Resource Usage: Silero VAD uses significantly less memory than PyAnnote (~80MB vs ~2GB)
  • Improved Stability: Better error handling and task cancellation prevents resource leaks
  • Enhanced Scalability: The callback architecture makes it easier to add new processing steps or output methods

This PR addresses several issues with the original implementation and should improve overall performance and reliability of the system.

@alesaccoia
Copy link
Owner

Hey @MsMatias thanks for this PR! going through it, testing.
At first sight looks very good! but the client demo included in the framework is not working, can you confirm?
Also, would be great to update the test/ to be sure that with the mock websocket implementation, and also update the README maybe explaining that the silero is the default option now.

Let me know if you can go through it otherwise I can try and find some time next week.

@MsMatias
Copy link
Author

Hey @MsMatias thanks for this PR! going through it, testing. At first sight looks very good! but the client demo included in the framework is not working, can you confirm? Also, would be great to update the test/ to be sure that with the mock websocket implementation, and also update the README maybe explaining that the silero is the default option now.

Let me know if you can go through it otherwise I can try and find some time next week.

Hi @alesaccoia. Thanks for the feedback, indeed, I will take a look and update the tests as well.

@MsMatias
Copy link
Author

@alesaccoia Please, let me know if you think we should add something else.

@alesaccoia
Copy link
Owner

alesaccoia commented Apr 18, 2025 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants