
Conversation

@yujonglee
Contributor

@yujonglee yujonglee commented Dec 5, 2025

Fix OpenAI Realtime API transcription test

Summary

Fixes the failing OpenAI Realtime API transcription test by implementing the correct API protocol for transcription-only sessions.

Key changes:

  • Use intent=transcription URL parameter instead of model parameter for transcription sessions
  • Wrap audio in base64-encoded JSON events (input_audio_buffer.append) instead of raw binary WebSocket messages (see the sketch after this list)
  • Add audio_to_message method to RealtimeSttAdapter trait with default binary passthrough for backward compatibility
  • Configure 24kHz sample rate (required by OpenAI) via params.sample_rate
  • Add speech_started/speech_stopped event handlers for debugging
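
A minimal sketch of the OpenAI-specific encoding described above, assuming `Message` is the tungstenite WebSocket message type and showing the trait method as a free function for brevity:

```rust
use base64::Engine;
use serde::Serialize;
use tokio_tungstenite::tungstenite::Message;

/// JSON event OpenAI expects for streamed audio in a transcription session.
#[derive(Serialize)]
struct InputAudioBufferAppend {
    #[serde(rename = "type")]
    event_type: String,
    /// Base64-encoded PCM16 audio (24 kHz for OpenAI).
    audio: String,
}

/// Wrap a raw audio chunk in an `input_audio_buffer.append` text event
/// instead of sending it as a binary WebSocket frame.
fn audio_to_message(audio: bytes::Bytes) -> Message {
    let event = InputAudioBufferAppend {
        event_type: "input_audio_buffer.append".to_string(),
        audio: base64::engine::general_purpose::STANDARD.encode(&audio),
    };
    Message::Text(
        serde_json::to_string(&event)
            .expect("struct of plain String fields always serializes")
            .into(),
    )
}
```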

Updates since last revision

  • Removed unused interleave_audio call in ListenClientDualIO::to_input (code review feedback)

Review & Testing Checklist for Human

  • Verify other adapters still work: The changes to live.rs modify how audio streams are transformed before sending to WebSocket. Run tests for Deepgram, AssemblyAI, and Soniox adapters to ensure the default audio_to_message implementation (raw binary) maintains backward compatibility
  • Review dual channel handling: The TransformedDualInput type in live.rs:145-162 carries pre-transformed messages; verify this logic is correct for native multichannel providers
  • Validate OpenAI API format: Confirm the input_audio_buffer.append JSON structure with base64 audio matches OpenAI's expected format per their docs

Recommended test plan:

# Run OpenAI test (requires API key, ~30s timeout needed)
TEST_TIMEOUT_SECS=30 OPENAI_API_KEY="..." cargo test -p owhisper-client adapter::openai::live::tests::test_build_single -- --ignored --nocapture

# Verify other adapters still work
cargo test -p owhisper-client adapter::deepgram
cargo test -p owhisper-client adapter::assemblyai

Notes

  • The test requires TEST_TIMEOUT_SECS=30 because OpenAI's VAD needs time to detect speech boundaries before returning a transcription
  • Added base64 dependency (v0.22.1, matching workspace version)

Link to Devin run: https://app.devin.ai/sessions/0e8cdca88bb14e52a1b645f66978d1f7
Requested by: yujonglee (@yujonglee)

- Add intent=transcription to WebSocket URL for transcription-only sessions
- Add session.type = transcription in session.update payload
- Implement audio_to_message method to wrap audio in base64-encoded JSON events
- Add InputAudioBufferAppend struct for proper audio event serialization
- Update live.rs to transform audio stream before passing to WebSocket client
- Add configurable sample rate support (OpenAI requires 24kHz PCM)
- Add speech_started and speech_stopped event handlers for better debugging
- Add base64 dependency for audio encoding

Co-Authored-By: yujonglee <[email protected]>
@devin-ai-integration
Contributor

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR that start with 'DevinAI' or '@devin'.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

@netlify

netlify bot commented Dec 5, 2025

Deploy Preview for hyprnote ready!

🔨 Latest commit: e1835b4
🔍 Latest deploy log: https://app.netlify.com/projects/hyprnote/deploys/693234d9e9fa1b0008eaefd0
😎 Deploy Preview: https://deploy-preview-2127--hyprnote.netlify.app
📱 Preview on mobile: QR code (scan with your smartphone camera)

To edit notification comments on pull requests, go to your Netlify project configuration.

@netlify

netlify bot commented Dec 5, 2025

Deploy Preview for hyprnote-storybook ready!

🔨 Latest commit: e1835b4
🔍 Latest deploy log: https://app.netlify.com/projects/hyprnote-storybook/deploys/693234d9bbfe110008677d9e
😎 Deploy Preview: https://deploy-preview-2127--hyprnote-storybook.netlify.app
📱 Preview on mobile: QR code (scan with your smartphone camera)

To edit notification comments on pull requests, go to your Netlify project configuration.

@coderabbitai
Contributor

coderabbitai bot commented Dec 5, 2025

📝 Walkthrough

Adds an audio-to-message conversion hook to RealtimeSttAdapter, updates the OpenAI adapter to emit base64 JSON audio payloads and intent-based WS URLs, refactors the live client to send pre-transformed Message objects (TransformedInput/TransformedDualInput), and adds the base64 dependency plus rate-aware test helpers.

Changes

  • Dependency Management: owhisper/owhisper-client/Cargo.toml
    Added base64 = "0.22.1" dependency.
  • Adapter Trait Extension: owhisper/owhisper-client/src/adapter/mod.rs
    Added trait method audio_to_message(&self, audio: bytes::Bytes) -> Message with a default of Message::Binary(audio).
  • OpenAI Adapter Implementation: owhisper/owhisper-client/src/adapter/openai/live.rs
    Implemented audio_to_message to produce a base64-encoded InputAudioBufferAppend JSON event wrapped as Message::Text; made WS URL building ignore ListenParams; replaced the hardcoded PCM rate with params.sample_rate; added handling for the new InputAudioBufferSpeechStarted/Stopped events; added debug traces and test rate wiring.
  • OpenAI Adapter Utilities: owhisper/owhisper-client/src/adapter/openai/mod.rs
    Renamed DEFAULT_MODEL to DEFAULT_TRANSCRIPTION_MODEL; refactored build_ws_url_from_base to drop the model parameter and ensure intent=transcription is present.
  • Live Client Refactoring: owhisper/owhisper-client/src/live.rs
    Added TransformedInput = MixedMessage<Message, ControlMessage> and TransformedDualInput = MixedMessage<(bytes::Bytes, bytes::Bytes, Message), ControlMessage>; changed ListenClientIO / ListenClientDualIO to use the transformed types; centralized calls to adapter.audio_to_message so audio streams produce Message objects before WebSocket transmission; updated dual/split forwarding to accept the adapter.
  • Test Utilities: owhisper/owhisper-client/src/test_utils.rs
    Renamed sample_rate() to default_sample_rate(); added rate-parameterized helpers test_audio_stream_*_with_rate and run_*_test_with_rate; threaded an explicit sample_rate through test runners and streams.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Mic as Microphone (realtime)
    participant Adapter as RealtimeSttAdapter
    participant Client as ListenClient / live.rs
    participant WS as OpenAI WebSocket
    participant Parser as OpenAIResponse Parser

    Mic->>Client: audio bytes stream
    Client->>Adapter: adapter.audio_to_message(bytes)
    Adapter-->>Client: Message (Text JSON / Binary)
    Client->>WS: send Message over WebSocket
    WS-->>Parser: incoming event(s)
    Parser-->>Client: OpenAIEvent (including SpeechStarted/Stopped)
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Pay attention to:
    • live.rs wiring: ensure all stream type changes (TransformedInput/TransformedDualInput) and conversions maintain expected message/control semantics.
    • OpenAI audio encoding: verify base64 serialization, JSON structure of InputAudioBufferAppend, and correct sample_rate usage.
    • build_ws_url changes: confirm intent parameter handling and no regressions for existing query parameters.
    • Tests: validate updated rate-aware test utilities and adjusted test constants.


Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 20.00%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.

✅ Passed checks (2 passed)
  • Title check ✅ Passed: The title clearly and concisely summarizes the main objective: fixing the OpenAI Realtime API transcription test. It is specific and directly related to the primary purpose of the PR.
  • Description check ✅ Passed: The description comprehensively covers the changes made, including key protocol updates, API modifications, implementation details, and testing guidance. It is well structured and directly related to the changeset.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
owhisper/owhisper-client/src/adapter/openai/live.rs (1)

40-48: Consider handling serialization error instead of .unwrap().

While InputAudioBufferAppend serialization is unlikely to fail, using .unwrap() on line 47 could panic. Consider graceful error handling or document why panic is acceptable here.

     fn audio_to_message(&self, audio: bytes::Bytes) -> Message {
         use base64::Engine;
         let base64_audio = base64::engine::general_purpose::STANDARD.encode(&audio);
         let event = InputAudioBufferAppend {
             event_type: "input_audio_buffer.append".to_string(),
             audio: base64_audio,
         };
-        Message::Text(serde_json::to_string(&event).unwrap().into())
+        // Safe: InputAudioBufferAppend contains only String fields which always serialize
+        Message::Text(serde_json::to_string(&event).expect("InputAudioBufferAppend serialization").into())
     }
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 34f8b16 and 51ae407.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (6)
  • owhisper/owhisper-client/Cargo.toml (1 hunks)
  • owhisper/owhisper-client/src/adapter/mod.rs (1 hunks)
  • owhisper/owhisper-client/src/adapter/openai/live.rs (11 hunks)
  • owhisper/owhisper-client/src/adapter/openai/mod.rs (4 hunks)
  • owhisper/owhisper-client/src/live.rs (8 hunks)
  • owhisper/owhisper-client/src/test_utils.rs (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
owhisper/owhisper-client/src/adapter/mod.rs (1)
owhisper/owhisper-client/src/adapter/openai/live.rs (1)
  • audio_to_message (40-48)
owhisper/owhisper-client/src/adapter/openai/live.rs (3)
owhisper/owhisper-client/src/adapter/mod.rs (2)
  • build_ws_url (41-41)
  • audio_to_message (59-61)
owhisper/owhisper-client/src/adapter/openai/mod.rs (1)
  • build_ws_url_from_base (24-53)
owhisper/owhisper-client/src/test_utils.rs (2)
  • run_dual_test_with_rate (124-174)
  • run_single_test_with_rate (74-115)
owhisper/owhisper-client/src/live.rs (2)
crates/ws/tests/client_tests.rs (2)
  • to_input (24-26)
  • to_message (28-30)
crates/ws/src/client.rs (2)
  • to_input (41-41)
  • to_message (42-42)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
  • GitHub Check: Devin
  • GitHub Check: Redirect rules - hyprnote
  • GitHub Check: Header rules - hyprnote
  • GitHub Check: Pages changed - hyprnote
  • GitHub Check: desktop_ci (linux, depot-ubuntu-22.04-8)
  • GitHub Check: desktop_ci (macos, depot-macos-14)
  • GitHub Check: desktop_ci (linux, depot-ubuntu-24.04-8)
  • GitHub Check: fmt
🔇 Additional comments (20)
owhisper/owhisper-client/Cargo.toml (1)

1-35: Dependency placement and ordering look appropriate.

The base64 dependency is correctly placed in the runtime [dependencies] section rather than [dev-dependencies], and maintains alphabetical ordering within the dependency list. This aligns with the audio-to-message transformation requirements described in the PR summary.

owhisper/owhisper-client/src/adapter/mod.rs (1)

59-61: LGTM! Clean trait extension with a sensible default.

The default implementation returning Message::Binary(audio) maintains backward compatibility for adapters that send raw binary audio, while allowing adapters like OpenAI to override with custom encoding (base64 text).
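
For illustration, the trait extension could look roughly like this (a sketch showing only the new method; `Message` is assumed to be the tungstenite type):

```rust
use tokio_tungstenite::tungstenite::Message;

pub trait RealtimeSttAdapter {
    // ...existing required methods elided...

    /// Convert an audio chunk into the WebSocket message this provider expects.
    /// The binary default keeps Deepgram, AssemblyAI, and Soniox unchanged;
    /// OpenAI overrides this to emit base64 JSON text frames.
    fn audio_to_message(&self, audio: bytes::Bytes) -> Message {
        Message::Binary(audio)
    }
}
```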

owhisper/owhisper-client/src/adapter/openai/mod.rs (3)

6-6: LGTM! Clearer constant naming.

Renaming to DEFAULT_TRANSCRIPTION_MODEL better describes its purpose in the transcription context.


60-84: LGTM! Tests properly updated.

Tests cover the main scenarios: empty base (default URL), proxy path, and localhost handling. All assertions correctly reflect the new intent=transcription parameter behavior.


24-53: Parameter intent=transcription correctly aligns with OpenAI Realtime API requirements for transcription sessions.

The implementation is correct. OpenAI's Realtime API officially supports the intent query parameter with transcription as the value for transcription-only sessions (as opposed to conversation mode). The code properly constructs WebSocket URLs with this parameter and prevents duplicate intent parameters. The change from the model parameter to intent=transcription is the appropriate approach for this use case.
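
A rough sketch of that URL handling, using the `url` crate (the real `build_ws_url_from_base` may differ in signature and error handling):

```rust
use url::Url;

/// Append intent=transcription unless the base URL already carries it,
/// so proxy-style base URLs do not end up with a duplicate parameter.
fn with_transcription_intent(base: &str) -> Url {
    let mut url = Url::parse(base).expect("valid base URL");
    let has_intent = url
        .query_pairs()
        .any(|(k, v)| k == "intent" && v == "transcription");
    if !has_intent {
        url.query_pairs_mut().append_pair("intent", "transcription");
    }
    url
}

// with_transcription_intent("wss://api.openai.com/v1/realtime")
//   => wss://api.openai.com/v1/realtime?intent=transcription
```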

owhisper/owhisper-client/src/adapter/openai/live.rs (6)

19-30: LGTM! Simplified URL construction.

Ignoring unused parameters and delegating to build_ws_url_from_base keeps the implementation clean. Query parameters are properly appended.


61-74: LGTM! Dynamic sample rate configuration.

Using params.sample_rate instead of a hardcoded value allows proper rate handling. Fallback to DEFAULT_TRANSCRIPTION_MODEL is appropriate.
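
A purely illustrative sketch of threading `params.sample_rate` into the session configuration; the JSON field names here are assumptions and should be checked against OpenAI's Realtime docs:

```rust
use serde_json::{json, Value};

/// Build a session.update payload for a transcription-only session.
/// Callers would pass params.sample_rate (24_000 for OpenAI) and
/// params.model.unwrap_or(DEFAULT_TRANSCRIPTION_MODEL).
fn session_update(sample_rate: u32, model: &str) -> Value {
    json!({
        "type": "session.update",
        "session": {
            "type": "transcription",                 // transcription-only session
            "input_audio_format": "pcm16",           // assumed field name
            "input_audio_sample_rate": sample_rate,  // assumed field name; no longer hardcoded
            "input_audio_transcription": { "model": model },
        }
    })
}
```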


130-137: LGTM! New event handling for VAD events.

Handling InputAudioBufferSpeechStarted and InputAudioBufferSpeechStopped events with debug tracing improves observability of the transcription flow.
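
A sketch of the added handlers, with the enum trimmed to just the two new variants (the real enum carries many more):

```rust
#[derive(Debug)]
enum OpenAIEvent {
    InputAudioBufferSpeechStarted,
    InputAudioBufferSpeechStopped,
    // ...other variants elided...
}

/// Server-side VAD boundaries carry no transcript text; trace them for debugging.
fn handle_vad_event(event: &OpenAIEvent) {
    match event {
        OpenAIEvent::InputAudioBufferSpeechStarted => {
            tracing::debug!("openai: input_audio_buffer.speech_started");
        }
        OpenAIEvent::InputAudioBufferSpeechStopped => {
            tracing::debug!("openai: input_audio_buffer.speech_stopped");
        }
    }
}
```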


250-255: LGTM! New struct for audio append events.

InputAudioBufferAppend correctly models the OpenAI event structure with proper serde rename for type field.


274-277: LGTM! Enum variants for speech detection events.

New variants properly map to OpenAI's input_audio_buffer.speech_started and input_audio_buffer.speech_stopped event types.


359-395: LGTM! Tests properly parameterized with sample rate.

Using OPENAI_SAMPLE_RATE = 24000 constant and passing it consistently to ListenParams and run_*_test_with_rate functions ensures OpenAI's required 24kHz sample rate is used.

owhisper/owhisper-client/src/test_utils.rs (4)

26-33: LGTM! Clean backward-compatible refactor.

Renaming to default_sample_rate() and having existing functions delegate to rate-aware variants maintains API compatibility while adding flexibility.
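
A sketch of the delegation pattern described here (helper names beyond `default_sample_rate` and the `_with_rate` suffix are assumptions, as is the default value shown):

```rust
use bytes::Bytes;
use futures_util::stream::{self, Stream};

/// Previously `sample_rate()`; existing callers keep this default.
pub fn default_sample_rate() -> u32 {
    16_000 // assumed default; provider-specific tests pass their own rate
}

/// Rate-parameterized variant used by e.g. the OpenAI test at 24 kHz.
pub fn test_audio_stream_single_with_rate(sample_rate: u32) -> impl Stream<Item = Bytes> {
    // The real helper would chunk a PCM16 fixture resampled to `sample_rate`.
    let chunk_bytes = (sample_rate as usize / 10) * 2; // ~100 ms of 16-bit mono
    stream::iter(vec![Bytes::from(vec![0u8; chunk_bytes])])
}

/// Old entry point, kept for API compatibility, now delegates to the rate-aware helper.
pub fn test_audio_stream_single() -> impl Stream<Item = Bytes> {
    test_audio_stream_single_with_rate(default_sample_rate())
}
```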


35-48: LGTM! Rate-aware audio stream generation.

Parameterizing sample_rate allows tests to match provider-specific requirements (e.g., OpenAI's 24kHz).


74-115: LGTM! Rate-aware test runner for single-channel tests.

Properly uses the sample_rate parameter to generate the appropriate test audio stream.


124-174: LGTM! Rate-aware test runner for dual-channel tests.

Mirrors the single-channel implementation correctly for dual-channel scenarios.

owhisper/owhisper-client/src/live.rs (5)

116-117: LGTM! Clear type alias for transformed messages.

TransformedInput properly represents the audio data after adapter transformation, maintaining the MixedMessage pattern for audio vs control messages.
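
A sketch of the type shapes described in this review, with `MixedMessage` and `ControlMessage` stubbed out since only their role matters here:

```rust
use tokio_tungstenite::tungstenite::Message;

// Stand-ins for the crate's real types.
pub enum MixedMessage<A, C> {
    Audio(A),
    Control(C),
}
pub struct ControlMessage; // placeholder

/// Single-channel input after the adapter has already produced a WebSocket message.
pub type TransformedInput = MixedMessage<Message, ControlMessage>;

/// Dual-channel input: raw mic and speaker chunks plus the pre-transformed message.
pub type TransformedDualInput =
    MixedMessage<(bytes::Bytes, bytes::Bytes, Message), ControlMessage>;
```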


146-147: LGTM! TransformedDualInput captures both raw and transformed data.

The tuple (bytes::Bytes, bytes::Bytes, Message) allows passing both raw mic/speaker audio (for potential interleaving) and the pre-transformed message.


205-216: LGTM! Audio stream transformation centralized through adapter.

Cloning the adapter for the closure and mapping each audio message through audio_to_message cleanly centralizes the encoding logic.
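
A sketch of that mapping step, reusing the stubbed types and trait from the earlier sketches (the real code lives inside the client's stream plumbing):

```rust
use futures_util::{Stream, StreamExt};
use tokio_tungstenite::tungstenite::Message;

fn transform_audio_stream<S, A>(
    adapter: A,
    input: S,
) -> impl Stream<Item = MixedMessage<Message, ControlMessage>>
where
    S: Stream<Item = MixedMessage<bytes::Bytes, ControlMessage>>,
    A: RealtimeSttAdapter,
{
    // The adapter is moved (in the real code, cloned) into the closure so every
    // audio chunk goes through audio_to_message before hitting the WebSocket writer.
    input.map(move |item| match item {
        MixedMessage::Audio(bytes) => MixedMessage::Audio(adapter.audio_to_message(bytes)),
        MixedMessage::Control(ctrl) => MixedMessage::Control(ctrl),
    })
}
```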


257-270: LGTM! Native multichannel transformation with interleaving.

The transformation correctly interleaves mic/speaker audio before calling audio_to_message, then packages all three pieces into TransformedDualInput.


356-376: LGTM! Split-channel forwarding with per-channel transformation.

Each channel's audio is independently transformed via adapter.audio_to_message, correctly handling the split WebSocket case.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
owhisper/owhisper-client/src/live.rs (2)

146-162: Consider removing unused tuple elements from TransformedDualInput.

The TransformedDualInput type carries the original mic and speaker bytes in a tuple alongside the transformed Message (line 146), but these bytes are immediately discarded in to_input at line 157. This means every dual audio chunk carries unnecessary data through the channel.

Simplify the type to avoid the overhead:

-pub type TransformedDualInput = MixedMessage<(bytes::Bytes, bytes::Bytes, Message), ControlMessage>;
+pub type TransformedDualInput = MixedMessage<Message, ControlMessage>;

Then update line 259 in from_realtime_audio_native:

-                TransformedDualInput::Audio((mic, speaker, msg))
+                TransformedDualInput::Audio(msg)

And simplify lines 157-159:

-            TransformedDualInput::Audio((_, _, transform_fn_result)) => {
-                TransformedInput::Audio(transform_fn_result)
+            TransformedDualInput::Audio(msg) => {
+                TransformedInput::Audio(msg)

357-371: Consider consistent error handling or logging for dropped messages.

The function silently ignores send errors at lines 362-363 (try_send) and 366-367 (send). While acceptable in a spawned task with no error propagation path, you might want to:

  1. Use consistent send methods (all try_send or all send().await)
  2. Log dropped audio or control messages for debugging
  3. Add a comment explaining the intentional silent drop

Current behavior is acceptable since channel closure typically indicates WebSocket termination, making message delivery moot.
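
If the drops should be visible, one low-cost option is a helper along these lines (channel and message types assumed; tracing is already used elsewhere in the adapter code):

```rust
use tokio::sync::mpsc;

/// Forward a message, logging instead of silently ignoring a closed or full channel.
fn send_or_log<T>(tx: &mpsc::Sender<T>, msg: T, what: &str) {
    if let Err(err) = tx.try_send(msg) {
        tracing::debug!(error = %err, "dropping {what}; receiver likely gone");
    }
}
```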

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 51ae407 and e1835b4.

📒 Files selected for processing (1)
  • owhisper/owhisper-client/src/live.rs (8 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
owhisper/owhisper-client/src/live.rs (3)
crates/ws/src/client.rs (2)
  • to_input (41-41)
  • to_message (42-42)
owhisper/owhisper-client/src/adapter/parsing.rs (1)
  • speaker (87-90)
owhisper/owhisper-client/src/lib.rs (1)
  • adapter (56-63)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: Redirect rules - hyprnote-storybook
  • GitHub Check: Header rules - hyprnote-storybook
  • GitHub Check: Pages changed - hyprnote-storybook
  • GitHub Check: Redirect rules - hyprnote
  • GitHub Check: Header rules - hyprnote
  • GitHub Check: Pages changed - hyprnote
  • GitHub Check: fmt
  • GitHub Check: desktop_ci (linux, depot-ubuntu-24.04-8)
  • GitHub Check: desktop_ci (macos, depot-macos-14)
  • GitHub Check: desktop_ci (linux, depot-ubuntu-22.04-8)
  • GitHub Check: Devin
🔇 Additional comments (6)
owhisper/owhisper-client/src/live.rs (6)

116-116: LGTM!

The TransformedInput type alias clearly represents the post-transformation state where audio data has been converted to a Message by the adapter.


121-136: LGTM!

The ListenClientIO implementation correctly handles the new TransformedInput type. Audio messages are already transformed by the adapter, so they're passed through directly, while control messages are serialized to JSON.


200-211: LGTM!

The transformation logic correctly routes audio through adapter.audio_to_message() before WebSocket transmission, while control messages pass through unchanged. The adapter clone is necessary for the closure.


252-265: LGTM!

The native multichannel transformation correctly interleaves mic and speaker audio before passing it through adapter.audio_to_message(). The past review concern about redundant interleave_audio calls has been addressed—interleaving now occurs only once here (line 257), not in to_input.


290-311: LGTM!

The split-path channel setup correctly uses TransformedInput types and passes the adapter to forward_dual_to_single for audio transformation.


351-356: LGTM!

The updated signature correctly parameterizes the function with the adapter and uses TransformedInput channel types, enabling audio transformation in the split-path scenario.

@yujonglee yujonglee merged commit 5a56b7a into main Dec 5, 2025
15 of 16 checks passed
@yujonglee yujonglee deleted the devin/1764895521-fix-openai-realtime-test branch December 5, 2025 01:33
