Fix OpenAI Realtime API transcription test #2127
Conversation
- Add intent=transcription to WebSocket URL for transcription-only sessions
- Add session.type = transcription in session.update payload
- Implement audio_to_message method to wrap audio in base64-encoded JSON events
- Add InputAudioBufferAppend struct for proper audio event serialization
- Update live.rs to transform audio stream before passing to WebSocket client
- Add configurable sample rate support (OpenAI requires 24kHz PCM)
- Add speech_started and speech_stopped event handlers for better debugging
- Add base64 dependency for audio encoding

Co-Authored-By: yujonglee <[email protected]>
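To make the audio wrapping concrete, here is a hedged sketch of the encoding described above (struct and field names are taken from the review below; the helper name `wrap_audio_chunk` is hypothetical):

```rust
use base64::Engine;

// Each PCM chunk becomes a JSON text frame roughly like:
//   {"type":"input_audio_buffer.append","audio":"<base64-encoded PCM>"}
#[derive(serde::Serialize)]
struct InputAudioBufferAppend {
    #[serde(rename = "type")]
    event_type: String,
    audio: String,
}

fn wrap_audio_chunk(pcm: &[u8]) -> String {
    let event = InputAudioBufferAppend {
        event_type: "input_audio_buffer.append".to_string(),
        audio: base64::engine::general_purpose::STANDARD.encode(pcm),
    };
    // Two plain String fields; serialization cannot realistically fail.
    serde_json::to_string(&event).expect("serialize input_audio_buffer.append event")
}
```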
📝 Walkthrough

Adds an audio-to-message conversion hook to `RealtimeSttAdapter`, updates the OpenAI adapter to emit base64 JSON audio payloads and intent-based WS URLs, refactors the live client to send transformed `Message` objects (`TransformedInput`/`TransformedDualInput`), and adds the `base64` dependency and rate-aware test helpers.
Sequence diagram(s): mermaid diagram omitted.

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches
- ❌ Failed checks (1 warning)
- ✅ Passed checks (2 passed)
Actionable comments posted: 1
🧹 Nitpick comments (1)
owhisper/owhisper-client/src/adapter/openai/live.rs (1)
40-48: Consider handling the serialization error instead of `.unwrap()`.

While `InputAudioBufferAppend` serialization is unlikely to fail, using `.unwrap()` on line 47 could panic. Consider graceful error handling or document why the panic is acceptable here.

```diff
 fn audio_to_message(&self, audio: bytes::Bytes) -> Message {
     use base64::Engine;
     let base64_audio = base64::engine::general_purpose::STANDARD.encode(&audio);
     let event = InputAudioBufferAppend {
         event_type: "input_audio_buffer.append".to_string(),
         audio: base64_audio,
     };
-    Message::Text(serde_json::to_string(&event).unwrap().into())
+    // Safe: InputAudioBufferAppend contains only String fields which always serialize
+    Message::Text(serde_json::to_string(&event).expect("InputAudioBufferAppend serialization").into())
 }
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
- `Cargo.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (6)
- `owhisper/owhisper-client/Cargo.toml` (1 hunks)
- `owhisper/owhisper-client/src/adapter/mod.rs` (1 hunks)
- `owhisper/owhisper-client/src/adapter/openai/live.rs` (11 hunks)
- `owhisper/owhisper-client/src/adapter/openai/mod.rs` (4 hunks)
- `owhisper/owhisper-client/src/live.rs` (8 hunks)
- `owhisper/owhisper-client/src/test_utils.rs` (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
- owhisper/owhisper-client/src/adapter/mod.rs (1)
  - owhisper/owhisper-client/src/adapter/openai/live.rs (1)
    - `audio_to_message` (40-48)
- owhisper/owhisper-client/src/adapter/openai/live.rs (3)
  - owhisper/owhisper-client/src/adapter/mod.rs (2)
    - `build_ws_url` (41-41)
    - `audio_to_message` (59-61)
  - owhisper/owhisper-client/src/adapter/openai/mod.rs (1)
    - `build_ws_url_from_base` (24-53)
  - owhisper/owhisper-client/src/test_utils.rs (2)
    - `run_dual_test_with_rate` (124-174)
    - `run_single_test_with_rate` (74-115)
- owhisper/owhisper-client/src/live.rs (2)
  - crates/ws/tests/client_tests.rs (2)
    - `to_input` (24-26)
    - `to_message` (28-30)
  - crates/ws/src/client.rs (2)
    - `to_input` (41-41)
    - `to_message` (42-42)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
- GitHub Check: Devin
- GitHub Check: Redirect rules - hyprnote
- GitHub Check: Header rules - hyprnote
- GitHub Check: Pages changed - hyprnote
- GitHub Check: desktop_ci (linux, depot-ubuntu-22.04-8)
- GitHub Check: desktop_ci (macos, depot-macos-14)
- GitHub Check: desktop_ci (linux, depot-ubuntu-24.04-8)
- GitHub Check: fmt
🔇 Additional comments (20)
owhisper/owhisper-client/Cargo.toml (1)
1-35: Dependency placement and ordering look appropriate.

The `base64` dependency is correctly placed in the runtime `[dependencies]` section rather than `[dev-dependencies]`, and maintains alphabetical ordering within the dependency list. This aligns with the audio-to-message transformation requirements described in the PR summary.

owhisper/owhisper-client/src/adapter/mod.rs (1)
59-61: LGTM! Clean trait extension with a sensible default.

The default implementation returning `Message::Binary(audio)` maintains backward compatibility for adapters that send raw binary audio, while allowing adapters like OpenAI to override with custom encoding (base64 text).
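A minimal sketch of what such a trait hook could look like, assuming a tungstenite-style `Message` whose `Binary` variant accepts `bytes::Bytes` (as quoted above) and ignoring the trait's other methods:

```rust
use tokio_tungstenite::tungstenite::Message; // assumed WebSocket message type

// Sketch only: the real RealtimeSttAdapter trait in adapter/mod.rs has more methods
// (e.g. build_ws_url); shown here is just the new hook and its default.
pub trait RealtimeSttAdapter {
    /// Convert one raw audio chunk into the message the provider expects.
    /// Default: forward the bytes untouched as a binary frame.
    fn audio_to_message(&self, audio: bytes::Bytes) -> Message {
        Message::Binary(audio)
    }
}
```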
owhisper/owhisper-client/src/adapter/openai/mod.rs (3)

6-6: LGTM! Clearer constant naming.

Renaming to `DEFAULT_TRANSCRIPTION_MODEL` better describes its purpose in the transcription context.
60-84: LGTM! Tests properly updated.

Tests cover the main scenarios: empty base (default URL), proxy path, and localhost handling. All assertions correctly reflect the new `intent=transcription` parameter behavior.
24-53: Parameter `intent=transcription` correctly aligns with OpenAI Realtime API requirements for transcription sessions.

The implementation is correct. OpenAI's Realtime API officially supports the `intent` query parameter with `transcription` as the value for transcription-only sessions (as opposed to `conversation` mode). The code properly constructs WebSocket URLs with this parameter and prevents duplicate intent parameters. The change from the `model` parameter to `intent=transcription` is the appropriate approach for this use case.
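For illustration, a simplified sketch of this kind of URL construction, assuming the standard `wss://api.openai.com/v1/realtime` endpoint and the `url` crate; the real `build_ws_url_from_base` also handles proxy paths and localhost:

```rust
use url::Url;

// Hedged sketch, not the repo's implementation: fall back to the public endpoint
// when no base is given, and append intent=transcription unless it is already set.
fn build_transcription_ws_url(base: &str) -> Url {
    let base = if base.is_empty() {
        "wss://api.openai.com/v1/realtime"
    } else {
        base
    };
    let mut url = Url::parse(base).expect("valid base URL");
    if !url.query_pairs().any(|(k, _)| k == "intent") {
        url.query_pairs_mut().append_pair("intent", "transcription");
    }
    url
}
```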
owhisper/owhisper-client/src/adapter/openai/live.rs (6)

19-30: LGTM! Simplified URL construction.

Ignoring unused parameters and delegating to `build_ws_url_from_base` keeps the implementation clean. Query parameters are properly appended.
61-74: LGTM! Dynamic sample rate configuration.

Using `params.sample_rate` instead of a hardcoded value allows proper rate handling. Fallback to `DEFAULT_TRANSCRIPTION_MODEL` is appropriate.
130-137: LGTM! New event handling for VAD events.

Handling `InputAudioBufferSpeechStarted` and `InputAudioBufferSpeechStopped` events with debug tracing improves observability of the transcription flow.
250-255: LGTM! New struct for audio append events.

`InputAudioBufferAppend` correctly models the OpenAI event structure with proper serde rename for the `type` field.
274-277: LGTM! Enum variants for speech detection events.

New variants properly map to OpenAI's `input_audio_buffer.speech_started` and `input_audio_buffer.speech_stopped` event types.
359-395: LGTM! Tests properly parameterized with sample rate.

Using the `OPENAI_SAMPLE_RATE = 24000` constant and passing it consistently to `ListenParams` and the `run_*_test_with_rate` functions ensures OpenAI's required 24kHz sample rate is used.

owhisper/owhisper-client/src/test_utils.rs (4)
26-33: LGTM! Clean backward-compatible refactor.

Renaming to `default_sample_rate()` and having existing functions delegate to rate-aware variants maintains API compatibility while adding flexibility.
35-48: LGTM! Rate-aware audio stream generation.

Parameterizing `sample_rate` allows tests to match provider-specific requirements (e.g., OpenAI's 24kHz).
74-115: LGTM! Rate-aware test runner for single-channel tests.

Properly uses the `sample_rate` parameter to generate the appropriate test audio stream.
124-174: LGTM! Rate-aware test runner for dual-channel tests.

Mirrors the single-channel implementation correctly for dual-channel scenarios.
owhisper/owhisper-client/src/live.rs (5)
116-117: LGTM! Clear type alias for transformed messages.

`TransformedInput` properly represents the audio data after adapter transformation, maintaining the `MixedMessage` pattern for audio vs control messages.
146-147: LGTM! `TransformedDualInput` captures both raw and transformed data.

The tuple `(bytes::Bytes, bytes::Bytes, Message)` allows passing both raw mic/speaker audio (for potential interleaving) and the pre-transformed message.
205-216: LGTM! Audio stream transformation centralized through adapter.

Cloning the adapter for the closure and mapping each audio message through `audio_to_message` cleanly centralizes the encoding logic.
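A rough sketch of that mapping step; the stream combinator and the `MixedMessage` variant names are assumed from the types discussed in this review, not taken from the actual `live.rs` code:

```rust
use futures_util::{Stream, StreamExt};

// Hedged sketch: route audio chunks through the adapter, pass control messages through.
fn transform_input<S, A>(adapter: A, input: S) -> impl Stream<Item = TransformedInput>
where
    S: Stream<Item = MixedMessage<bytes::Bytes, ControlMessage>>,
    A: RealtimeSttAdapter,
{
    input.map(move |msg| match msg {
        // Audio becomes whatever the provider expects (binary frame, base64 JSON, ...).
        MixedMessage::Audio(bytes) => MixedMessage::Audio(adapter.audio_to_message(bytes)),
        // Control messages are forwarded unchanged.
        MixedMessage::Control(ctrl) => MixedMessage::Control(ctrl),
    })
}
```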
257-270: LGTM! Native multichannel transformation with interleaving.

The transformation correctly interleaves mic/speaker audio before calling `audio_to_message`, then packages all three pieces into `TransformedDualInput`.
356-376: LGTM! Split-channel forwarding with per-channel transformation.

Each channel's audio is independently transformed via `adapter.audio_to_message`, correctly handling the split WebSocket case.
Co-Authored-By: yujonglee <[email protected]>
Actionable comments posted: 0
🧹 Nitpick comments (2)
owhisper/owhisper-client/src/live.rs (2)
146-162: Consider removing unused tuple elements from `TransformedDualInput`.

The `TransformedDualInput` type carries the original `mic` and `speaker` bytes in a tuple alongside the transformed `Message` (line 146), but these bytes are immediately discarded in `to_input` at line 157. This means every dual audio chunk carries unnecessary data through the channel.

Simplify the type to avoid the overhead:

```diff
-pub type TransformedDualInput = MixedMessage<(bytes::Bytes, bytes::Bytes, Message), ControlMessage>;
+pub type TransformedDualInput = MixedMessage<Message, ControlMessage>;
```

Then update line 259 in `from_realtime_audio_native`:

```diff
-    TransformedDualInput::Audio((mic, speaker, msg))
+    TransformedDualInput::Audio(msg)
```

And simplify lines 157-159:

```diff
-    TransformedDualInput::Audio((_, _, transform_fn_result)) => {
-        TransformedInput::Audio(transform_fn_result)
+    TransformedDualInput::Audio(msg) => {
+        TransformedInput::Audio(msg)
```
357-371: Consider consistent error handling or logging for dropped messages.

The function silently ignores send errors at lines 362-363 (`try_send`) and 366-367 (`send`). While acceptable in a spawned task with no error propagation path, you might want to:

- Use consistent send methods (all `try_send` or all `send().await`)
- Log dropped audio or control messages for debugging
- Add a comment explaining the intentional silent drop

Current behavior is acceptable since channel closure typically indicates WebSocket termination, making message delivery moot.
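If logging is preferred over a silent drop, a small sketch of what that could look like, assuming a tokio mpsc sender and the `tracing` crate (the `audio_tx` and `msg` names are placeholders):

```rust
// Hedged sketch: surface dropped audio instead of ignoring the send error.
if let Err(e) = audio_tx.try_send(TransformedInput::Audio(msg)) {
    tracing::warn!(error = %e, "dropping transformed audio; channel full or closed");
}
```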
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- `owhisper/owhisper-client/src/live.rs` (8 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
- owhisper/owhisper-client/src/live.rs (3)
  - crates/ws/src/client.rs (2)
    - `to_input` (41-41)
    - `to_message` (42-42)
  - owhisper/owhisper-client/src/adapter/parsing.rs (1)
    - `speaker` (87-90)
  - owhisper/owhisper-client/src/lib.rs (1)
    - `adapter` (56-63)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
- GitHub Check: Redirect rules - hyprnote-storybook
- GitHub Check: Header rules - hyprnote-storybook
- GitHub Check: Pages changed - hyprnote-storybook
- GitHub Check: Redirect rules - hyprnote
- GitHub Check: Header rules - hyprnote
- GitHub Check: Pages changed - hyprnote
- GitHub Check: fmt
- GitHub Check: desktop_ci (linux, depot-ubuntu-24.04-8)
- GitHub Check: desktop_ci (macos, depot-macos-14)
- GitHub Check: desktop_ci (linux, depot-ubuntu-22.04-8)
- GitHub Check: Devin
🔇 Additional comments (6)
owhisper/owhisper-client/src/live.rs (6)
116-116: LGTM!

The `TransformedInput` type alias clearly represents the post-transformation state where audio data has been converted to a `Message` by the adapter.
121-136: LGTM!

The `ListenClientIO` implementation correctly handles the new `TransformedInput` type. Audio messages are already transformed by the adapter, so they're passed through directly, while control messages are serialized to JSON.
200-211: LGTM!

The transformation logic correctly routes audio through `adapter.audio_to_message()` before WebSocket transmission, while control messages pass through unchanged. The adapter clone is necessary for the closure.
252-265: LGTM!

The native multichannel transformation correctly interleaves mic and speaker audio before passing it through `adapter.audio_to_message()`. The past review concern about redundant `interleave_audio` calls has been addressed—interleaving now occurs only once here (line 257), not in `to_input`.
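For readers unfamiliar with the step, a generic sketch of interleaving two mono 16-bit little-endian PCM buffers into one stereo buffer; the repo's `interleave_audio` may differ in details:

```rust
// Generic sketch, not the repo's interleave_audio: mic becomes the left channel,
// speaker the right, sample by sample.
fn interleave_pcm16(mic: &[u8], speaker: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(mic.len() + speaker.len());
    for (m, s) in mic.chunks_exact(2).zip(speaker.chunks_exact(2)) {
        out.extend_from_slice(m); // left (mic) sample
        out.extend_from_slice(s); // right (speaker) sample
    }
    out
}
```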
290-311: LGTM!

The split-path channel setup correctly uses `TransformedInput` types and passes the adapter to `forward_dual_to_single` for audio transformation.
351-356: LGTM!

The updated signature correctly parameterizes the function with the adapter and uses `TransformedInput` channel types, enabling audio transformation in the split-path scenario.
Fix OpenAI Realtime API transcription test
Summary
Fixes the failing OpenAI Realtime API transcription test by implementing the correct API protocol for transcription-only sessions.
Key changes:
- `intent=transcription` URL parameter instead of `model` parameter for transcription sessions
- Audio sent as base64-encoded JSON events (`input_audio_buffer.append`) instead of raw binary WebSocket messages
- New `audio_to_message` method on the `RealtimeSttAdapter` trait with default binary passthrough for backward compatibility
- Configurable sample rate via `params.sample_rate`
- `speech_started`/`speech_stopped` event handlers for debugging

Updates since last revision
- Removed the redundant `interleave_audio` call in `ListenClientDualIO::to_input` (code review feedback)

Review & Testing Checklist for Human
- Changes to `live.rs` modify how audio streams are transformed before sending to WebSocket. Run tests for Deepgram, AssemblyAI, and Soniox adapters to ensure the default `audio_to_message` implementation (raw binary) maintains backward compatibility
- The `TransformedDualInput` type in `live.rs:145-162` carries pre-transformed messages - verify this logic is correct for native multichannel providers
- Verify the `input_audio_buffer.append` JSON structure with base64 audio matches OpenAI's expected format per their docs

Recommended test plan:
Notes
- `TEST_TIMEOUT_SECS=30` is used because OpenAI's VAD needs time to detect speech boundaries before returning transcription
- Added `base64` dependency (v0.22.1, matching workspace version)

Link to Devin run: https://app.devin.ai/sessions/0e8cdca88bb14e52a1b645f66978d1f7
Requested by: yujonglee (@yujonglee)