Skip to content

Feature/speech to text#82

Merged
g4rb4g3 merged 20 commits into
masterfrom
feature/speech-to-text
Jun 11, 2026
Merged

Feature/speech to text#82
g4rb4g3 merged 20 commits into
masterfrom
feature/speech-to-text

Conversation

@g4rb4g3

@g4rb4g3 g4rb4g3 commented May 15, 2026

Copy link
Copy Markdown
Contributor

This PR adds speech to text support

@SamuelBrucksch

Copy link
Copy Markdown
Contributor

@claude please review this PR

@claude

claude Bot commented May 19, 2026

Copy link
Copy Markdown

Claude finished @SamuelBrucksch's task in 2m 43s —— View job


PR Review: Feature/speech-to-text

  • Gather context and understand the PR
  • Review new TypeScript specs and types
  • Review new HybridVoice module (TS wrapper)
  • Review iOS native implementation changes
  • Review Android native implementation changes
  • Review example app changes
  • Review generated nitrogen code (spot check)
  • Post comprehensive review

Summary

This PR extracts voice input functionality from HybridAutoPlay into a dedicated HybridVoice NitroModule, and adds speech-to-text (STT) support alongside the existing raw PCM recording. The new API supports streaming partial results via an onChunk callback, and falls back to PCM when STT is unavailable. This is a well-structured refactor with a clean separation of concerns.


Architecture & Design — Looks Good

  • Clean module split: Voice functionality is properly extracted into its own Voice NitroModule with a dedicated spec, hybrid wrapper, and native implementations on both platforms. This is a better separation than having it embedded in AutoPlay.
  • Fallback strategy is solid: Both platforms gracefully fall back from STT → PCM when speech recognition is unavailable (missing authorization, recognizer not available, API level too low).
  • Streaming API design: The onChunk callback with VoiceInputChunk (containing either partial transcription or audio data) is a clean abstraction. The TypeScript overloads in HybridVoice.ts that distinguish the return type based on whether onChunk is provided are well-typed.

Issues & Suggestions

1. iOS: Thread safety of samples array in audio tap (Medium)

File: packages/react-native-autoplay/ios/utils/VoiceInputManager.swift

The samples array is mutated from two different threads:

  • Audio tap callback (real-time audio thread, line 233): self.samples.append(contentsOf: newSamples)
  • stop() method (via stopLock, line 107): reads and clears samples

While stop() properly uses stopLock to synchronize its read of samples, the audio tap callback writes to samples without acquiring stopLock. The isStopping flag check (line 208) may provide some implicit ordering, but there's a TOCTOU window: the tap could check isStopping == false, then stop() captures samples, then the tap appends — potentially causing a data race on Swift's Array (which is not thread-safe).

Similarly, recognitionTask's error handler (line 161-162) reads and clears samples under stopLock, but the audio tap could be appending concurrently.

Suggestion: Either protect samples.append with stopLock, or switch to a lock-free approach with a concurrent queue.

Fix this →


2. iOS: hasVoiceInputPermission now requires Speech authorization — breaking change? (Medium)

File: packages/react-native-autoplay/ios/hybrid/HybridVoice.swift, lines 8-11

The old hasVoiceInputPermission() only checked microphone permission. The new implementation checks both microphone AND speech recognition:

let micGranted = AVAudioSession.sharedInstance().recordPermission == .granted
let speechGranted = SFSpeechRecognizer.authorizationStatus() == .authorized
return micGranted && speechGranted

This means hasVoiceInputPermission() will return false for users who previously had only microphone permission, even if they only want PCM recording (not STT). Callers using preferSpeechToText: false still need to pass the permission check.

Suggestion: Consider either:

  • Only requiring speech authorization when the caller intends to use STT, or
  • Documenting this as a known behavioral change and requiring consumers to update their permission flow.

Fix this →


3. iOS: STT error handler may double-cleanup (Low)

File: packages/react-native-autoplay/ios/utils/VoiceInputManager.swift, lines 155-168

When the recognitionTask error callback fires, it calls self.cleanup(interfaceController:) which tears down the audio engine and calls recognitionRequest = nil. However, the audio engine tap might still be running at the moment cleanup is entered. The ResultBox protects against double-resume, which is good — but the engine could still deliver a buffer between the error callback's cleanup and the tap actually being removed. This is benign (the tap's guard !self.isStopping will bail), but the isStopping flag is not set in the error path.

Suggestion: Set self.isStopping = true in the error handler before calling cleanup, similar to how the isFinal path does it (line 173).

Fix this →


4. Android: startSTT error returns empty result instead of throwing (Low)

File: packages/react-native-autoplay/android/.../VoiceInputManager.kt, lines 108-112

When SpeechRecognizer.onError fires, the code silently returns an empty VoiceInputResult(transcription = null, audio = null) instead of indicating failure. This means callers can't distinguish "no speech detected" from "recognition service failed."

override fun onError(error: Int) {
    activeSpeechRecognizer = null
    recognizer.destroy()
    cont.resume(VoiceInputResult(transcription = null, audio = null))
}

The same pattern exists in startSTTWithSource (line 219). This is arguably a design choice, but consumers might want to know the difference.

Suggestion: Consider at least logging the error code, or adding an optional error field to VoiceInputResult.


5. Android: recordPCM uses suspendCancellableCoroutine but doesn't handle cancellation cleanly (Low)

File: packages/react-native-autoplay/android/.../VoiceInputManager.kt, line 288

The coroutine uses suspendCancellableCoroutine but doesn't register an invokeOnCancellation handler for the PCM recording path. If the coroutine is cancelled externally, releaseResources() won't be called until the finally block in the recording loop runs — but the loop checks isRecording which won't be set to false by cancellation.

Suggestion: Add cont.invokeOnCancellation { stop() } after setting up the recording, to mirror the cancellation handling in the STT paths.


6. iOS: HybridVoice.voiceInputManager is instance-level, not static (Low)

File: packages/react-native-autoplay/ios/hybrid/HybridVoice.swift, line 6

The old code stored voiceInputManager as a static property on HybridAutoPlay. The new HybridVoice stores it as an instance property. If NitroModules creates a new HybridVoice instance per JS call to createHybridObject, then calling stopVoiceInput() on a different instance than the one that started recording would be a no-op (since voiceInputManager would be nil).

This seems fine since the JS side caches the hybrid object in _native (line 5 of HybridVoice.ts), but worth verifying that NitroModules reuses the same native instance.


7. VoiceInputOptions type is defined but only used in the TS wrapper (Nit)

File: packages/react-native-autoplay/src/types/Voice.ts, lines 11-17

VoiceInputOptions is a nice TS-side convenience type exported from index.ts. This is fine — just noting that the nitro spec uses positional params while the wrapper uses the options pattern, which is the right call for ergonomics.


What Looks Good

  • ResultBox pattern (iOS): The lock-protected continuation wrapper is a solid approach to preventing double-resume from concurrent stop/recognition callbacks.
  • startSTTFromCarAudio pipe approach (Android): Feeding CarAudioRecord PCM through a pipe to SpeechRecognizer via EXTRA_AUDIO_SOURCE (API 33+) is clever and well-implemented, with proper cleanup in the finally block.
  • ThreadUtil simplification (Android): The refactor from Handler(Looper.getMainLooper()) to withContext(Dispatchers.Main) is cleaner.
  • Backward compatibility preserved on AutoPlay spec: addListenerVoiceInput remains on the AutoPlay spec (as an Android-only OS-triggered listener), while the new active voice API lives on Voice. Good separation.
  • Example app updated: Both iOS and Android header buttons now use HybridVoice with preferSpeechToText: true, exercising the new API.

Breaking Changes to Note

This PR is a breaking change for consumers who used the old voice API on HybridAutoPlay:

  • HybridAutoPlay.hasVoiceInputPermission(), requestVoiceInputPermission(), startVoiceInput(), and stopVoiceInput() have been removed
  • Consumers must switch to HybridVoice.* (newly exported from the package)
  • The return type of startVoiceInput changed from Promise<ArrayBuffer> to Promise<VoiceInputResult> (containing either transcription or audio)
  • hasVoiceInputPermission on iOS now requires speech recognition authorization in addition to microphone permission

Consider documenting migration steps in the release notes.


@g4rb4g3 g4rb4g3 marked this pull request as ready for review June 5, 2026 10:32
@g4rb4g3 g4rb4g3 requested a review from SamuelBrucksch June 5, 2026 10:33
@SamuelBrucksch

Copy link
Copy Markdown
Contributor

@claude please review

@claude

claude Bot commented Jun 5, 2026

Copy link
Copy Markdown

Claude finished @SamuelBrucksch's task in 5m 19s —— View job


PR Review: Feature/speech-to-text

  • Gather context and read changed source files
  • Review TypeScript specs and types
  • Review HybridVoice TS wrapper
  • Review iOS native implementation (HybridVoice.swift, VoiceInputManager.swift)
  • Review Android native implementation (HybridVoice.kt, VoiceInputManager.kt)
  • Review utility and example app changes
  • Post comprehensive review

I note the previous Claude review was addressed in part — the samples thread-safety issue (old #1) was fixed in 32bab2c. The remaining issues below are what's still present plus one new finding.


Issues

1. iOS: STT error handler still missing isStopping = true (Medium)

File: packages/react-native-autoplay/ios/utils/VoiceInputManager.swift, lines 161–170

The recognitionTask error callback grabs samples under stopLock and calls cleanup(), but never sets isStopping = true. The audio tap (line 211) checks !self.isStopping and can still process buffers, potentially triggering triggerAutoStop()stop() after the error handler has already cleaned up and resumed the continuation. ResultBox prevents a double-resume crash, but stop() would attempt redundant operations on already-nil'd state (e.g., recognitionRequest?.endAudio() on a nil reference).

Fix: Set self.isStopping = true inside the stopLock block in the error handler, matching the isFinal path at line 177:

if error != nil {
    self.stopLock.lock()
    self.isStopping = true  // ← add this
    let capturedSamples = self.samples
    self.samples = []
    self.stopLock.unlock()
    // ...
}

Fix this →


2. iOS: hasVoiceInputPermission() now gates on speech authorization — breaks PCM-only callers (Medium)

File: packages/react-native-autoplay/ios/hybrid/HybridVoice.swift, lines 8–11

let micGranted = AVAudioSession.sharedInstance().recordPermission == .granted
let speechGranted = SFSpeechRecognizer.authorizationStatus() == .authorized
return micGranted && speechGranted

Callers using preferSpeechToText: false (PCM-only) will fail the permission check if speech recognition isn't authorized. The JSDoc on the TS wrapper documents this, but the native behavior is misleading — a user who only wants raw PCM shouldn't need speech authorization. The Android side only checks RECORD_AUDIO.

Suggestion: Only require speech authorization when the caller intends to use STT (accept a parameter), or split into separate permission checks.

Fix this →


3. Android: onError silently returns empty result — caller can't distinguish failure from silence (Medium)

Files: packages/react-native-autoplay/android/.../VoiceInputManager.kt, lines 110–114 and 222–226

Both startSTT and startSTTWithSource resume with VoiceInputResult(transcription = null, audio = null) on error. This is indistinguishable from "speech recognition worked but detected nothing."

override fun onError(error: Int) {
    activeSpeechRecognizer = null
    recognizer.destroy()
    cont.resume(VoiceInputResult(transcription = null, audio = null))
}

At minimum, log the error code (SpeechRecognizer.ERROR_*). Consider falling back to PCM in startSTT (the non-car-audio path) similar to how iOS falls back.


4. iOS: recognitionRequest accessed without synchronization across threads (Low)

File: packages/react-native-autoplay/ios/utils/VoiceInputManager.swift

stop() calls recognitionRequest?.endAudio() at line 115 (outside stopLock), while cleanup() sets recognitionRequest = nil at line 297. These can execute concurrently — stop() from triggerAutoStop on a global queue, and cleanup() from the recognition task callback. Swift optional access is not atomic. In practice, optional chaining captures the reference before the nil assignment, so this is unlikely to crash, but it's technically a data race.

Suggestion: Either read recognitionRequest under stopLock in stop(), or make recognitionRequest an Atomic / protect it with the existing lock.


5. Android: startSTTFromCarAudio swallows pipe write errors, recording continues uselessly (Low)

File: packages/react-native-autoplay/android/.../VoiceInputManager.kt, lines 177–184

recordPCM(silenceThresholdMs, maxDurationMs) { chunk ->
    chunk.audio?.let { ab ->
        try {
            pipeOut.write(ab.toByteArray())
        } catch (_: Exception) {
        }
    }
}

If the recognizer crashes and closes the read end of the pipe, pipeOut.write() throws but the exception is swallowed. Recording continues accumulating PCM data that goes nowhere until silence detection or max duration stops it. Consider setting isRecording = false in the catch to stop early.


Previously Flagged — Lower Priority / Mitigated

  • Android recordPCM missing invokeOnCancellation: Mitigated by HybridVoice.kt's finally block calling dispose(), which cancels the scope.
  • HybridVoice instance vs static voiceInputManager: Mitigated by the JS wrapper caching the native object in _native.

Architecture

The refactor from embedding voice in AutoPlay to a dedicated Voice NitroModule is clean. The ResultBox pattern for safe continuation resumption, the pipe-based car audio → SpeechRecognizer bridge on Android (API 33+), and the TypeScript overloads on startVoiceInput are all well done.

@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

No significant issues found.

@g4rb4g3

g4rb4g3 commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

@SamuelBrucksch I fixed all of them except for 2, the permission request is fine like this imho. STT isn't a worse then raw mic access anyways.

Comment thread apps/example/src/App.tsx
title="record audio"
onPress={async () => {
const granted = await HybridAutoPlay.requestVoiceInputPermission();
const granted = await HybridVoice.requestVoiceInputPermission();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just minor cosmetics, but for consistency you could use the same code style as in AutoTemplate.ts -> requestVoiceInputPermission.then(() => ...)

@g4rb4g3 g4rb4g3 merged commit 07848a7 into master Jun 11, 2026
5 checks passed
@g4rb4g3 g4rb4g3 deleted the feature/speech-to-text branch June 11, 2026 11:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants