Build a React Native Speech to Text Dictation App

Voice interfaces have become essential in modern mobile applications. From dictating messages to voice commands and accessibility features, speech recognition enables more natural user interactions. This guide walks through building a complete dictation application with React Native.

Understanding Speech Recognition in React Native

How Speech-to-Text Technology Works

Speech recognition technology converts spoken language into written text through a process involving audio capture, signal processing, and pattern matching. The technology has evolved from basic keyword detection to sophisticated language understanding that handles accents, context, and natural speech patterns.

The recognition process begins with capturing audio through the device's microphone. This audio signal undergoes processing to remove noise and extract meaningful features. The processed audio is then compared against acoustic models and language models to determine the most likely text representation.
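The final scoring step can be illustrated with a toy example. The sketch below is not how any particular engine is implemented. It assumes each candidate transcription already carries a hypothetical acoustic log-probability and language-model log-probability, and it simply picks the candidate with the best weighted sum. The `Hypothesis` type, the `lmWeight` parameter, and the scores are illustrative only.

```typescript
// Toy illustration of combining acoustic and language model scores.
// All names and numbers here are hypothetical.
interface Hypothesis {
  text: string;
  acousticLogProb: number; // how well the audio matches the words
  languageLogProb: number; // how plausible the word sequence is
}

function pickBestHypothesis(
  candidates: Hypothesis[],
  lmWeight: number = 1.0,
): Hypothesis | undefined {
  let best: Hypothesis | undefined;
  let bestScore = -Infinity;
  for (const candidate of candidates) {
    const score = candidate.acousticLogProb + lmWeight * candidate.languageLogProb;
    if (score > bestScore) {
      bestScore = score;
      best = candidate;
    }
  }
  return best;
}

const winner = pickBestHypothesis([
  { text: 'recognize speech', acousticLogProb: -4.1, languageLogProb: -2.0 },
  { text: 'wreck a nice beach', acousticLogProb: -3.9, languageLogProb: -6.5 },
]);
console.log(winner?.text); // "recognize speech"
```

The second candidate fits the audio slightly better, but the language model score makes the first one the more likely transcription overall.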

Available Speech Recognition Libraries

The React Native ecosystem offers several libraries for implementing speech recognition, each with different capabilities and trade-offs:

react-native-voice (published as @react-native-voice/voice): The most widely used solution, providing a unified interface to the native speech recognition APIs on both iOS and Android
expo-speech-recognition: Expo-compatible alternative that wraps iOS SFSpeechRecognizer, Android SpeechRecognizer, and Web SpeechRecognition API
Platform-specific solutions: Custom native modules for advanced features beyond standard library capabilities

As covered in the LogRocket tutorial on React Native speech-to-text, the react-native-voice library provides comprehensive methods for starting and stopping recognition with event callbacks for results, partial results, errors, and recognition end.

For developers building comprehensive mobile solutions, integrating speech recognition alongside custom web development services creates powerful cross-platform experiences that leverage native device capabilities.

Comparing Speech Recognition Libraries

Choose the right library for your React Native project

react-native-voice

Most widely used cross-platform library. Uses Android SpeechRecognizer and iOS SFSpeechRecognizer. Provides start, stop, cancel, and destroy methods with comprehensive event callbacks.

expo-speech-recognition

Expo-managed workflow compatible. Implements native speech recognition for iOS, Android, and Web. Easier setup within Expo ecosystem.

Custom Native Modules

For advanced features like custom vocabulary, confidence scoring, and detailed analytics. Requires native iOS/Android development.

On-Device vs Cloud-Based Recognition

The choice between on-device and cloud-based speech recognition involves several factors:

Cloud-based recognition typically offers higher accuracy using large-scale language models. Services like Google Speech-to-Text and Amazon Transcribe handle complex vocabulary, multiple languages, and various accents effectively. However, they require network connectivity and may have usage costs.

On-device recognition has advanced significantly with compact neural network models. Benefits include lower latency because no network round trip is needed, privacy since audio never leaves the device, and offline functionality. As noted in Picovoice's React Native speech recognition guide, modern on-device models achieve competitive accuracy for many use cases while providing enhanced privacy and reduced latency.

For applications requiring both approaches, a hybrid strategy uses on-device recognition for immediate feedback while sending audio to cloud services for higher accuracy when needed. This approach aligns with modern AI automation strategies that optimize for both performance and accuracy.
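One way to picture the hybrid strategy is as a small decision function. The sketch below assumes each recognizer returns a transcript with a confidence score between 0 and 1. Those scores, the `Transcript` type, and the `minConfidence` threshold are hypothetical, since react-native-voice does not expose confidence values directly.

```typescript
// Hypothetical result shape; real recognizers may not report confidence.
interface Transcript {
  text: string;
  confidence: number; // assumed range 0..1
  source: 'on-device' | 'cloud';
}

// Keep the on-device result when it is confident enough or when no cloud
// result is available (offline, timeout); otherwise take the better of the two.
function selectTranscript(
  onDevice: Transcript,
  cloud: Transcript | null,
  minConfidence: number = 0.8,
): Transcript {
  if (cloud === null || onDevice.confidence >= minConfidence) {
    return onDevice;
  }
  return cloud.confidence > onDevice.confidence ? cloud : onDevice;
}

const chosen = selectTranscript(
  { text: 'send the report', confidence: 0.55, source: 'on-device' },
  { text: 'send the report to Dana', confidence: 0.93, source: 'cloud' },
);
console.log(chosen.source); // "cloud"
```

In practice the on-device text would be shown immediately and replaced only if the cloud result arrives and wins this comparison.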

Setting Up the Project

Installing Dependencies

Building a speech-to-text dictation app requires installing appropriate dependencies:

# Installing react-native-voice
npm install @react-native-voice/voice

# iOS: Navigate to ios directory and run pod install
cd ios && pod install && cd ..

# Android: no extra install step is needed, but verify that the permissions described below are present in AndroidManifest.xml

Platform-Specific Configuration

Android requires declaring the android.permission.RECORD_AUDIO permission and potentially android.permission.INTERNET for cloud-based recognition. The SpeechRecognizer service availability should be checked during application initialization.

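iOS requires two usage descriptions in Info.plist: NSMicrophoneUsageDescription for microphone access and NSSpeechRecognitionUsageDescription for speech recognition authorization. SFSpeechRecognizer may send audio to Apple's servers, so recognition can require an active network connection unless on-device recognition is available for the chosen locale.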

<!-- Android permissions in AndroidManifest.xml -->
<uses-permission android:name="android.permission.RECORD_AUDIO" />
<uses-permission android:name="android.permission.INTERNET" />

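<!-- iOS permissions in Info.plist -->
<key>NSMicrophoneUsageDescription</key>
<string>This app needs microphone access for speech recognition.</string>
<key>NSSpeechRecognitionUsageDescription</key>
<string>This app uses speech recognition to convert your speech to text.</string>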

As documented in the LogRocket implementation guide, proper permission handling is critical for speech recognition functionality to work correctly on both platforms.

Implementing Voice Recognition

Requesting Permissions

Proper permission handling is essential for speech recognition functionality:

import Voice from '@react-native-voice/voice';
import { Platform, PermissionsAndroid } from 'react-native';

const requestPermissions = async () => {
  if (Platform.OS === 'android') {
    const granted = await PermissionsAndroid.request(
      PermissionsAndroid.PERMISSIONS.RECORD_AUDIO,
      {
        title: 'Microphone Permission',
        message: 'This app needs microphone access for speech recognition.',
        buttonNeutral: 'Ask Me Later',
        buttonNegative: 'Cancel',
        buttonPositive: 'OK',
      }
    );
    return granted === PermissionsAndroid.RESULTS.GRANTED;
  }

  if (Platform.OS === 'ios') {
    // No explicit request call is made here: iOS presents the microphone and
    // speech recognition prompts the first time Voice.start() runs, as long as
    // both usage descriptions are present in Info.plist. A denial then surfaces
    // as a rejected start() call or an onSpeechError event.
    return true;
  }
  return false;
};

Key points:

Android requires RECORD_AUDIO permission
iOS requires both microphone and speech recognition authorization
Permission requests should occur when user first attempts voice input
Well-crafted descriptions increase user trust and permission grants

The iOS speech recognition permission uses system caching, so subsequent requests will not prompt again if the user previously denied access. Applications should handle denial gracefully by providing alternative input methods.
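Handling denial gracefully usually means deciding what the UI does next. The sketch below covers the Android side, where `PermissionsAndroid.request` resolves to `'granted'`, `'denied'`, or `'never_ask_again'`. The `NextStep` names are hypothetical labels for this example; the mapping itself is one reasonable policy, not a requirement of the library.

```typescript
// The three string values PermissionsAndroid.request can resolve to.
type AndroidPermissionResult = 'granted' | 'denied' | 'never_ask_again';

// Hypothetical UI actions for this example.
type NextStep = 'start-recognition' | 'show-rationale' | 'open-settings';

function nextStepForPermission(result: AndroidPermissionResult): NextStep {
  switch (result) {
    case 'granted':
      return 'start-recognition';
    case 'denied':
      // The user can still be asked again, so explain why the mic is needed.
      return 'show-rationale';
    case 'never_ask_again':
      // The system will not prompt again; only the settings screen can help.
      return 'open-settings';
  }
}

console.log(nextStepForPermission('never_ask_again')); // "open-settings"
```

For the `'open-settings'` case, React Native's `Linking.openSettings()` can take the user to the app's settings page, while a text field remains available as the alternative input method.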

Initializing Voice Recognition

After obtaining permissions, initialize the voice recognition module and set up event handlers:

import { useEffect, useState } from 'react';

// Inside your component:
const [isListening, setIsListening] = useState(false);
const [results, setResults] = useState<string[]>([]);
const [partialResults, setPartialResults] = useState<string[]>([]);

useEffect(() => {
  // Reset previous output whenever a new session starts
  Voice.onSpeechStart = () => {
    setIsListening(true);
    setPartialResults([]);
    setResults([]);
  };

  Voice.onSpeechEnd = () => {
    setIsListening(false);
  };

  Voice.onSpeechResults = (event) => {
    setResults(event.value || []);
  };

  Voice.onSpeechPartialResults = (event) => {
    setPartialResults(event.value || []);
  };

  Voice.onSpeechError = (event) => {
    setIsListening(false);
    console.error('Speech recognition error:', event.error);
  };

  // Release the native recognizer and detach handlers on unmount
  return () => {
    Voice.destroy().then(Voice.removeAllListeners);
  };
}, []);

The event handlers receive recognition results as arrays of strings, with the most likely transcription appearing first. Partial results provide real-time feedback as the user speaks, creating a more engaging user experience that shows the system is processing their speech correctly.
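As a small illustration of how those two arrays can drive the UI, the helper below picks the single string to render. It is a sketch with a hypothetical name (`textToDisplay`), assuming the convention described above that the first element is the most likely transcription.

```typescript
// Final results win once they arrive; until then, show the top partial
// hypothesis so the user sees live feedback while speaking.
function textToDisplay(finalResults: string[], partialResults: string[]): string {
  const top = finalResults[0] ?? partialResults[0] ?? '';
  return top.trim();
}

console.log(textToDisplay([], ['hello wor']));              // "hello wor"
console.log(textToDisplay(['hello world'], ['hello wor'])); // "hello world"
```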

Starting and Stopping Recognition

The core functionality involves starting recognition when the user activates voice input and stopping it when they finish:

import { Alert } from 'react-native';

const startListening = async () => {
  const hasPermission = await requestPermissions();
  if (!hasPermission) {
    Alert.alert(
      'Permission Required',
      'Microphone access is required for speech recognition.'
    );
    return;
  }
  try {
    await Voice.start('en-US');
  } catch (error) {
    console.error('Failed to start recognition:', error);
  }
};

// stop() ends audio capture and lets the recognizer deliver final results
const stopListening = async () => {
  try {
    await Voice.stop();
  } catch (error) {
    console.error('Failed to stop recognition:', error);
  }
};

// cancel() abandons the current session
const cancelListening = async () => {
  try {
    await Voice.cancel();
  } catch (error) {
    console.error('Failed to cancel recognition:', error);
  }
};

The start method accepts locale parameters for different languages. The available locales depend on the device's language settings and installed speech recognition packages. Recognition continues until the user stops speaking for a configured period, explicitly cancels, or an error occurs.

Building the Dictation Interface

Creating the Voice Input Component

The voice input component serves as the primary interaction point with clear visual feedback:

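import React, { useEffect, useState } from 'react';
import { Text, TouchableOpacity, View } from 'react-native';
import Voice from '@react-native-voice/voice';

// `styles` is assumed to be a StyleSheet defined elsewhere in the file
const VoiceInput = () => {
  const [isListening, setIsListening] = useState(false);
  const [partialText, setPartialText] = useState('');
  const [finalText, setFinalText] = useState('');

  useEffect(() => {
    // Keep component state in sync with recognizer events
    Voice.onSpeechStart = () => setIsListening(true);
    Voice.onSpeechEnd = () => setIsListening(false);
    Voice.onSpeechError = () => setIsListening(false);
    Voice.onSpeechPartialResults = (event) => setPartialText(event.value?.[0] ?? '');
    Voice.onSpeechResults = (event) => {
      setFinalText(event.value?.[0] ?? '');
      setPartialText('');
    };
    return () => {
      Voice.destroy().then(Voice.removeAllListeners);
    };
  }, []);

  const toggleListening = async () => {
    try {
      if (isListening) {
        await Voice.stop();
      } else {
        await Voice.start('en-US');
      }
    } catch (error) {
      console.error('Failed to toggle recognition:', error);
    }
  };

  return (
    <View style={styles.container}>
      <TouchableOpacity
        style={[styles.micButton, isListening && styles.listening]}
        onPress={toggleListening}
      >
        <Text style={styles.micIcon}>{isListening ? '🔴' : '🎤'}</Text>
      </TouchableOpacity>
      {partialText ? <Text style={styles.partialText}>{partialText}</Text> : null}
      {finalText ? <Text style={styles.resultText}>{finalText}</Text> : null}
    </View>
  );
};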

A well-designed component uses animation and color changes to indicate listening state. Recording indicators like pulsing animations or microphone icons help users understand their speech is being captured. The component should also display partial results as they become available, giving users confidence that their speech is being processed correctly.

This approach aligns with our mobile app development best practices for creating intuitive user interfaces that leverage native device capabilities.

Managing Recognition State

State management for voice recognition involves tracking listening status, recognized text, error states, and available locales:

interface RecognitionState {
  status: 'idle' | 'listening' | 'processing' | 'error';
  partialResults: string[];
  finalResults: string[];
  error?: string;
  availableLocales: string[];
  currentLocale: string;
}

const initialState: RecognitionState = {
  status: 'idle',
  partialResults: [],
  finalResults: [],
  error: undefined,
  availableLocales: [],
  currentLocale: 'en-US',
};

The recognition lifecycle includes states that the UI must reflect: idle shows the microphone ready to start, listening shows active recording with partial results, processing covers the gap between the end of speech and the arrival of final results, and error provides a helpful message and alternative actions.

As the Widlarz Group's analysis reveals, robust state management must handle edge cases such as recognition timeout, permission changes, and recognition unavailability.
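The lifecycle described above maps naturally onto a reducer. The following is a minimal sketch rather than a prescribed design: the action names (`START`, `PARTIAL`, `SPEECH_END`, `FINAL`, `ERROR`) are hypothetical, and the state is a trimmed version of the interface shown earlier so the example stays self-contained.

```typescript
type Status = 'idle' | 'listening' | 'processing' | 'error';

interface LifecycleState {
  status: Status;
  partialResults: string[];
  finalResults: string[];
  error?: string;
}

// Hypothetical actions, one per recognizer event the UI cares about.
type LifecycleAction =
  | { type: 'START' }
  | { type: 'PARTIAL'; results: string[] }
  | { type: 'SPEECH_END' }
  | { type: 'FINAL'; results: string[] }
  | { type: 'ERROR'; message: string };

function recognitionReducer(state: LifecycleState, action: LifecycleAction): LifecycleState {
  switch (action.type) {
    case 'START':
      // A new session clears previous output and any earlier error.
      return { status: 'listening', partialResults: [], finalResults: [] };
    case 'PARTIAL':
      // Ignore stray partials that arrive outside an active session.
      return state.status === 'listening'
        ? { ...state, partialResults: action.results }
        : state;
    case 'SPEECH_END':
      return state.status === 'listening' ? { ...state, status: 'processing' } : state;
    case 'FINAL':
      return { status: 'idle', partialResults: [], finalResults: action.results };
    case 'ERROR':
      return { ...state, status: 'error', error: action.message };
  }
}
```

Each `Voice` event handler would then dispatch one action, for example with React's `useReducer`, which keeps edge cases such as late partial results in one place instead of scattered across handlers.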

Handling Different Locales

Supporting multiple languages requires configuring recognition with appropriate locale settings:

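const getAvailableLocales = async (): Promise<string[]> => {
  // Assumption: a locale-listing method is not part of the documented
  // @react-native-voice/voice API, so check that it exists before calling it.
  const listLocales = (Voice as any).getAvailableLocales;
  if (typeof listLocales !== 'function') {
    return []; // caller falls back to its default locale
  }
  try {
    return await listLocales.call(Voice);
  } catch (error) {
    return [];
  }
};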

// Users should be able to select their preferred language
// The available locales depend on the device's installed speech recognition packages

Key considerations:

Check locale availability before presenting language options
Handle cases where the requested locale is unavailable
Provide default fallback locale
Store user preference for future sessions
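The considerations above can be folded into one small lookup. This sketch assumes locales are BCP 47 style tags such as `en-US`; the `resolveLocale` name and the fallback order (exact match, then same language, then default) are illustrative choices, not behavior of any library.

```typescript
// Pick the locale to pass to the recognizer:
// 1. exact match (case-insensitive), 2. same language, 3. fallback.
function resolveLocale(
  requested: string,
  available: string[],
  fallback: string = 'en-US',
): string {
  const wanted = requested.toLowerCase();
  const exact = available.find((locale) => locale.toLowerCase() === wanted);
  if (exact !== undefined) {
    return exact;
  }
  const language = wanted.split('-')[0];
  const sameLanguage = available.find(
    (locale) => locale.toLowerCase().split('-')[0] === language,
  );
  return sameLanguage ?? fallback;
}

const supported = ['en-US', 'fr-FR', 'es-ES'];
console.log(resolveLocale('fr-fr', supported)); // "fr-FR"
console.log(resolveLocale('en-GB', supported)); // "en-US"
console.log(resolveLocale('ja-JP', supported)); // "en-US"
```

The resolved value is what would be stored as the user's preference, so a later session does not repeat the lookup against a locale the device cannot serve.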

Implementing multi-language support is essential for global-ready mobile applications that serve diverse user bases across different regions and language preferences.

Advanced Implementation Patterns

Error Handling and Recovery

Robust error handling distinguishes production-quality voice applications:

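Voice.onSpeechError = (event) => {
  // event.error can be undefined, so default to an empty object
  const { code, message } = event.error ?? {};
  switch (code) {
    case 'no-speech':
      handleNoSpeechError();
      break;
    case 'audio':
      handleAudioError();
      break;
    case 'network':
      handleNetworkError();
      break;
    case 'not-allowed':
      handlePermissionError();
      break;
    default:
      handleGenericError(message);
  }
};

Common error scenarios:

no-speech: User didn't speak or recognition timed out
audio: Microphone or audio configuration error
network: Network failure during cloud-based recognition
not-allowed: Permission denied by user

The code names above follow Web Speech API conventions. The codes that @react-native-voice/voice actually delivers are platform-specific (on Android they typically derive from the SpeechRecognizer error constants), so log event.error on each platform and normalize the values before switching on them.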

As documented by The Widlarz Group, each error type requires different handling strategies to provide users with helpful feedback and recovery options.
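Because the raw error codes differ per platform, a normalization layer can keep the handling logic readable. The sketch below assumes the library reports Android's `SpeechRecognizer` error constants as numeric strings in `event.error.code`; that is an assumption to verify against the installed version. The category names and the `categorizeError` helper are hypothetical, while the numeric values are the documented Android constants.

```typescript
type ErrorCategory = 'no-speech' | 'audio' | 'network' | 'not-allowed' | 'busy' | 'unknown';

// android.speech.SpeechRecognizer error constants, keyed as strings
// (assumed delivery format; confirm by logging event.error on a device).
const ANDROID_ERROR_CATEGORIES = new Map<string, ErrorCategory>([
  ['1', 'network'],     // ERROR_NETWORK_TIMEOUT
  ['2', 'network'],     // ERROR_NETWORK
  ['3', 'audio'],       // ERROR_AUDIO
  ['4', 'network'],     // ERROR_SERVER
  ['6', 'no-speech'],   // ERROR_SPEECH_TIMEOUT
  ['7', 'no-speech'],   // ERROR_NO_MATCH
  ['8', 'busy'],        // ERROR_RECOGNIZER_BUSY
  ['9', 'not-allowed'], // ERROR_INSUFFICIENT_PERMISSIONS
]);

function categorizeError(error?: { code?: string; message?: string }): ErrorCategory {
  const code = error?.code ?? '';
  return ANDROID_ERROR_CATEGORIES.get(code) ?? 'unknown';
}

console.log(categorizeError({ code: '7', message: '7/No match' })); // "no-speech"
console.log(categorizeError(undefined));                            // "unknown"
```

An equivalent table for iOS would need to be built from the errors observed there; anything unmapped falls through to `'unknown'` and the generic handler.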

Custom Vocabulary and Context

Advanced applications can improve recognition accuracy by providing custom vocabulary. However, react-native-voice does not expose this functionality directly:

iOS SFSpeechRecognizer supports contextual strings that can improve recognition for expected phrases
Android SpeechRecognizer Intent extras allow configuration of recognition behavior
These features require custom native module implementation

For applications requiring domain-specific recognition (medical, legal terminology), custom native modules are necessary to access these advanced features. The Widlarz Group's advanced guide provides insights into creating custom native modules for enhanced speech recognition capabilities.

When building specialized applications like AI-powered mobile solutions, custom vocabulary support can significantly improve recognition accuracy for domain-specific terminology and jargon.

Continuous Recognition Mode

For live transcription or real-time translation, continuous recognition provides uninterrupted audio processing:

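// Assumption: @react-native-voice/voice has no documented continuous flag, so
// this sketch restarts recognition after each final result instead.
let keepListening = false; // in a component, hold this in a ref

const startContinuousMode = async () => {
  keepListening = true;
  await Voice.start('en-US');
};

Voice.onSpeechResults = (event) => {
  const newSegment = event.value?.[0];
  if (newSegment) setTranscript(prev => [...prev, newSegment]);
  // Start the next session; set keepListening = false before Voice.stop() to end
  if (keepListening) Voice.start('en-US').catch(console.error);
};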

Continuous mode considerations:

Accumulates results over multiple utterances
Requires careful state management and result buffering
Higher battery and network implications
Provide clear feedback about session status

As noted in the LogRocket continuous recognition guide, continuous mode is ideal for scenarios like live transcription, meeting notes, or real-time translation services.
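Result buffering for a long session can be kept separate from the recognizer wiring. The class below is a minimal sketch with hypothetical names (`TranscriptBuffer`, `maxSegments`): it accumulates final segments across utterances, skips empty ones, and caps memory by dropping the oldest segments.

```typescript
class TranscriptBuffer {
  private segments: string[] = [];
  private readonly maxSegments: number;

  constructor(maxSegments: number = 500) {
    this.maxSegments = maxSegments;
  }

  // Add one final result; empty or missing segments are ignored.
  append(segment: string | undefined): void {
    const cleaned = (segment ?? '').trim();
    if (cleaned === '') {
      return;
    }
    this.segments.push(cleaned);
    if (this.segments.length > this.maxSegments) {
      this.segments.shift(); // drop the oldest segment to bound memory
    }
  }

  toText(): string {
    return this.segments.join(' ');
  }

  clear(): void {
    this.segments = [];
  }
}

const buffer = new TranscriptBuffer();
buffer.append('meeting starts at nine');
buffer.append(undefined);
buffer.append('  bring the slides ');
console.log(buffer.toText()); // "meeting starts at nine bring the slides"
```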

Testing and Optimization

Testing Speech Recognition

Testing voice-enabled applications requires both automated and manual approaches:

Automated tests can verify UI interactions and state management:

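import Voice from '@react-native-voice/voice';

// Mock the default export; event handlers are plain properties the app assigns
jest.mock('@react-native-voice/voice', () => ({
  __esModule: true,
  default: {
    start: jest.fn().mockResolvedValue(undefined),
    stop: jest.fn().mockResolvedValue(undefined),
    destroy: jest.fn().mockResolvedValue(undefined),
    removeAllListeners: jest.fn(),
  },
}));

it('updates state when speech results are received', () => {
  // Assumes the code under test has already assigned Voice.onSpeechResults
  // (for example by rendering the component) and writes to useVoiceStore
  const mockResults = ['Hello world'];
  Voice.onSpeechResults?.({ value: mockResults });
  expect(useVoiceStore.getState().results).toEqual(mockResults);
});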

Manual testing should cover diverse speakers, accents, speaking rates, and background noise levels. As Picovoice recommends, field testing with real users provides invaluable feedback about recognition quality and user experience.

Test scenarios should include various audio environments and speaking patterns to ensure robust functionality across different use cases.
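To compare recognition quality across those scenarios, it helps to score each transcript against what was actually said. Word error rate (WER) is a common metric for this; it is not mentioned in the sources above and is offered here only as one possible measure. The sketch computes word-level edit distance divided by the number of reference words, with deliberately simple tokenization (lowercasing and whitespace splitting, no punctuation handling).

```typescript
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/\s+/).filter((word) => word.length > 0);
}

// WER = (substitutions + deletions + insertions) / reference word count.
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = tokenize(reference);
  const hyp = tokenize(hypothesis);
  if (ref.length === 0) {
    // WER is undefined for an empty reference; this sketch returns 0 or 1.
    return hyp.length === 0 ? 0 : 1;
  }
  // Rolling-row Levenshtein distance over words.
  let previous: number[] = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const current: number[] = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      current[j] = Math.min(
        previous[j] + 1,        // deletion
        current[j - 1] + 1,     // insertion
        previous[j - 1] + cost, // substitution or match
      );
    }
    previous = current;
  }
  return previous[hyp.length] / ref.length;
}

console.log(wordErrorRate('turn on the lights', 'turn on lights')); // 0.25
```

Recording the WER per speaker, accent, and noise condition turns manual test sessions into numbers that can be compared between releases.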

Performance Optimization

Voice recognition impacts application performance through memory, battery, and UI responsiveness:

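const cleanupRecognition = async () => {
  try {
    // Cancel any in-flight session first, then release the native recognizer
    await Voice.cancel();
    await Voice.destroy();
  } finally {
    // Detach handlers even if the calls above reject
    Voice.removeAllListeners();
  }
};

useEffect(() => {
  return () => {
    cleanupRecognition().catch(console.error);
  };
}, []);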

Optimization strategies:

Properly clean up resources when recognition is inactive
Debounce rapid start/stop requests
Limit recognition duration
Use on-device recognition when possible
Clear UI indication of recording state

According to the Widlarz Group's performance analysis, memory management involves destroying voice instances when no longer needed and removing event listeners to prevent memory leaks. Battery optimization can be achieved by limiting recognition duration and using on-device recognition when possible.
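Two of the strategies above, debouncing rapid start/stop requests and limiting recognition duration, can be sketched as small pure helpers. The names (`createStartStopGuard`, `hasExceededDuration`) and the injected clock are hypothetical; the clock parameter exists only so the logic can be tested without real timers.

```typescript
// Returns a function that accepts a request only if at least
// minIntervalMs has passed since the last accepted request.
function createStartStopGuard(
  minIntervalMs: number,
  now: () => number = Date.now,
): () => boolean {
  let lastAccepted = -Infinity;
  return function shouldAccept(): boolean {
    const current = now();
    if (current - lastAccepted < minIntervalMs) {
      return false; // too soon after the previous toggle; ignore this tap
    }
    lastAccepted = current;
    return true;
  };
}

// True once a session has been running for maxDurationMs or longer.
function hasExceededDuration(
  startedAtMs: number,
  maxDurationMs: number,
  nowMs: number,
): boolean {
  return nowMs - startedAtMs >= maxDurationMs;
}

const acceptToggle = createStartStopGuard(500);
if (acceptToggle()) {
  // Safe to call the start or stop function here.
}
```

A mic button handler would call `acceptToggle()` before starting or stopping recognition, and a periodic check with `hasExceededDuration` could stop a session that has run past its limit.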

Conclusion

Building a speech-to-text dictation application in React Native involves understanding speech recognition technology, selecting appropriate libraries, implementing proper permissions handling, and creating intuitive user interfaces.

Key takeaways:

react-native-voice provides a solid foundation for most use cases
Advanced requirements may necessitate custom native module development
Thorough error handling ensures production-quality experience
Performance optimization prevents resource leaks
Testing across diverse scenarios ensures reliable functionality

The voice interface landscape continues to evolve with improvements in on-device recognition, expanded language support, and more accurate transcription. React Native developers who master these concepts create compelling applications leveraging natural voice interaction.

For teams looking to integrate voice capabilities into their mobile applications, partnering with experienced React Native developers can accelerate development and ensure production-ready implementations that scale with user needs. Additionally, combining voice interfaces with comprehensive web development services creates cohesive digital experiences that work seamlessly across web and mobile platforms.
