Build a React Native Speech to Text Dictation App
Voice interfaces have become essential in modern mobile applications. From dictated messages to voice commands and accessibility features, speech recognition enables more natural user interactions. This guide shows how to build a complete dictation application using React Native.
Understanding Speech Recognition in React Native
How Speech-to-Text Technology Works
Speech recognition technology converts spoken language into written text through a process involving audio capture, signal processing, and pattern matching. The technology has evolved from basic keyword detection to sophisticated language understanding that handles accents, context, and natural speech patterns.
The recognition process begins with capturing audio through the device's microphone. This audio signal undergoes processing to remove noise and extract meaningful features. The processed audio is then compared against acoustic models and language models to determine the most likely text representation.
Available Speech Recognition Libraries
The React Native ecosystem offers several libraries for implementing speech recognition, each with different capabilities and trade-offs:
- react-native-voice: The most widely used solution, providing a unified interface to the native speech recognition APIs on both iOS and Android
- expo-speech-recognition: Expo-compatible alternative that wraps iOS SFSpeechRecognizer, Android SpeechRecognizer, and Web SpeechRecognition API
- Platform-specific solutions: Custom native modules for advanced features beyond standard library capabilities
As covered in the LogRocket tutorial on React Native speech-to-text, the react-native-voice library provides comprehensive methods for starting and stopping recognition with event callbacks for results, partial results, errors, and recognition end.
For developers building comprehensive mobile solutions, integrating speech recognition alongside custom web development services creates powerful cross-platform experiences that leverage native device capabilities.
Choose the right library for your React Native project
react-native-voice
Most widely used cross-platform library. Uses Android SpeechRecognizer and iOS SFSpeechRecognizer. Provides start, stop, cancel, and destroy methods with comprehensive event callbacks.
expo-speech-recognition
Expo-managed workflow compatible. Implements native speech recognition for iOS, Android, and Web. Easier setup within Expo ecosystem.
Custom Native Modules
For advanced features like custom vocabulary, confidence scoring, and detailed analytics. Requires native iOS/Android development.
On-Device vs Cloud-Based Recognition
The choice between on-device and cloud-based speech recognition involves several factors:
Cloud-based recognition typically offers higher accuracy using large-scale language models. Services like Google Speech-to-Text and Amazon Transcribe handle complex vocabulary, multiple languages, and various accents effectively. However, they require network connectivity and may have usage costs.
On-device recognition has advanced significantly with compact neural network models. Benefits include low latency because there is no network round trip, privacy since audio never leaves the device, and offline functionality. As noted in Picovoice's React Native speech recognition guide, modern on-device models achieve competitive accuracy for many use cases while providing enhanced privacy and reduced latency.
For applications requiring both approaches, a hybrid strategy uses on-device recognition for immediate feedback while sending audio to cloud services for higher accuracy when needed. This approach aligns with modern AI automation strategies that optimize for both performance and accuracy.
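The hybrid strategy can be sketched as a small routing decision. This is only an illustration: `chooseEngine`, `RoutingInput`, and the confidence threshold are hypothetical names, and it assumes the local recognizer exposes some confidence score, which the standard libraries discussed here do not necessarily provide.

```typescript
// Hypothetical types for illustration; none of these names come from a library.
type Engine = 'on-device' | 'cloud';

interface RoutingInput {
  isOnline: boolean;           // current network reachability
  onDeviceConfidence: number;  // 0..1 score from the local recognizer (assumed available)
  confidenceThreshold: number; // below this, escalate to the cloud service
}

// Keep the on-device result unless the device is online and the local score is low.
function chooseEngine(input: RoutingInput): Engine {
  if (!input.isOnline) {
    return 'on-device';
  }
  return input.onDeviceConfidence >= input.confidenceThreshold
    ? 'on-device'
    : 'cloud';
}
```

Keeping the decision in a pure function makes the threshold easy to tune and to test without a device or a network.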
Setting Up the Project
Installing Dependencies
Building a speech-to-text dictation app requires installing appropriate dependencies:
# Installing react-native-voice
npm install @react-native-voice/voice
# iOS: Navigate to ios directory and run pod install
cd ios && pod install && cd ..
# Android: no extra install step, but RECORD_AUDIO must still be granted at runtime (see the manifest entries and permission code below)
Platform-Specific Configuration
Android requires declaring the android.permission.RECORD_AUDIO permission and potentially android.permission.INTERNET for cloud-based recognition. The SpeechRecognizer service availability should be checked during application initialization.
iOS requires two usage descriptions in Info.plist: NSMicrophoneUsageDescription for microphone access and NSSpeechRecognitionUsageDescription for speech recognition authorization. SFSpeechRecognizer may also need an active network connection when recognition runs on Apple's servers rather than on the device.
<!-- Android permissions in AndroidManifest.xml -->
<uses-permission android:name="android.permission.RECORD_AUDIO" />
<uses-permission android:name="android.permission.INTERNET" />
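<!-- iOS permissions in Info.plist -->
<key>NSMicrophoneUsageDescription</key>
<string>This app needs microphone access for speech recognition.</string>
<key>NSSpeechRecognitionUsageDescription</key>
<string>This app uses speech recognition to convert your speech to text.</string>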
As documented in the LogRocket implementation guide, proper permission handling is critical for speech recognition functionality to work correctly on both platforms.
Implementing Voice Recognition
Requesting Permissions
Proper permission handling is essential for speech recognition functionality:
import Voice from '@react-native-voice/voice';
import { Platform, PermissionsAndroid } from 'react-native';
const requestPermissions = async () => {
if (Platform.OS === 'android') {
const granted = await PermissionsAndroid.request(
PermissionsAndroid.PERMISSIONS.RECORD_AUDIO,
{
title: 'Microphone Permission',
message: 'This app needs microphone access for speech recognition.',
buttonNeutral: 'Ask Me Later',
buttonNegative: 'Cancel',
buttonPositive: 'OK',
}
);
return granted === PermissionsAndroid.RESULTS.GRANTED;
}
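if (Platform.OS === 'ios') {
// As far as the documented @react-native-voice/voice API goes, there is no
// separate authorization call on iOS. The system shows the microphone and
// speech recognition prompts the first time recognition starts, using the
// Info.plist usage strings. Expect a denial to surface as a start() failure
// or an onSpeechError event instead of being reported here.
return true;
}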
return false;
};
Key points:
- Android requires RECORD_AUDIO permission
- iOS requires both microphone and speech recognition authorization
- Permission requests should occur when user first attempts voice input
- Well-crafted descriptions increase user trust and permission grants
The iOS speech recognition permission uses system caching, so subsequent requests will not prompt again if the user previously denied access. Applications should handle denial gracefully by providing alternative input methods.
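Handling denial gracefully on Android can be sketched as a mapping from the permission result to the next UI step. The three result strings match the values of PermissionsAndroid.RESULTS; `NextStep` and `nextStepForPermission` are hypothetical names introduced for this example.

```typescript
// String values of PermissionsAndroid.RESULTS in React Native.
type AndroidPermissionResult = 'granted' | 'denied' | 'never_ask_again';

// Hypothetical UI actions; the app decides how each one is presented.
type NextStep = 'start-recognition' | 'show-rationale' | 'open-settings';

function nextStepForPermission(result: AndroidPermissionResult): NextStep {
  switch (result) {
    case 'granted':
      return 'start-recognition';
    case 'denied':
      // The user can still be asked again, so explain why the mic is needed.
      return 'show-rationale';
    case 'never_ask_again':
      // The system will not prompt again; only the settings screen can help.
      return 'open-settings';
  }
}
```

In every non-granted case the keyboard should stay available as the alternative input method.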
Initializing Voice Recognition
After obtaining permissions, initialize the voice recognition module and set up event handlers:
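import React, { useEffect, useState } from 'react';
// The hooks below must be called inside a function component or a custom hook.
const [isListening, setIsListening] = useState(false);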
const [results, setResults] = useState<string[]>([]);
const [partialResults, setPartialResults] = useState<string[]>([]);
useEffect(() => {
Voice.onSpeechStart = () => {
setIsListening(true);
setPartialResults([]);
setResults([]);
};
Voice.onSpeechEnd = () => {
setIsListening(false);
};
Voice.onSpeechResults = (event) => {
setResults(event.value || []);
};
Voice.onSpeechPartialResults = (event) => {
setPartialResults(event.value || []);
};
Voice.onSpeechError = (event) => {
setIsListening(false);
console.error('Speech recognition error:', event.error);
};
return () => {
Voice.destroy().then(Voice.removeAllListeners);
};
}, []);
The event handlers receive recognition results as arrays of strings, with the most likely transcription appearing first. Partial results provide real-time feedback as the user speaks, creating a more engaging user experience that shows the system is processing their speech correctly.
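Choosing what to show from those arrays can be sketched as two small helpers. `SpeechEventLike`, `bestTranscript`, and `textToDisplay` are hypothetical names; the event shape mirrors the `value` array used by the handlers above.

```typescript
// Minimal stand-in for the event objects passed to the result handlers.
interface SpeechEventLike {
  value?: string[];
}

// The first entry is the most likely transcription; an empty or missing
// array yields an empty string.
function bestTranscript(event: SpeechEventLike): string {
  return event.value?.[0]?.trim() ?? '';
}

// Prefer the final result once it exists, otherwise show the live partial text.
function textToDisplay(partial: SpeechEventLike, final: SpeechEventLike): string {
  const finalText = bestTranscript(final);
  return finalText !== '' ? finalText : bestTranscript(partial);
}
```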
Starting and Stopping Recognition
The core functionality involves starting recognition when the user activates voice input and stopping it when they finish:
// Requires: import { Alert } from 'react-native';
const startListening = async () => {
const hasPermission = await requestPermissions();
if (!hasPermission) {
Alert.alert('Permission Required',
'Microphone access is required for speech recognition.');
return;
}
try {
await Voice.start('en-US');
} catch (error) {
console.error('Failed to start recognition:', error);
}
};
const stopListening = async () => {
try {
await Voice.stop();
} catch (error) {
console.error('Failed to stop recognition:', error);
}
};
const cancelListening = async () => {
try {
await Voice.cancel();
} catch (error) {
console.error('Failed to cancel recognition:', error);
}
};
The start method accepts a locale identifier such as 'en-US', which selects the recognition language. The available locales depend on the device's language settings and installed speech recognition packages. Recognition continues until the user stops speaking for a configured period, explicitly cancels, or an error occurs.
Building the Dictation Interface
Creating the Voice Input Component
The voice input component serves as the primary interaction point with clear visual feedback:
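// Requires useEffect and useState from 'react', and View, Text, and
// TouchableOpacity from 'react-native'. `styles` is assumed to be a
// StyleSheet defined elsewhere in the file.
const VoiceInput = () => {
  const [isListening, setIsListening] = useState(false);
  const [partialText, setPartialText] = useState('');
  const [finalText, setFinalText] = useState('');
  useEffect(() => {
    // Wire recognition events to component state; without this the UI never updates.
    Voice.onSpeechStart = () => setIsListening(true);
    Voice.onSpeechEnd = () => setIsListening(false);
    Voice.onSpeechError = () => setIsListening(false);
    Voice.onSpeechPartialResults = (event) => setPartialText(event.value?.[0] ?? '');
    Voice.onSpeechResults = (event) => {
      setFinalText(event.value?.[0] ?? '');
      setPartialText('');
    };
    return () => {
      Voice.destroy().then(Voice.removeAllListeners);
    };
  }, []);
  const toggleListening = async () => {
    try {
      if (isListening) {
        await Voice.stop();
      } else {
        await Voice.start('en-US');
      }
    } catch (error) {
      console.error('Failed to toggle recognition:', error);
    }
  };
  return (
    <View style={styles.container}>
      <TouchableOpacity
        style={[styles.micButton, isListening && styles.listening]}
        onPress={toggleListening}
      >
        <Text style={styles.micIcon}>{isListening ? '🔴' : '🎤'}</Text>
      </TouchableOpacity>
      {partialText ? <Text style={styles.partialText}>{partialText}</Text> : null}
      {finalText ? <Text style={styles.resultText}>{finalText}</Text> : null}
    </View>
  );
};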
A well-designed component uses animation and color changes to indicate listening state. Recording indicators like pulsing animations or microphone icons help users understand their speech is being captured. The component should also display partial results as they become available, giving users confidence that their speech is being processed correctly.
This approach aligns with our mobile app development best practices for creating intuitive user interfaces that leverage native device capabilities.
Managing Recognition State
State management for voice recognition involves tracking listening status, recognized text, error states, and available locales:
interface RecognitionState {
status: 'idle' | 'listening' | 'processing' | 'error';
partialResults: string[];
finalResults: string[];
error?: string;
availableLocales: string[];
currentLocale: string;
}
const initialState: RecognitionState = {
status: 'idle',
partialResults: [],
finalResults: [],
error: undefined,
availableLocales: [],
currentLocale: 'en-US',
};
The recognition lifecycle includes states that the UI must reflect: idle shows the microphone ready to start, listening shows active recording with partial results, processing indicates that audio capture has ended and final results are pending, and error shows a helpful message with alternative actions.
As the Widlarz Group's analysis reveals, robust state management must handle edge cases such as recognition timeout, permission changes, and recognition unavailability.
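The lifecycle transitions can be sketched as a reducer. This is a minimal sketch under assumptions: the action names and the rule that late partial results are ignored are illustrative choices, not something the library or the cited analysis prescribes.

```typescript
type Status = 'idle' | 'listening' | 'processing' | 'error';

interface State {
  status: Status;
  partialResults: string[];
  finalResults: string[];
  error?: string;
}

// Hypothetical actions, one per recognition event the UI cares about.
type Action =
  | { type: 'START' }
  | { type: 'PARTIAL'; value: string[] }
  | { type: 'SPEECH_END' }
  | { type: 'FINAL'; value: string[] }
  | { type: 'ERROR'; message: string };

const idleState: State = { status: 'idle', partialResults: [], finalResults: [] };

function recognitionReducer(state: State, action: Action): State {
  switch (action.type) {
    case 'START':
      // A new session clears old results and any previous error.
      return { status: 'listening', partialResults: [], finalResults: [] };
    case 'PARTIAL':
      // Ignore partials that arrive after listening has stopped.
      return state.status === 'listening'
        ? { ...state, partialResults: action.value }
        : state;
    case 'SPEECH_END':
      return state.status === 'listening'
        ? { ...state, status: 'processing' }
        : state;
    case 'FINAL':
      return { ...state, status: 'idle', partialResults: [], finalResults: action.value };
    case 'ERROR':
      // Timeouts, permission changes, and unavailability all land here.
      return { ...state, status: 'error', error: action.message };
  }
}
```

In a component this reducer could be passed to React's useReducer, with each Voice event handler dispatching the matching action.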
Handling Different Locales
Supporting multiple languages requires configuring recognition with appropriate locale settings:
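const getAvailableLocales = async () => {
  try {
    // Caution: getAvailableLocales does not appear in the documented
    // @react-native-voice/voice API. Treat it as a placeholder for whatever
    // locale query your library or custom native module exposes.
    const locales = await Voice.getAvailableLocales();
    return locales;
  } catch (error) {
    // Fall back to an empty list so callers can apply a default locale.
    return [];
  }
};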
// Users should be able to select their preferred language
// The available locales depend on the device's installed speech recognition packages
Key considerations:
- Check locale availability before presenting language options
- Handle cases where the requested locale is unavailable
- Provide default fallback locale
- Store user preference for future sessions
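The considerations above can be sketched as a locale resolver. `resolveLocale` is a hypothetical helper; it assumes locales arrive as tags like 'en-US' or 'en_US' and that 'en-US' is an acceptable default.

```typescript
// Pick the locale to pass to the recognizer:
// 1. an exact match for the user's preference,
// 2. otherwise another region of the same language,
// 3. otherwise the fallback (returned even if it is not in the list).
function resolveLocale(
  preferred: string,
  available: string[],
  fallback: string = 'en-US',
): string {
  // Normalize 'en_US' and 'EN-us' to 'en-us' for comparison only.
  const norm = (tag: string): string => tag.replace(/_/g, '-').toLowerCase();

  const exact = available.find((tag) => norm(tag) === norm(preferred));
  if (exact !== undefined) {
    return exact;
  }

  const language = norm(preferred).split('-')[0];
  const sameLanguage = available.find((tag) => norm(tag).split('-')[0] === language);
  return sameLanguage ?? fallback;
}
```

The stored user preference would be passed as `preferred` on the next session, so a locale that disappears from the device degrades to a related one instead of failing.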
Implementing multi-language support is essential for global-ready mobile applications that serve diverse user bases across different regions and language preferences.
Advanced Implementation Patterns
Error Handling and Recovery
Robust error handling distinguishes production-quality voice applications:
Voice.onSpeechError = (event) => {
// event.error can be missing, so fall back to an empty object.
const { code, message } = event.error ?? {};
switch (code) {
case 'no-speech':
handleNoSpeechError();
break;
case 'audio':
handleAudioError();
break;
case 'network':
handleNetworkError();
break;
case 'not-allowed':
handlePermissionError();
break;
default:
handleGenericError(message);
}
};
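Common error scenarios (the string codes here are illustrative; the actual values in event.error.code vary by platform, and Android, for example, reports the numeric SpeechRecognizer error constants):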
- no-speech: User didn't speak or recognition timed out
- audio: Microphone or audio configuration error
- network: Required for cloud-based recognition
- not-allowed: Permission denied by user
As documented by The Widlarz Group, each error type requires different handling strategies to provide users with helpful feedback and recovery options.
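On Android, the categories above can be derived from the SpeechRecognizer error constants. The numeric values below are the Android constants; `categorizeAndroidError`, the `busy` and `unknown` categories, and the assumption that the code may arrive as a string such as '7' are illustrative.

```typescript
type ErrorCategory = 'no-speech' | 'audio' | 'network' | 'not-allowed' | 'busy' | 'unknown';

// android.speech.SpeechRecognizer error constants mapped to UI categories.
const ANDROID_ERROR_CATEGORIES: Record<number, ErrorCategory> = {
  1: 'network',     // ERROR_NETWORK_TIMEOUT
  2: 'network',     // ERROR_NETWORK
  3: 'audio',       // ERROR_AUDIO
  4: 'network',     // ERROR_SERVER
  5: 'unknown',     // ERROR_CLIENT
  6: 'no-speech',   // ERROR_SPEECH_TIMEOUT
  7: 'no-speech',   // ERROR_NO_MATCH
  8: 'busy',        // ERROR_RECOGNIZER_BUSY
  9: 'not-allowed', // ERROR_INSUFFICIENT_PERMISSIONS
};

// Accepts a number or a string that starts with the numeric code.
function categorizeAndroidError(code: string | number | undefined): ErrorCategory {
  const numeric = typeof code === 'number' ? code : parseInt(code ?? '', 10);
  return ANDROID_ERROR_CATEGORIES[numeric] ?? 'unknown';
}
```

The resulting category can then drive the same switch-based handling, with iOS errors mapped by a separate function.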
Custom Vocabulary and Context
Advanced applications can improve recognition accuracy by providing custom vocabulary. However, react-native-voice does not expose this functionality directly:
- iOS SFSpeechRecognizer supports contextual strings that can improve recognition for expected phrases
- Android SpeechRecognizer Intent extras allow configuration of recognition behavior
- These features require custom native module implementation
For applications requiring domain-specific recognition (medical, legal terminology), custom native modules are necessary to access these advanced features. The Widlarz Group's advanced guide provides insights into creating custom native modules for enhanced speech recognition capabilities.
When building specialized applications like AI-powered mobile solutions, custom vocabulary support can significantly improve recognition accuracy for domain-specific terminology and jargon.
Continuous Recognition Mode
For live transcription or real-time translation, continuous recognition provides uninterrupted audio processing:
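const startContinuousMode = async () => {
  // As far as the documented API goes, the second argument to Voice.start is
  // an options object, not a boolean, and there is no "continuous" flag.
  // Restarting recognition from onSpeechEnd is one way to approximate it.
  await Voice.start('en-US');
};
Voice.onSpeechResults = (event) => {
  const newSegment = event.value?.[0];
  if (newSegment) {
    setTranscript(prev => [...prev, newSegment]);
  }
};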
Continuous mode considerations:
- Accumulates results over multiple utterances
- Requires careful state management and result buffering
- Higher battery and network implications
- Provide clear feedback about session status
As noted in the LogRocket continuous recognition guide, continuous mode is ideal for scenarios like live transcription, meeting notes, or real-time translation services.
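One way to accumulate results over multiple utterances is to restart recognition each time an utterance ends and buffer the top hypothesis of each one. The sketch below assumes that approach; `createContinuousSession` and `StartFn` are hypothetical, and the start function is injected so the logic stays independent of any particular library.

```typescript
// Caller-supplied function that kicks off recognition, for example a wrapper
// that calls the library's start method and handles its rejection.
type StartFn = (locale: string) => void;

function createContinuousSession(start: StartFn, locale: string) {
  let active = false;
  const transcript: string[] = [];

  return {
    transcript,

    begin(): void {
      active = true;
      start(locale);
    },

    // Call from the results handler: keep only the top, non-empty hypothesis.
    addResults(values?: string[]): void {
      const segment = values?.[0]?.trim();
      if (segment) {
        transcript.push(segment);
      }
    },

    // Call from the speech-end handler: restart unless the user ended the session.
    // Returns true when recognition was restarted.
    handleSpeechEnd(): boolean {
      if (!active) {
        return false;
      }
      start(locale);
      return true;
    },

    end(): void {
      active = false;
    },
  };
}
```

Each restart costs battery and, for cloud recognition, network traffic, so the session should end as soon as the user signals they are done.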
Testing and Optimization
Testing Speech Recognition
Testing voice-enabled applications requires both automated and manual approaches:
Automated tests can verify UI interactions and state management:
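jest.mock('@react-native-voice/voice', () => ({
  start: jest.fn().mockResolvedValue(undefined),
  stop: jest.fn().mockResolvedValue(undefined),
  cancel: jest.fn().mockResolvedValue(undefined),
  destroy: jest.fn().mockResolvedValue(undefined),
  removeAllListeners: jest.fn(),
  onSpeechStart: jest.fn(),
  onSpeechResults: jest.fn(),
  onSpeechError: jest.fn(),
}));
it('updates state when speech results are received', () => {
  // Mount the component or hook under test first so that it assigns its own
  // handler to Voice.onSpeechResults; the bare jest.fn() mock does nothing.
  // useVoiceStore is a placeholder for wherever the app keeps recognition state.
  const mockResults = ['Hello world'];
  Voice.onSpeechResults({ value: mockResults });
  expect(useVoiceStore.getState().results).toEqual(mockResults);
});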
Manual testing should cover diverse speakers, accents, speaking rates, and background noise levels. As Picovoice recommends, field testing with real users provides invaluable feedback about recognition quality and user experience.
Test scenarios should include various audio environments and speaking patterns to ensure robust functionality across different use cases.
Performance Optimization
Voice recognition impacts application performance through memory, battery, and UI responsiveness:
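const cleanupRecognition = async () => {
  try {
    await Voice.cancel();  // stop any in-flight session first
    await Voice.destroy(); // then release the native recognizer
  } catch (error) {
    console.error('Failed to clean up recognition:', error);
  } finally {
    Voice.removeAllListeners(); // detach handlers even if cleanup failed
  }
};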
useEffect(() => {
return () => {
cleanupRecognition();
};
}, []);
Optimization strategies:
- Properly clean up resources when recognition is inactive
- Debounce rapid start/stop requests
- Limit recognition duration
- Use on-device recognition when possible
- Clear UI indication of recording state
According to the Widlarz Group's performance analysis, memory management involves destroying voice instances when no longer needed and removing event listeners to prevent memory leaks. Battery optimization can be achieved by limiting recognition duration and using on-device recognition when possible.
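Two of these strategies, guarding against rapid start/stop taps and limiting recognition duration, can be sketched as pure helpers. `createToggleGuard` and `hasExceededLimit` are hypothetical names. The guard is a leading-edge throttle, a simpler stand-in for a true debounce, and the clock is injected so it can be tested without timers.

```typescript
// Returns a function that accepts a toggle only if enough time has passed
// since the last accepted one.
function createToggleGuard(
  minIntervalMs: number,
  now: () => number = Date.now,
): () => boolean {
  let lastAccepted = Number.NEGATIVE_INFINITY;
  return function shouldAccept(): boolean {
    const current = now();
    if (current - lastAccepted < minIntervalMs) {
      return false; // too soon after the previous tap, ignore it
    }
    lastAccepted = current;
    return true;
  };
}

// True once a session has run for at least maxDurationMs.
function hasExceededLimit(
  startedAtMs: number,
  nowMs: number,
  maxDurationMs: number,
): boolean {
  return nowMs - startedAtMs >= maxDurationMs;
}
```

A mic button handler would call the guard before starting or stopping recognition, and a periodic check with `hasExceededLimit` would stop sessions that run too long.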
Conclusion
Building a speech-to-text dictation application in React Native involves understanding speech recognition technology, selecting appropriate libraries, implementing proper permissions handling, and creating intuitive user interfaces.
Key takeaways:
- react-native-voice provides a solid foundation for most use cases
- Advanced requirements may necessitate custom native module development
- Thorough error handling ensures production-quality experience
- Performance optimization prevents resource leaks
- Testing across diverse scenarios ensures reliable functionality
The voice interface landscape continues to evolve with improvements in on-device recognition, expanded language support, and more accurate transcription. React Native developers who master these concepts create compelling applications leveraging natural voice interaction.
For teams looking to integrate voice capabilities into their mobile applications, partnering with experienced React Native developers can accelerate development and ensure production-ready implementations that scale with user needs. Additionally, combining voice interfaces with comprehensive web development services creates cohesive digital experiences that work seamlessly across web and mobile platforms.