Build a Transcription App with React Native and OpenAI Whisper

Create a powerful voice-to-text mobile application using React Native for the cross-platform frontend and OpenAI's Whisper model for accurate speech recognition. Complete step-by-step implementation guide.

Introduction

Voice-to-text transcription has become an essential feature in modern mobile applications, enabling hands-free data entry, improving accessibility, and enhancing user productivity. From note-taking applications to voice-controlled interfaces, converting speech to text opens up new possibilities for mobile experiences.

In this comprehensive guide, you'll learn how to build a complete transcription application using React Native for the cross-platform mobile frontend and OpenAI's Whisper model for powerful, accurate speech recognition on the backend. This combination delivers enterprise-grade transcription capabilities while maintaining the development efficiency of JavaScript and Python ecosystems.

By the end of this tutorial, you'll have a working application that can record voice input from a device's microphone, send the audio to a Python backend, process it with Whisper's automatic speech recognition, and display the transcribed text as soon as processing finishes.

Prerequisites

Before diving into development, ensure you have the following tools and knowledge in place. This project requires coordination between multiple development environments and programming languages.

Required Software

Node.js (v18 or higher) and npm are essential for managing JavaScript dependencies in your React Native project. Recent React Native releases and the Community CLI require Node 18 or newer, so an older runtime will fail during project creation.

Python 3.8+ is required for running the Flask backend and OpenAI's Whisper model. Python's extensive ecosystem for machine learning and audio processing makes it the ideal choice for transcription workloads. You'll use pip to install the necessary packages for audio handling and web serving.

React Native CLI provides full control over native modules and is recommended for this project since audio recording requires native device capabilities that some abstraction layers like Expo may limit. The CLI gives you direct access to native code configuration files.

Android Studio and Xcode are necessary for testing on Android and iOS emulators or physical devices respectively. These development environments provide the platform-specific tooling needed for debugging and deployment. Android Studio handles SDK configuration and emulator management, while Xcode provides iOS simulators and device provisioning capabilities.

Development Knowledge

You should have familiarity with React Native concepts including components, state management using hooks, and navigation patterns. Understanding JavaScript promises and asynchronous programming is crucial since audio processing involves many async operations that require proper error handling and state management.

Basic Python knowledge will help you understand the backend implementation and make modifications as needed. REST API concepts are essential for understanding how the mobile app communicates with the transcription server through HTTP requests and JSON responses.

Additional Tools

Git for version control is strongly recommended for tracking changes throughout the development process. Initialize your repository early and commit frequently to maintain a history of your implementation steps.

FFmpeg is required by Whisper for audio processing and format conversion, ensuring compatibility across different audio input formats. Install FFmpeg on your development machine before proceeding with the backend setup.
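
A quick way to confirm that FFmpeg is discoverable before starting the backend is to look it up on the PATH. The following is a minimal sketch using only Python's standard library; the helper name ffmpeg_available is illustrative and not part of Whisper.

```python
import shutil


def ffmpeg_available(binary: str = "ffmpeg") -> bool:
    """Return True if the given executable can be found on the PATH."""
    return shutil.which(binary) is not None


if __name__ == "__main__":
    if ffmpeg_available():
        print("FFmpeg found on PATH")
    else:
        print("FFmpeg not found; install it before running Whisper")
```

Running a check like this at server startup turns a confusing decode failure at request time into an immediate, readable message.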

Project Architecture

The transcription application follows a clean client-server architecture that separates concerns between the mobile frontend and the transcription backend. This separation allows each component to be developed, tested, and scaled independently.

Frontend Architecture

The React Native mobile application serves as the user interface for all recording and result display operations. It handles microphone permission requests, audio capture using native device APIs, file management for recorded audio, HTTP communication with the backend API, and rendering of transcription results in an intuitive format.

The frontend is built using React Native's component-based architecture, with hooks managing the application state throughout the recording and upload process. This approach ensures clean code organization and makes it easier to maintain and extend functionality over time.

Backend Architecture

The Python Flask server provides a RESTful API endpoint for receiving audio uploads and returning transcriptions. It leverages OpenAI's Whisper model, an open-source automatic speech recognition system trained on 680,000 hours of multilingual data. The backend handles file uploads, audio processing, transcription inference, and response formatting.

Whisper offers multiple model sizes ranging from "tiny" (about 39 million parameters) to "large" (about 1.55 billion parameters), allowing you to balance accuracy against processing speed and resource consumption based on your application's requirements. The base model is a reasonable starting point for most use cases.
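
To make the trade-off concrete, the sketch below picks the largest model that fits a memory budget. The parameter counts and VRAM figures are the approximate values published in the Whisper README; the pick_model helper and its selection rule are assumptions for illustration, not part of Whisper.

```python
# Approximate figures from the Whisper README; treat them as rough guidance.
MODEL_SPECS = [
    # (name, parameters in millions, approximate VRAM in GB)
    ("tiny", 39, 1),
    ("base", 74, 1),
    ("small", 244, 2),
    ("medium", 769, 5),
    ("large", 1550, 10),
]


def pick_model(available_vram_gb: float) -> str:
    """Return the largest model whose approximate VRAM need fits the budget.

    Falls back to "tiny" when nothing fits.
    """
    chosen = "tiny"
    for name, _params_millions, vram_gb in MODEL_SPECS:
        if vram_gb <= available_vram_gb:
            chosen = name
    return chosen
```

The returned name can be passed straight to whisper.load_model. Memory is only one constraint; latency requirements may still justify a smaller model than the budget allows.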

Data Flow

The application follows a sequential data flow designed for reliability and user feedback:

  1. User initiates recording through the mobile app's UI
  2. Audio is captured and saved to local device storage
  3. App uploads the audio file to the Flask server via a multipart POST request
  4. Backend receives the file, validates the format, and runs Whisper transcription
  5. Server returns JSON containing the transcribed text and detected language
  6. App displays the transcription result to the user

As built, the app uploads each recording after it finishes, which is a batch workflow. Uploading shorter recordings more frequently can approximate near-real-time feedback, depending on how you configure the upload timing and user interface.

Setting Up the React Native Frontend

Creating the Project

Initialize a new React Native project using the Community CLI, which provides full native module support essential for audio recording functionality:

npx @react-native-community/cli@latest init VoiceToTextApp
cd VoiceToTextApp

The Community CLI is preferred over Expo for this project because audio recording functionality requires access to native device APIs that may be limited in Expo's managed workflow. With the CLI approach, you have direct access to native project files for iOS and Android configuration.

Installing Dependencies

Install the required packages for audio recording, permissions management, and HTTP communication:

npm install react-native-audio-recorder-player react-native-permissions axios
cd ios && pod install && cd ..

react-native-audio-recorder-player provides cross-platform audio recording capabilities with a simple API for starting, stopping, and managing recordings. This library handles the complexity of working with different audio formats and device-specific behaviors across iOS and Android.

react-native-permissions handles runtime permission requests for microphone access on both Android and iOS platforms. This package provides a unified interface for requesting permissions that works consistently across different OS versions and device manufacturers.

axios simplifies HTTP requests for uploading audio files to the transcription backend. Axios provides better error handling and automatic JSON transformation compared to the native fetch API.

Platform Configuration

iOS Configuration: Add the microphone usage description to your Info.plist file to comply with Apple's privacy requirements:

<key>NSMicrophoneUsageDescription</key>
<string>This app needs access to your microphone for voice recording and transcription.</string>

This message appears when the app first requests microphone access, so make it clear to users why the permission is needed. Apple reviews apps for appropriate permission usage descriptions.

Android Configuration: Add the required permissions to your AndroidManifest.xml:

<uses-permission android:name="android.permission.RECORD_AUDIO"/>
<uses-permission android:name="android.permission.WRITE_EXTERNAL_STORAGE"/>
<uses-permission android:name="android.permission.READ_EXTERNAL_STORAGE"/>

Modern Android versions require runtime permission requests in addition to these manifest declarations. The application code below handles this with React Native's built-in PermissionsAndroid API; react-native-permissions offers a unified alternative that also covers iOS.

Implementing Audio Recording

Initializing the Audio Recorder

Import and initialize the audio recorder player in your React Native component:

import React, { useState, useEffect } from 'react';
import { View, Text, Button, Alert, ActivityIndicator, PermissionsAndroid, Platform, StyleSheet } from 'react-native';
import AudioRecorderPlayer from 'react-native-audio-recorder-player';
import axios from 'axios';

// Create one recorder instance and reuse it across recordings
const audioRecorderPlayer = new AudioRecorderPlayer();

The audio recorder is initialized once and reused throughout the component lifecycle. This singleton pattern prevents resource conflicts and ensures consistent behavior across multiple recording sessions.

Requesting Microphone Permissions

Implement runtime permission requests for microphone access. Android requires an explicit runtime request from the app, while iOS shows its own system prompt the first time the app accesses the microphone, using the NSMicrophoneUsageDescription text from Info.plist:

const requestPermissions = async () => {
  if (Platform.OS === 'android') {
    const granted = await PermissionsAndroid.request(
      PermissionsAndroid.PERMISSIONS.RECORD_AUDIO,
      {
        title: 'Microphone Permission',
        message: 'This app needs microphone access to record voice for transcription.',
        buttonPositive: 'OK',
      }
    );
    return granted === PermissionsAndroid.RESULTS.GRANTED;
  }
  // iOS prompts automatically on first microphone access; Info.plist supplies the message
  return true;
};

This function checks the platform and requests the appropriate permission. On Android, it waits for the user's response. On iOS, it returns true immediately because the system presents the prompt when recording first starts, so a declined prompt surfaces at that point rather than here.

Recording Controls

Implement start and stop recording functionality with proper state management:

const [isRecording, setIsRecording] = useState(false);
const [audioPath, setAudioPath] = useState(null);
const [recordingDuration, setRecordingDuration] = useState(0);

const startRecording = async () => {
  const hasPermission = await requestPermissions();
  if (!hasPermission) {
    Alert.alert('Permission Required', 'Microphone access is needed for recording.');
    return;
  }

  const result = await audioRecorderPlayer.startRecorder();
  audioRecorderPlayer.addRecordBackListener((e) => {
    setRecordingDuration(e.currentPosition);
  });

  setIsRecording(true);
  setAudioPath(result);
};

const stopRecording = async () => {
  const result = await audioRecorderPlayer.stopRecorder();
  audioRecorderPlayer.removeRecordBackListener();
  setIsRecording(false);
  return result;
};

The recorder returns a file path where the audio is stored, which you'll use for both playback and upload to the transcription backend. The record back listener provides real-time updates on recording progress, enabling features like duration displays or audio level visualizations.

Setting Up the Python Backend with Whisper

Backend Project Setup

Create a Python virtual environment and install the required packages:

mkdir whisper-backend
cd whisper-backend
python3 -m venv venv
source venv/bin/activate # macOS/Linux
pip install flask flask-cors openai-whisper

Flask provides the lightweight web server framework for handling HTTP requests with minimal boilerplate code. Flask's simplicity makes it ideal for focused APIs like transcription endpoints.

Flask-CORS enables Cross-Origin Resource Sharing. CORS is enforced by web browsers, so a native React Native app can reach the backend without it; enabling it is useful if you also call the API from a browser-based client, such as a web build of the app.

OpenAI Whisper is the core transcription engine, offering strong accuracy across many languages. Its code and model weights are open source under the MIT license, so there are no per-transcription API fees, although you are responsible for the compute needed to run it.

Flask Application Implementation

Create the Flask application with transcription endpoint:

from flask import Flask, request, jsonify
from flask_cors import CORS
import whisper
import os
import uuid

app = Flask(__name__)
CORS(app)

# Load the Whisper model once at startup; choose a size based on accuracy/speed needs
# Options: tiny, base, small, medium, large
model = whisper.load_model("base")

UPLOAD_FOLDER = "uploads"
os.makedirs(UPLOAD_FOLDER, exist_ok=True)

@app.route("/transcribe", methods=["POST"])
def transcribe_audio():
    if "audio" not in request.files:
        return jsonify({"error": "No audio file provided"}), 400

    audio = request.files["audio"]
    # Keep only the extension; the stored name is a random UUID, not client input
    extension = os.path.splitext(audio.filename or "")[1]
    filename = f"{uuid.uuid4().hex}{extension}"
    filepath = os.path.join(UPLOAD_FOLDER, filename)

    try:
        audio.save(filepath)
        result = model.transcribe(filepath)
        transcription = result.get("text", "").strip()

        return jsonify({
            "transcription": transcription,
            "language": result.get("language", "unknown")
        })
    except Exception as e:
        return jsonify({"error": str(e)}), 500
    finally:
        # Always remove the temporary upload, even when transcription fails
        if os.path.exists(filepath):
            os.remove(filepath)

if __name__ == "__main__":
    # Keep debug off when binding to 0.0.0.0: the debugger must not be reachable from other machines
    app.run(host="0.0.0.0", port=5000, debug=False)

The backend loads the Whisper model once at startup for efficiency. Each model size offers different accuracy and speed trade-offs: "tiny" is fastest but least accurate, while "large" provides the highest accuracy at the cost of longer processing times.
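
The data flow calls for validating the uploaded format before transcription. One simple way to do that is an extension allow-list, sketched below with the standard library only. The ALLOWED_EXTENSIONS set and the is_allowed_audio helper are hypothetical names introduced for illustration; adjust the set to match the formats your recorder produces.

```python
import os

# Hypothetical allow-list; extend it to match the formats your recorder produces.
ALLOWED_EXTENSIONS = {".wav", ".mp3", ".m4a", ".mp4", ".aac", ".ogg", ".flac", ".webm"}


def is_allowed_audio(filename):
    """Return True when the filename has an extension on the allow-list."""
    if not filename:
        return False
    extension = os.path.splitext(filename)[1].lower()
    return extension in ALLOWED_EXTENSIONS
```

Inside the endpoint, a check such as is_allowed_audio(audio.filename) can return a 400 response before the file is saved. An extension check is only a first filter: it rejects obvious mistakes cheaply, while FFmpeg still decides whether the content is decodable.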

Running the Server

python app.py

For testing with physical devices on your local network, expose your local server using ngrok:

ngrok http 5000

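Ngrok creates a public URL that your mobile app can access, bypassing local network restrictions that might prevent direct communication between the device and your development machine. The HTTPS endpoint it provides also sidesteps the cleartext-HTTP restrictions that Android and iOS apply by default in release builds.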
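
Before wiring up the mobile app, it can help to exercise the endpoint from the development machine. The sketch below posts a file to /transcribe using only Python's standard library. The helper names build_multipart and transcribe_file are illustrative, and the default URL assumes the server is running locally on port 5000 as configured above.

```python
import json
import os
import urllib.request
import uuid


def build_multipart(field, filename, payload, content_type="application/octet-stream"):
    """Encode one file as multipart/form-data; returns (body, content_type_header)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def transcribe_file(path, url="http://localhost:5000/transcribe"):
    """POST an audio file to the transcription endpoint and return the parsed JSON."""
    with open(path, "rb") as handle:
        body, header = build_multipart("audio", os.path.basename(path), handle.read())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": header}, method="POST"
    )
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling transcribe_file with the path to a short recording should return a dictionary with the transcription and language keys. If this works but the app's upload does not, the problem is likely on the device or network side rather than in the backend.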

Integrating Frontend with Backend

Audio Upload Function

Implement the function to upload recorded audio to the transcription server:

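const uploadAudioForTranscription = async (audioUri) => {
  const formData = new FormData();

  // The recorder's default output is typically .m4a (iOS) or .mp4 (Android),
  // so derive the MIME type from the extension instead of assuming WAV.
  const filename = audioUri.split('/').pop() || 'recording.m4a';
  const extension = filename.split('.').pop().toLowerCase();
  const mimeTypes = { wav: 'audio/wav', m4a: 'audio/mp4', mp4: 'audio/mp4', aac: 'audio/aac' };
  const fileType = mimeTypes[extension] || 'application/octet-stream';

  formData.append('audio', {
    uri: Platform.OS === 'android' ? audioUri : audioUri.replace('file://', ''),
    name: filename,
    type: fileType,
  });

  try {
    const response = await axios.post(
      'http://YOUR_SERVER_IP:5000/transcribe',
      formData,
      {
        headers: { 'Content-Type': 'multipart/form-data' },
        timeout: 30000, // 30 second timeout
      }
    );
    return response.data;
  } catch (error) {
    console.error('Transcription error:', error.response?.data || error.message);
    throw new Error('Transcription failed');
  }
};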

The upload function handles platform-specific file path differences and wraps the axios request in error handling for network failures. The timeout prevents the app from hanging indefinitely on large audio files.

Complete Recording and Transcription Flow

Connect the recording and upload functionality for a seamless user experience:

const [isTranscribing, setIsTranscribing] = useState(false);
const [transcription, setTranscription] = useState(null);
const [transcriptionError, setTranscriptionError] = useState(null);

const handleStopRecording = async () => {
  const recordedPath = await stopRecording();

  if (recordedPath) {
    setIsTranscribing(true);
    setTranscriptionError(null);

    try {
      const result = await uploadAudioForTranscription(recordedPath);
      setTranscription(result.transcription);
    } catch (error) {
      setTranscriptionError(error.message);
    } finally {
      setIsTranscribing(false);
    }
  }
};

Displaying Results

Present the transcription in a clean, readable format with appropriate feedback states:

{isTranscribing && (
  <View style={styles.loadingContainer}>
    <ActivityIndicator size="large" color="#4CAF50" />
    <Text>Transcribing audio...</Text>
  </View>
)}

{transcription && (
  <View style={styles.transcriptionContainer}>
    <Text style={styles.sectionTitle}>Transcription:</Text>
    <View style={styles.transcriptionBox}>
      <Text style={styles.transcriptionText}>{transcription}</Text>
    </View>
  </View>
)}

{transcriptionError && (
  <View style={styles.errorContainer}>
    <Text style={styles.errorText}>Error: {transcriptionError}</Text>
  </View>
)}

This UI pattern provides clear feedback at each stage: loading during transcription, displaying results when available, and showing errors when something goes wrong. The visual hierarchy helps users understand the current state of their transcription request.

Testing and Troubleshooting

Testing on Physical Devices

For accurate audio recording tests, use physical devices rather than emulators. Emulators simulate audio input and may not accurately represent real-world recording quality or performance characteristics.

Android: Enable Developer Mode and USB Debugging on your device, connect via USB, then run npx react-native run-android from your project directory. Check that the device is recognized with adb devices before running the command.

iOS: Open the project in Xcode by running open ios/VoiceToTextApp.xcworkspace, select your physical device, and build. Ensure microphone permissions are configured correctly in Info.plist before testing. Running on a physical iOS device requires signing the app with an Apple ID in Xcode; a paid Apple Developer Program membership is only needed for distribution.

Common Issues and Solutions

Issue | Cause | Solution
Audio not uploading | Invalid file URI | Verify path format for each platform; Android and iOS use different URI schemes
Network errors | Server not reachable | Use ngrok for physical device testing; check firewall settings
Empty transcription | Audio could not be decoded | Confirm FFmpeg is installed so Whisper can decode the uploaded format; the recorder typically produces .m4a on iOS and .mp4 on Android
Slow results | Large Whisper model | Use "tiny" or "base" model instead of "medium" or "large"
Permission denied | Missing configuration | Check Info.plist for iOS and AndroidManifest.xml for Android
Timeout errors | Large audio file | Increase timeout settings or reduce recording length

Performance Optimization

For faster transcription response times, consider the following optimizations:

Use the smaller Whisper models ("tiny" or "base") instead of "medium" or "large" when processing speed is more important than maximum accuracy. The base model provides a good balance for real-time application scenarios.

Limit recordings to 30-60 seconds for a responsive user experience. Long recordings increase upload times and processing duration, which can frustrate users waiting for results.

Implement proper error handling with retry logic for network failures. Users should be able to retry failed uploads without losing their recorded audio.
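
The retry idea can be sketched as exponential backoff around any failing operation. The example is written in Python to match the backend code, and the same structure carries over to the axios call in the app. The retry_with_backoff name, its parameters, and the doubling delay schedule are assumptions chosen for illustration.

```python
import time


def retry_with_backoff(operation, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call operation() until it succeeds or the attempts run out.

    Waits base_delay, 2 * base_delay, 4 * base_delay, ... between tries and
    re-raises the last exception if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

Because the operation is passed in as a callable, the recorded file stays untouched between attempts, so a retry simply re-sends the same audio. In practice it is worth retrying only on network-level failures and not on 4xx responses, which will fail the same way every time.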

Consider implementing audio compression before upload if file size becomes an issue. However, be aware that aggressive compression may affect transcription accuracy.
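
Some quick arithmetic shows when file size is likely to matter. The sketch below estimates upload size for uncompressed PCM WAV and for a constant-bitrate compressed recording. It assumes a canonical 44-byte WAV header and ignores container overhead for compressed audio; the function names and the 128 kbps default are illustrative.

```python
def pcm_wav_size_bytes(seconds, sample_rate=16000, channels=1, bits_per_sample=16):
    """Estimate the size of an uncompressed PCM WAV file, including a 44-byte header."""
    bytes_per_sample = bits_per_sample // 8
    return 44 + int(seconds * sample_rate * channels * bytes_per_sample)


def compressed_size_bytes(seconds, bitrate_kbps=128):
    """Estimate the size of a constant-bitrate recording (container overhead ignored)."""
    return int(seconds * bitrate_kbps * 1000 / 8)


if __name__ == "__main__":
    print(pcm_wav_size_bytes(60))                # 60 s, 16 kHz mono, 16-bit
    print(pcm_wav_size_bytes(60, 44100, 2, 16))  # 60 s, 44.1 kHz stereo, 16-bit
    print(compressed_size_bytes(60))             # 60 s at 128 kbps
```

A minute of 16 kHz mono WAV is about 1.9 MB, while 44.1 kHz stereo is about 10.6 MB. Whisper resamples its input to 16 kHz mono internally, so recording at a higher sample rate mainly increases upload time.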

Optional Enhancements

Multi-Language Support

Whisper supports multiple languages with automatic language detection. You can also specify a language explicitly for improved accuracy on known language content:

# Transcribe with automatic language detection
result = model.transcribe(filepath)

# Or specify language explicitly for better accuracy
result = model.transcribe(filepath, language="es") # Spanish
result = model.transcribe(filepath, language="fr") # French
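
To let the app choose the language per request, the backend can read an optional form field and normalize it before calling transcribe. The sketch below assumes a hypothetical language field sent alongside the audio. SUPPORTED_LANGUAGES is a small illustrative subset, and resolve_language is not part of Whisper.

```python
# Small illustrative subset of language codes; Whisper itself accepts many more.
SUPPORTED_LANGUAGES = {"en", "es", "fr", "de", "it", "pt", "ja", "zh"}


def resolve_language(raw_value):
    """Map a client-supplied language field to a value for transcribe().

    Returns None (auto-detect) when the field is missing, empty, or "auto";
    raises ValueError for codes outside the allow-list.
    """
    if raw_value is None:
        return None
    code = raw_value.strip().lower()
    if code in ("", "auto"):
        return None
    if code not in SUPPORTED_LANGUAGES:
        raise ValueError(f"Unsupported language code: {raw_value!r}")
    return code
```

In the endpoint, this could be used as language = resolve_language(request.form.get("language")) followed by model.transcribe(filepath, language=language). Passing None leaves automatic detection in place, and a ValueError can be turned into a 400 response.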

Save Transcriptions to Database

Persist transcription history using SQLite for local storage or integrate with a cloud database:

import sqlite3

def save_transcription(text, filename):
    conn = sqlite3.connect('transcriptions.db')
    cursor = conn.cursor()
    cursor.execute("CREATE TABLE IF NOT EXISTS history (id TEXT, text TEXT, timestamp DATETIME DEFAULT CURRENT_TIMESTAMP)")
    cursor.execute("INSERT INTO history (id, text) VALUES (?, ?)", (filename, text))
    conn.commit()
    conn.close()
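
Reading the history back is the natural counterpart to saving it. The sketch below queries the same history table, newest first. The fetch_recent helper is a hypothetical addition, and it takes an open connection as an argument, which makes it easy to try against an in-memory database.

```python
import sqlite3


def fetch_recent(conn, limit=10):
    """Return the newest rows from the history table as (id, text, timestamp) tuples."""
    cursor = conn.execute(
        "SELECT id, text, timestamp FROM history "
        "ORDER BY timestamp DESC, rowid DESC LIMIT ?",
        (limit,),
    )
    return cursor.fetchall()
```

CURRENT_TIMESTAMP has one-second resolution, so the rowid tie-breaker keeps the ordering stable when several transcriptions are saved within the same second.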

Audio Playback

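Allow users to replay their recordings before or after transcription. This example uses the react-native-sound package, which must be installed separately:

import Sound from 'react-native-sound';

const playRecording = (audioPath) => {
  // Pass an empty base path for files outside the app bundle;
  // Sound.MAIN_BUNDLE only resolves bundled assets.
  const sound = new Sound(audioPath, '', (error) => {
    if (error) {
      console.log('Failed to load sound', error);
      return;
    }
    sound.play((success) => {
      if (success) {
        console.log('Playback complete');
      }
      sound.release(); // Free the native audio resource
    });
  });
};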

Cloud Deployment

Deploy the Flask backend to production using modern deployment platforms:

  • Render.com - Simple Flask deployment with automatic HTTPS
  • Railway.app - Easy container-based deployment with scalable infrastructure
  • Docker - Containerize with Dockerfile for deployment to AWS, DigitalOcean, or any cloud provider
  • Heroku - Traditional PaaS deployment with straightforward configuration

Containerization with Docker ensures consistent behavior across development and production environments. Create a Dockerfile that installs all dependencies and runs the Flask application.

Summary

You've successfully built a complete voice-to-text transcription application using React Native and OpenAI Whisper. The application demonstrates how to:

  • Record audio from the device's microphone using native APIs with proper permission handling
  • Handle platform-specific permissions for audio recording on both iOS and Android
  • Build a Python Flask backend for audio processing and transcription
  • Implement accurate transcription using OpenAI's Whisper model with configurable model sizes
  • Integrate the frontend and backend through REST APIs with proper error handling
  • Display transcription results in a user-friendly interface with loading states

This foundation can be extended with additional features to create production-ready applications:

Real-time transcription streaming can be implemented by modifying the backend to process audio chunks as they are uploaded, providing immediate feedback to users during longer recording sessions.

Speaker diarization identifies different speakers in recordings, which is useful for meeting transcription where multiple participants need to be distinguished. Whisper does not label speakers itself, so this requires a separate diarization step.

Custom vocabulary improves accuracy for technical fields, medical applications, or industry-specific jargon. Whisper has no dedicated vocabulary feature, but the initial_prompt argument to transcribe can bias it toward domain-specific terms.

Export functionality enables users to save transcriptions in various formats including plain text, PDF, or integration with cloud storage services like Google Drive or Dropbox.

Voice-powered interfaces continue to grow in importance for mobile applications, and the skills developed in this project position you to build sophisticated voice-enabled experiences. The combination of React Native for cross-platform mobile development with intelligent features like Whisper transcription provides a powerful foundation for any application requiring speech-to-text capabilities.
