Retry

Master the retry pattern in JavaScript to build resilient web applications that gracefully handle transient failures, network issues, and service unavailability.

Why Retries Matter in Web Development

When you're building web applications with Next.js, React, or any modern JavaScript framework, your code constantly communicates with external services. API calls to payment processors, database queries, third-party integrations, and microservice communications all travel across networks that are inherently unreliable. The retry pattern gives your application the ability to automatically recover from these transient failures, transforming unreliable networks into reliable user experiences.

But here's the catch: retries are both essential and dangerous. Done right, they transform frustrating failures into seamless successes. Done wrong, they amplify latency, exhaust resources, and cascade into system-wide outages.

The key to balance lies in understanding that retries trade a small amount of latency for dramatically improved availability. Instead of showing users a failure message when a request hits a momentary issue, your application can automatically recover. However, without proper safeguards, automatic retries can transform a small problem into a system-wide catastrophe known as a "retry storm" (see Microsoft Learn's Retry Storm Antipattern guide). When a service experiences difficulty, clients send requests that fail, triggering immediate retries. Each retry adds more load to an already-struggling service, causing it to slow down further and trigger more retries. Within seconds, a minor issue becomes a cascading failure affecting the entire platform.

This is why naive retry implementations--like an infinite loop that retries immediately on every failure--are dangerous. The solution isn't to avoid retries entirely; it's to implement them with proper safeguards like exponential backoff, jitter, and circuit breakers, which we'll explore in this guide. These patterns work hand-in-hand with proper async function design to create applications that handle failures gracefully.

Basic Retry Implementation
async function fetchWithRetry(url, options = {}, maxRetries = 3) {
  let lastError;

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      const response = await fetch(url, options);
      if (!response.ok) {
        throw new Error(`HTTP ${response.status}`);
      }
      return response.json();
    } catch (error) {
      lastError = error;
      if (attempt < maxRetries) {
        const delay = Math.pow(2, attempt) * 100;
        await new Promise(resolve => setTimeout(resolve, delay));
      }
    }
  }

  throw lastError;
}

When to Retry: Understanding Error Types

Not every error warrants a retry. Understanding which errors are transient (and likely to succeed on retry) versus which are permanent (and will fail regardless of attempts) is crucial for building robust applications. This classification is a key part of writing robust API references that handle edge cases gracefully.

Retriable Errors

Network timeouts occur when a request takes longer than expected and the connection is terminated. A gateway timeout (HTTP 504) indicates the server acting as a gateway couldn't receive a timely response. Service unavailable errors (HTTP 503) typically mean the server is temporarily overloaded or under maintenance. Rate limiting responses (HTTP 429) signal you've exceeded allowed request frequency for a given window. Connection resets happen when the remote end unexpectedly closes the connection. These errors typically indicate temporary conditions that may resolve quickly (see Microsoft Learn's Retry Storm Antipattern guide).

Non-Retriable Errors

Authentication failures (HTTP 401) indicate invalid or missing credentials that won't change on retry. Forbidden access errors (HTTP 403) mean the user lacks permission regardless of attempts. Malformed request errors (HTTP 400) indicate client-side bugs that require code fixes. Server errors (HTTP 500) that indicate bugs rather than transient issues should not be retried. Retrying these only wastes resources and delays presenting the actual problem to users or operators.

The key principle: only retry when you have reason to believe the error is temporary and retrying might succeed. Implement error classification in your retry logic to distinguish between retriable and non-retriable conditions (see Microsoft Learn's Retry Storm Antipattern guide).
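
As a rough illustration of that classification, the sketch below retries only the status codes listed as retriable above, plus network-level failures. The names `HttpError`, `isRetriable`, and `fetchWithClassifiedRetry` are hypothetical helpers introduced for this example, and the exact set of retriable codes is an assumption you would tune for the services you call.

```javascript
// Status codes treated as transient in this sketch (taken from the lists above).
const RETRIABLE_STATUS = new Set([429, 503, 504]);

// Hypothetical error type that carries the HTTP status for classification.
class HttpError extends Error {
  constructor(status) {
    super(`HTTP ${status}`);
    this.name = 'HttpError';
    this.status = status;
  }
}

// Returns true when a retry has a reasonable chance of succeeding.
function isRetriable(error) {
  if (error instanceof HttpError) {
    return RETRIABLE_STATUS.has(error.status);
  }
  // fetch() rejects with a TypeError on network-level failures
  // such as DNS errors or connection resets.
  return error instanceof TypeError;
}

async function fetchWithClassifiedRetry(url, options = {}, maxRetries = 3, baseDelay = 100) {
  for (let attempt = 0; ; attempt++) {
    try {
      const response = await fetch(url, options);
      if (!response.ok) throw new HttpError(response.status);
      return await response.json();
    } catch (error) {
      // Permanent errors and exhausted retry budgets surface immediately.
      if (!isRetriable(error) || attempt >= maxRetries) throw error;
      const delay = Math.pow(2, attempt) * baseDelay;
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
}
```

With this shape, a 400 or 401 reaches the caller after a single attempt, while a 503 is retried up to `maxRetries` times. Timeouts raised through an abort signal are not covered here and would need their own branch in `isRetriable`.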

Exponential Backoff: The Foundation of Safe Retries

Exponential backoff is the most important technique for safe retry implementation. Instead of retrying immediately after a failure, you wait progressively longer between attempts: the first retry after 100ms, the next after 200ms, then 400ms, and so on.

This approach accomplishes several things simultaneously. It gives temporary issues time to resolve by spacing out requests. It prevents overwhelming struggling services by reducing load during their recovery period. It distributes retry load over time rather than creating sharp spikes that could worsen an outage. A basic exponential backoff implementation provides a good balance between recovery chances and restraint, giving services time to recover while keeping total retry time manageable (see Microsoft Learn's Retry Storm Antipattern guide).

The mathematical progression follows a simple formula: delay = baseDelay * 2^attempt. With a 100ms base delay, retries occur at approximately 100ms, 200ms, 400ms, 800ms, and so on. This geometric progression ensures that each subsequent retry has more time to succeed while keeping the total retry window within reasonable bounds for user-facing operations. The key insight is that later retries should be spaced further apart, as persistent failures likely indicate more serious issues that need more time to resolve.
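
To make the arithmetic concrete, here is a small sketch that evaluates the formula for each retry and sums the waiting time. The function names are hypothetical, and the `maxDelay` cap is an assumption added for illustration; it is a common safeguard against unbounded growth, not something the formula above requires.

```javascript
// Delay before retry number `attempt` (0-based), capped at maxDelay.
function backoffDelay(attempt, baseDelay = 100, maxDelay = 10000) {
  return Math.min(maxDelay, baseDelay * 2 ** attempt);
}

// Delays for every retry plus the total time spent waiting between attempts.
function backoffSchedule(maxRetries, baseDelay = 100, maxDelay = 10000) {
  const delays = [];
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    delays.push(backoffDelay(attempt, baseDelay, maxDelay));
  }
  const total = delays.reduce((sum, d) => sum + d, 0);
  return { delays, total };
}

console.log(backoffSchedule(4));
// delays: [100, 200, 400, 800], total: 1500
```

The total grows quickly with the base delay: three retries wait 700ms in total with a 100ms base, but 7 seconds with a 1-second base.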

Adding Jitter to Prevent Thundering Herds

Even with exponential backoff, multiple clients retrying simultaneously can create synchronized load patterns. When a service recovers, it faces a wall of requests from all clients whose timers happened to align. This is called the "thundering herd" problem, and it can cause a recovering service to immediately fail again under the sudden load.

Adding jitter--randomized delay variation--disperses retry attempts across a time window, preventing synchronization. The implementation below uses additive jitter of up to 50%: it adds a random value between 0 and half of the calculated delay to each retry attempt. For example, if the exponential backoff calculates 400ms, jitter adds 0-200ms, resulting in a total delay between 400ms and 600ms. ("Full jitter" is a different variant that instead picks the entire delay at random between 0 and the calculated value.)

async function fetchWithJitteredBackoff(url, options = {}, maxRetries = 3, baseDelay = 100) {
  let lastError;

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      const response = await fetch(url, options);
      if (!response.ok) {
        throw new Error(`HTTP ${response.status}`);
      }
      return response.json();
    } catch (error) {
      lastError = error;
      if (attempt < maxRetries) {
        const exponentialDelay = Math.pow(2, attempt) * baseDelay;
        const jitter = Math.random() * exponentialDelay * 0.5;
        const totalDelay = exponentialDelay + jitter;
        await new Promise(resolve => setTimeout(resolve, totalDelay));
      }
    }
  }

  throw lastError;
}

The jitter window scales with the exponential delay, ensuring that later retries are more dispersed than early ones. This prevents the thundering herd effect while still providing effective backoff (see Microsoft Learn's Retry Storm Antipattern guide). The randomization breaks the synchronization that occurs when multiple clients use identical backoff algorithms.
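
The function above adds up to 50% on top of the exponential delay. A widely used alternative, usually called full jitter, draws the whole delay at random between zero and the exponential value. The sketch below puts the two side by side; the function names are hypothetical, and the `random` parameter is an assumption added so the calculation can be tested with a fixed value.

```javascript
// Additive jitter, as in fetchWithJitteredBackoff: delay falls in [d, 1.5 * d).
function additiveJitterDelay(attempt, baseDelay = 100, random = Math.random) {
  const exponentialDelay = baseDelay * 2 ** attempt;
  return exponentialDelay + random() * exponentialDelay * 0.5;
}

// Full jitter: delay falls in [0, d), spreading clients over the whole window.
function fullJitterDelay(attempt, baseDelay = 100, random = Math.random) {
  const exponentialDelay = baseDelay * 2 ** attempt;
  return random() * exponentialDelay;
}

// With a fixed "random" value of 0.5 and attempt 2 (exponential delay 400ms):
console.log(additiveJitterDelay(2, 100, () => 0.5)); // 500
console.log(fullJitterDelay(2, 100, () => 0.5));     // 200
```

Additive jitter guarantees a minimum wait equal to the exponential delay, while full jitter spreads clients more widely at the cost of occasionally retrying almost immediately. Which trade-off fits depends on how many clients share the dependency.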

The Circuit Breaker Pattern: Preventing Cascading Failures

While exponential backoff and jitter help, they don't fully protect against cascading failures. The circuit breaker pattern adds another layer of protection by detecting when a service is truly unhealthy and stopping requests entirely until it recovers (see Microsoft Learn's Retry Storm Antipattern guide).

The circuit breaker works like an electrical circuit: when too many failures occur, it "trips" and stops conducting current. Similarly, when too many requests to a service fail, the circuit breaker opens and immediately fails new requests without even trying the network call. After a cooling-off period, it allows a single "probe" request through. If that succeeds, the circuit closes and normal operation resumes.

Three States

Closed state represents normal operation. Requests flow through to the service, and failures are counted. When failures reach a threshold, the circuit transitions to open state.

Open state is entered once the failure threshold is reached. All requests fail immediately with an error, without attempting the network call. This prevents further load on the struggling service.

Half-open state allows a limited number of probe requests through to test if the service has recovered. If probes succeed, the circuit closes. If they fail, it returns to open state and the cooling-off period restarts.

class CircuitBreaker {
  constructor(failureThreshold = 5, resetTimeout = 30000) {
    this.failureThreshold = failureThreshold;
    this.resetTimeout = resetTimeout;
    this.state = 'closed';
    this.failures = 0;
    this.lastFailure = null;
  }

  async execute(request) {
    if (this.state === 'open') {
      // After the cooling-off period, let a probe request through.
      if (Date.now() - this.lastFailure > this.resetTimeout) {
        this.state = 'half-open';
      } else {
        throw new Error('Circuit breaker is open');
      }
    }

    try {
      const result = await request();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  // Any success closes the circuit and clears the failure count.
  onSuccess() {
    this.failures = 0;
    this.state = 'closed';
  }

  // A failed probe in half-open state reopens the circuit, because the
  // failure count is still at or above the threshold.
  onFailure() {
    this.failures++;
    this.lastFailure = Date.now();
    if (this.failures >= this.failureThreshold) {
      this.state = 'open';
    }
  }
}

Combining circuit breakers with retry logic creates a robust system: retries handle transient failures, while circuit breakers prevent sustained hammering of genuinely unhealthy services (see Microsoft Learn's Retry Storm Antipattern guide).
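
One possible way to wire the two together is sketched below: each attempt goes through the breaker, and the retry loop gives up as soon as the breaker reports that it is open. To keep the example self-contained it uses a compact breaker (`SimpleBreaker`) rather than the class above. `CircuitOpenError`, `retryThroughBreaker`, and the injectable `now` clock are all hypothetical names introduced for this illustration.

```javascript
// Marker error so callers can tell "breaker open" apart from real failures.
class CircuitOpenError extends Error {
  constructor() {
    super('Circuit breaker is open');
    this.name = 'CircuitOpenError';
  }
}

// Compact breaker with the same three states described above.
class SimpleBreaker {
  constructor(failureThreshold = 5, resetTimeout = 30000, now = Date.now) {
    this.failureThreshold = failureThreshold;
    this.resetTimeout = resetTimeout;
    this.now = now;
    this.state = 'closed';
    this.failures = 0;
    this.openedAt = 0;
  }

  async execute(request) {
    if (this.state === 'open') {
      if (this.now() - this.openedAt < this.resetTimeout) throw new CircuitOpenError();
      this.state = 'half-open'; // cooling-off period elapsed: allow a probe
    }
    try {
      const result = await request();
      this.failures = 0;
      this.state = 'closed';
      return result;
    } catch (error) {
      this.failures++;
      if (this.state === 'half-open' || this.failures >= this.failureThreshold) {
        this.state = 'open';
        this.openedAt = this.now();
      }
      throw error;
    }
  }
}

async function retryThroughBreaker(breaker, request, maxRetries = 3, baseDelay = 100) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await breaker.execute(request);
    } catch (error) {
      // An open breaker means the dependency is unhealthy: stop retrying at once.
      if (error instanceof CircuitOpenError || attempt >= maxRetries) throw error;
      await new Promise(resolve => setTimeout(resolve, baseDelay * 2 ** attempt));
    }
  }
}
```

Sharing one breaker instance per downstream dependency lets failures from every caller count toward the same threshold, so the retry loops of individual requests stop early once the dependency is known to be down.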

Handling HTTP 429 Rate Limiting Gracefully

Rate limiting is a common scenario where proper retry handling is essential. APIs return HTTP 429 (Too Many Requests) to indicate you've exceeded allowed request frequency. Many APIs include a Retry-After header indicating how long to wait before trying again (see Microsoft Learn's Retry Storm Antipattern guide).

When you receive a 429 response, check for the Retry-After header first. This header may contain either a number of seconds to wait or a specific timestamp. Parsing this header correctly ensures you wait the appropriate amount of time before retrying:

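async function fetchWithRateLimitHandling(url, options = {}, maxRetries = 3) {
  const response = await fetch(url, options);

  if (response.status === 429 && maxRetries > 0) {
    const retryAfter = response.headers.get('Retry-After');
    let delay = 1000; // fallback when the header is missing or unparseable

    if (retryAfter) {
      // Retry-After is either a number of seconds or an HTTP date.
      const seconds = Number(retryAfter);
      const parsed = Number.isNaN(seconds)
        ? new Date(retryAfter).getTime() - Date.now()
        : seconds * 1000;
      // Guard against invalid dates and timestamps already in the past.
      if (Number.isFinite(parsed)) {
        delay = Math.max(0, parsed);
      }
    }

    console.log(`Rate limited. Waiting ${delay}ms before retry...`);
    await new Promise(resolve => setTimeout(resolve, delay));
    // Decrement the budget so a persistent 429 cannot recurse forever.
    return fetchWithRateLimitHandling(url, options, maxRetries - 1);
  }

  if (!response.ok) {
    throw new Error(`HTTP ${response.status}`);
  }

  return response.json();
}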

Honoring the Retry-After header demonstrates respect for the service's throttling mechanism. It helps maintain good relationships with API providers and prevents your IP from being blocked for repeated violations. Many services track clients who consistently ignore rate limits and may impose harsher restrictions or temporary bans. By properly handling rate limiting, you show API providers that you're a responsible consumer of their services.
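
The header parsing can also be pulled out into a small pure function, which makes both formats easy to test in isolation. The sketch below is one way to do that. The name `parseRetryAfter`, the 1-second fallback, and the 60-second cap are assumptions chosen for illustration rather than values any particular API prescribes.

```javascript
// Converts a Retry-After header value into milliseconds to wait.
// Returns fallbackMs when the header is missing or cannot be parsed.
function parseRetryAfter(headerValue, nowMs = Date.now(), fallbackMs = 1000, maxMs = 60000) {
  if (!headerValue) return fallbackMs;

  const trimmed = headerValue.trim();
  let delayMs;
  if (/^\d+$/.test(trimmed)) {
    // Seconds form, for example "120".
    delayMs = Number(trimmed) * 1000;
  } else {
    // HTTP date form, for example "Wed, 21 Oct 2015 07:28:00 GMT".
    const dateMs = Date.parse(trimmed);
    if (Number.isNaN(dateMs)) return fallbackMs;
    delayMs = dateMs - nowMs;
  }
  // Clamp: never negative, never longer than the caller is willing to wait.
  return Math.min(maxMs, Math.max(0, delayMs));
}

console.log(parseRetryAfter('5')); // 5000
```

Capping the wait protects interactive code paths from a server that asks for a very long pause. A background job could pass a larger `maxMs`.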

Retry in Next.js: Server Actions and API Routes

Next.js applications have specific patterns for handling retries depending on where your code runs. Understanding these patterns helps you implement consistent retry behavior across your entire application. For server actions, you can combine retry logic with Next.js action patterns to create robust form submissions.

Server Actions

Server actions run on the server and can communicate with databases, third-party APIs, and other services. You can wrap async operations with retry logic using a reusable helper function:

'use server'

async function withRetry<T>(
  fn: () => Promise<T>,
  options: { maxRetries?: number; baseDelay?: number } = {}
): Promise<T> {
  const { maxRetries = 3, baseDelay = 200 } = options;
  let lastError: unknown;

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      if (attempt < maxRetries) {
        const delay = Math.pow(2, attempt) * baseDelay;
        await new Promise(resolve => setTimeout(resolve, delay));
      }
    }
  }

  throw lastError;
}

export async function submitOrder(orderData: OrderData) {
  // Retry each step on its own so a failed email does not create the order again.
  // Retrying createOrder is only safe if it is idempotent.
  const order = await withRetry(() => createOrder(orderData), { maxRetries: 3, baseDelay: 200 });
  await withRetry(() => sendConfirmationEmail(order), { maxRetries: 3, baseDelay: 200 });
  return order;
}

API Routes

Route handlers in the App Router run on the server (the Node.js runtime by default, or the Edge runtime if you opt in) and can implement retry logic for downstream service calls:

// app/api/orders/route.ts
import { withRetry } from '@/lib/retry';

export async function POST(request: Request) {
  try {
    const data = await request.json();

    const result = await withRetry(() => processOrder(data), {
      maxRetries: 3,
      baseDelay: 100
    });

    return Response.json(result);
  } catch (error) {
    console.error('Order processing failed:', error);
    return Response.json(
      { error: 'Failed to process order' },
      { status: 500 }
    );
  }
}

By centralizing retry logic in shared utilities, you ensure consistent behavior across server actions, API routes, and any other service-to-service communication in your Next.js application.

Performance Considerations for Retry Logic

Every retry adds latency to the user experience, even with exponential backoff. Poorly configured retry logic can turn a fast failure into a slow failure, which is often worse than failing immediately. Consider these strategies for minimizing retry impact on your application's performance.

Set appropriate maximum retries. For user-facing operations, 2-3 retries with a maximum total retry time of 5-10 seconds is usually sufficient. Background jobs can be more aggressive since latency matters less for deferred processing. Interactive operations should fail quickly to avoid frustrating users with long-waiting buttons or spinners.

Use timeouts alongside retries. A request that will never succeed should fail fast rather than consuming retries. Implement a per-request timeout that aborts individual attempts before the retry delay kicks in. This prevents slow requests from blocking retries that might succeed quickly on a new connection.

Monitor retry metrics. Track retry rates, success rates on retries, and total retry latency. Spikes in these metrics often indicate underlying problems with services you're calling. If 80% of your payment API calls are retried, something is wrong with either your implementation or the service itself.

Consider context-aware retry limits. Interactive operations might have shorter retry windows than background processing. A checkout button retrying 10 times over 30 seconds creates a poor user experience; a background sync job doing the same is reasonable (see Microsoft Learn's Retry Storm Antipattern guide). Differentiate between synchronous user actions and asynchronous background work when configuring retry parameters.

Profile retry overhead. Measure the additional latency retries introduce under various failure scenarios. If exponential backoff with 3 retries adds 7 seconds to every failed request (as it does with a 1-second base delay: 1s + 2s + 4s), you may need to adjust base delays or maximum retries to balance reliability with responsiveness.
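
The timeout and total-retry-time advice above could be combined roughly as follows. This is a sketch under stated assumptions: `fetchWithTimeoutAndRetry` and its `config` fields (`attemptTimeout`, `totalBudget`) are hypothetical names, and the default values are placeholders for a user-facing call, not recommendations from the sources.

```javascript
async function fetchWithTimeoutAndRetry(url, options = {}, config = {}) {
  const { maxRetries = 2, baseDelay = 200, attemptTimeout = 2000, totalBudget = 8000 } = config;
  const startedAt = Date.now();
  let lastError;

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      // AbortSignal.timeout aborts this attempt only; each attempt gets a fresh signal.
      // Note: this replaces any signal the caller passed in options.
      const signal = AbortSignal.timeout(attemptTimeout);
      const response = await fetch(url, { ...options, signal });
      if (!response.ok) throw new Error(`HTTP ${response.status}`);
      return await response.json();
    } catch (error) {
      lastError = error;
      const delay = baseDelay * 2 ** attempt;
      const elapsed = Date.now() - startedAt;
      // Stop when retries are exhausted or the next wait would exceed the budget.
      if (attempt >= maxRetries || elapsed + delay > totalBudget) break;
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }

  throw lastError;
}
```

A background job might call this with a larger `maxRetries` and `totalBudget`, while a checkout button would keep both small so the user sees an error quickly.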

Common Anti-Patterns to Avoid

Several common mistakes undermine retry effectiveness and can introduce new problems into your application. Understanding these anti-patterns helps you avoid the pitfalls that turn useful retry logic into problematic behavior.

Infinite retry loops without limits will eventually consume all resources and prevent recovery. Without maximum retry counts, a persistently failing service will receive unlimited requests from your application. This wastes resources on your side and may worsen the downstream service's problems. Always set explicit maximum retry counts in your retry implementation.

Immediate retries without any delay simply duplicate failed requests. At best, this adds load without giving issues time to resolve. At worst, it can trigger rate limiting or DoS-like conditions against the service you're trying to reach. Even a small delay between retries significantly improves behavior and helps avoid retry storms.

Retrying non-retriable errors wastes resources and delays error reporting. A 400 Bad Request won't succeed on retry because the request itself is malformed. Implementing proper error classification prevents wasted effort and ensures users see meaningful error messages promptly rather than watching retries fail repeatedly.

Ignoring Retry-After headers from rate-limited services shows poor API etiquette and may result in your IP being blocked. The Retry-After header exists specifically to help clients behave responsibly. Disregarding it demonstrates that you're not a respectful API consumer, which can lead to harsher rate limiting or access revocation.

Retrying on the same connection after certain failures may not work if the underlying connection is broken. Some network failures corrupt connection state, and reusing the connection simply propagates the failure. Implementing connection recreation on retry for failed connections can improve success rates for certain types of failures.

Testing Retry Logic

Testing retry behavior requires simulating failures, which can be challenging since you need deterministic control over when failures occur. Modern testing approaches make this achievable with proper mocking and verification strategies.

Mock the fetch function to return failures for specific test cases. Using Jest or Vitest, you can replace the global fetch with a mock that returns specific status codes on schedule. This allows you to verify that retries trigger on the right conditions and that the correct number of attempts occur:

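import { afterEach, beforeEach, expect, it, vi } from 'vitest';

const mockFetch = vi.fn();

beforeEach(() => {
  mockFetch.mockReset();
  // fetchWithRetry calls the global fetch, so stub it rather than passing it in options.
  vi.stubGlobal('fetch', mockFetch);
});

afterEach(() => {
  vi.unstubAllGlobals();
});

it('retries 3 times on 503 errors', async () => {
  // A 503 resolves normally with ok === false; it does not reject.
  mockFetch.mockResolvedValue({ ok: false, status: 503 });

  await expect(fetchWithRetry('/api/data')).rejects.toThrow('HTTP 503');

  // One initial attempt plus three retries.
  expect(mockFetch).toHaveBeenCalledTimes(4);
});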

Use tools like MSW (Mock Service Worker) to intercept requests at the network level and simulate various error conditions without modifying application code. MSW can return specific HTTP status codes, delay responses, and set headers like Retry-After to test rate limiting behavior comprehensively.

Test circuit breaker state transitions by deliberately causing failures and verifying the circuit opens and closes at appropriate times. Write tests that verify the breaker trips after N failures, rejects requests in open state, and allows recovery in half-open state:

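it('opens circuit after 5 failures', async () => {
  const breaker = new CircuitBreaker(5, 1000);

  // Reject with a real Error so rejects.toThrow() has something to match.
  for (let i = 0; i < 5; i++) {
    await expect(
      breaker.execute(() => Promise.reject(new Error('fail')))
    ).rejects.toThrow('fail');
  }

  // The sixth call is rejected by the breaker without invoking the request.
  await expect(breaker.execute(() => Promise.resolve()))
    .rejects.toThrow('Circuit breaker is open');
});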

Measure total retry time under various failure scenarios to ensure it meets your latency requirements. Tests should verify that exponential backoff produces reasonable total delays and that jitter actually disperses timing across iterations. Use fake timers carefully: the retry delays are scheduled with setTimeout, so a test that mocks timers must advance the fake clock past each pending delay or it will hang.
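
One way to measure backoff timing without fake timers is to make the sleep function injectable, so a test records the requested delays instead of waiting for them. The sketch below assumes a hypothetical `retryWithSleep` helper with a `sleep` option. The helpers shown earlier in this guide do not have that parameter, so adopting this approach would mean adding it.

```javascript
// Retry helper with an injectable sleep function (an assumption of this sketch),
// so tests can record delays instead of waiting for them.
async function retryWithSleep(fn, { maxRetries = 3, baseDelay = 100, sleep } = {}) {
  const wait = sleep ?? (ms => new Promise(resolve => setTimeout(resolve, ms)));
  let lastError;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      if (attempt < maxRetries) await wait(baseDelay * 2 ** attempt);
    }
  }
  throw lastError;
}

// Runs an always-failing operation and reports the delays that were requested.
async function recordDelays(maxRetries, baseDelay) {
  const delays = [];
  const sleep = async ms => { delays.push(ms); };
  let calls = 0;
  try {
    await retryWithSleep(
      async () => { calls++; throw new Error('always fails'); },
      { maxRetries, baseDelay, sleep }
    );
  } catch {
    // Expected: every attempt fails.
  }
  return { calls, delays, total: delays.reduce((a, b) => a + b, 0) };
}

recordDelays(3, 100).then(result => console.log(result));
// calls: 4, delays: [100, 200, 400], total: 700
```

Because no real time passes, a test built this way can assert on the exact delay sequence and the total wait, and it runs in milliseconds.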

Test non-retriable error paths to ensure errors that shouldn't retry don't trigger retry behavior. This prevents accidentally retrying authentication failures or malformed requests, which would delay proper error handling.

Conclusion

The retry pattern is essential for building reliable web applications that communicate with external services. When implemented correctly--with exponential backoff, jitter, circuit breakers, and proper error classification--retries transform transient failures into seamless user experiences. When implemented carelessly, they amplify problems and create cascading failures.

Start with conservative retry settings (2-3 retries, base delays of 100-500ms), add jitter to prevent synchronization, and implement circuit breakers for critical dependencies. Monitor your retry patterns in production and adjust as you learn how services you depend on actually behave.

The goal is a system that gracefully handles the inherent unreliability of networks while remaining responsive and resource-efficient. That balance is what separates robust applications from fragile ones. By treating retries as a reliability engineering problem rather than an afterthought, you build applications that serve users reliably even when the underlying infrastructure experiences issues.

Implementing robust error handling patterns like retries connects directly to our broader approach to web development services, where reliability and user experience are paramount. These patterns work alongside proper API design practices and asynchronous programming patterns to create applications that perform consistently under real-world conditions.

Sources

  1. Microsoft Learn: Retry Storm Antipattern - Comprehensive guide on preventing retry storms in cloud applications, covering client-side strategies and server-side protections.
  2. ByteByteGo: A Guide to Retry Pattern in Distributed Systems - In-depth exploration of retry patterns in distributed systems and the Fallacies of Distributed Computing.
  3. MDN Web Docs: Using Fetch - Official reference for JavaScript fetch API implementation and error handling patterns.