Health checks are critical infrastructure that many developers overlook until production fails. Without proper health monitoring, failed instances continue receiving traffic, deployments become risky, and problems remain hidden until users report them. This guide covers comprehensive health check implementation for production Node.js applications.
Well-implemented health checks enable zero-downtime deployments, automatic recovery from failures, and reliable load balancer routing. They transform your application from a black box into an observable system where problems are detected proactively rather than reactively.
In modern deployments, health checks determine whether your application receives production traffic. Load balancers poll health endpoints every few seconds, and a single failed check can remove an instance from rotation. Kubernetes uses health checks to decide when to restart containers and when to route traffic to pods. These mechanisms only work when health checks accurately reflect your application's true state.
Why Health Checks Matter
Health checks serve as the nervous system of production applications. They enable load balancers to route traffic only to healthy instances, allow container orchestrators to restart failed containers automatically, and provide the visibility needed for zero-downtime deployments. Without proper health checks, a single failing instance can degrade user experience, trigger cascading failures, and create blind spots in your monitoring stack.
The Cost of Skipping Health Checks
When health checks are missing or inadequate, the consequences manifest quickly in production. Failed instances continue receiving traffic, returning errors to users while your monitoring system remains unaware of the problem. Deployments become risky because you cannot verify new instances are functional before switching traffic. Database connections leak when containers are killed without graceful shutdown. External dependencies degrade silently as no mechanism exists to detect the degradation.
Consider a scenario where your primary database becomes slow to respond. Without dependency health checking, your application continues receiving requests, each waiting for database connections until timeout. Users experience slow responses or failures, but your health check returns 200 because the HTTP server runs normally. A comprehensive health check that verifies database responsiveness would detect the issue and enable automatic failover or alerting.
Health Check Architecture
Effective health check architecture follows a layered approach:
- Liveness checks verify the process runs and responds--they only determine whether a container should be restarted
- Readiness checks verify the instance can handle production traffic, including dependency connectivity
- Deep checks examine business logic health, external service status, and data consistency
This tiered approach ensures quick detection of process failures while providing nuanced visibility into application health.
Liveness vs Readiness Probes
Kubernetes and modern load balancers distinguish between liveness and readiness probes, each serving a distinct purpose in container lifecycle management. Getting the distinction right prevents unnecessary restarts and keeps traffic away from instances that are not prepared to serve it.
Liveness Probes: Detecting Process Failure
Liveness probes answer a simple question: should this container be restarted? When a liveness check fails, Kubernetes terminates and restarts the container. This makes liveness checks ideal for detecting process crashes, infinite loops, or deadlocks--but inappropriate for checking transient conditions like database connectivity that might recover on their own.
A proper liveness endpoint verifies the process runs and the event loop is responsive. The simplest implementation returns 200 immediately, proving the server accepts connections:
app.get('/health/live', (req, res) => {
  res.status(200).json({
    status: 'alive',
    timestamp: new Date().toISOString(),
    uptime: process.uptime()
  });
});
However, this minimal check misses important failure modes. A more robust liveness check verifies the event loop progresses by tracking operation timestamps or executing a minimal asynchronous operation. This catches situations where the server accepts connections but cannot process requests due to blocked event loops or memory exhaustion.
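One way to approximate "the event loop progresses" is to measure how late a repeating timer fires. The sketch below is illustrative only: `createLoopMonitor`, `livenessHandler`, and the 1000 ms threshold are hypothetical names and values, not part of any library. A completely blocked loop cannot answer the probe at all (the probe simply times out), so this mainly catches loops that are badly delayed but still turning.

```javascript
// Hypothetical helper: tracks how much later than scheduled a timer fires.
function createLoopMonitor({ intervalMs = 500, now = Date.now } = {}) {
  let lastTick = now();
  let lagMs = 0;

  const timer = setInterval(() => {
    const current = now();
    // Lag is the delay beyond the scheduled interval.
    lagMs = Math.max(0, current - lastTick - intervalMs);
    lastTick = current;
  }, intervalMs);
  timer.unref(); // monitoring alone should not keep the process alive

  return {
    // Also count time since the last tick, so an ongoing stall is visible.
    currentLagMs: () => Math.max(lagMs, now() - lastTick - intervalMs),
    stop: () => clearInterval(timer),
  };
}

// Express-style handler: 503 when the measured lag exceeds the threshold.
function livenessHandler(monitor, maxLagMs = 1000) {
  return (req, res) => {
    const lag = monitor.currentLagMs();
    const alive = lag <= maxLagMs;
    res.status(alive ? 200 : 503).json({
      status: alive ? 'alive' : 'stalled',
      event_loop_lag_ms: lag,
      uptime: process.uptime(),
    });
  };
}
```

Assuming an Express `app`, this could be mounted as `app.get('/health/live', livenessHandler(createLoopMonitor()))`. The threshold should stay generous, since a liveness failure restarts the container.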
Readiness Probes: Routing Traffic Decisions
Readiness probes answer a different question: should this instance receive production traffic? Unlike liveness failures that trigger restarts, readiness failures simply remove the instance from load balancer rotation. This distinction is crucial because it allows instances to recover from transient issues--like a temporarily unreachable database--without the disruption of container restart.
A comprehensive readiness check examines all dependencies required for request processing. This includes database connectivity, cache availability, external API reachability, and any other resource a request depends on synchronously. The check should be fast (comfortably within the probe timeout, typically a few seconds at most) and should not modify system state:
app.get('/health/ready', async (req, res) => {
  const checks = await Promise.all([
    checkDatabase(),
    checkCache(),
    checkExternalServices()
  ]);
  const allHealthy = checks.every(check => check.healthy);
  if (allHealthy) {
    res.status(200).json({ status: 'ready', checks });
  } else {
    res.status(503).json({ status: 'not_ready', checks });
  }
});
The response from readiness checks provides operational visibility. Including check results in the response helps operators understand why an instance is not ready, accelerating troubleshooting. This diagnostic information proves invaluable during incidents when you need to quickly identify which dependency caused the failure.
Basic Health Check Implementation
Building a health check system requires balancing simplicity against thoroughness. Start with a functional baseline, then add sophistication as your application evolves. The goal is an implementation that provides value immediately while remaining maintainable as your architecture grows.
Setting Up Health Check Endpoints
Express.js routers make health check endpoints straightforward to implement. Organize health endpoints under a dedicated router to avoid conflicts with application routes and to simplify monitoring configuration:
const express = require('express');
const router = express.Router();

// Liveness: Process is running
router.get('/health/live', (req, res) => {
  res.status(200).json({
    status: 'alive',
    timestamp: new Date().toISOString(),
    uptime: process.uptime()
  });
});

// Readiness: Can handle production traffic
router.get('/health/ready', async (req, res) => {
  try {
    const health = await performHealthChecks();
    if (health.healthy) {
      res.status(200).json(health);
    } else {
      res.status(503).json(health);
    }
  } catch (error) {
    res.status(503).json({
      healthy: false,
      error: error.message,
      timestamp: new Date().toISOString()
    });
  }
});

module.exports = router;
This structure separates concerns cleanly. The liveness endpoint remains simple and fast, suitable for frequent checking. The readiness endpoint performs comprehensive verification and provides detailed status information. Both endpoints include timestamps, enabling monitoring systems to track response times and detect delays.
Health Check Response Format
A well-designed response format serves both human operators and automated monitoring systems. Include sufficient detail for debugging while maintaining parseable structure for alerting tools:
{
  "status": "healthy",
  "timestamp": "2026-01-09T12:00:00Z",
  "version": "1.2.3",
  "checks": {
    "database": {
      "status": "healthy",
      "latency_ms": 15,
      "details": { "pool_size": 10, "active_connections": 3 }
    },
    "cache": {
      "status": "healthy",
      "latency_ms": 2
    },
    "external_api": {
      "status": "degraded",
      "latency_ms": 2500,
      "warning": "Elevated response times detected"
    }
  }
}
The status field provides a quick signal for alerting systems: a healthy instance returns HTTP 200 and an unhealthy one returns 503. The checks object provides detail for operators investigating issues. Including latency measurements enables capacity planning and performance monitoring. Version information helps verify deployment correctness during rollouts.
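As a rough sketch of how such a response might be assembled, the hypothetical `buildHealthResponse` helper below maps per-dependency results to an overall status and HTTP code. It assumes, as in the sample response, that a "degraded" dependency does not make the instance unhealthy; only an "unhealthy" check produces a 503. That policy is an assumption and should be adjusted to your own alerting rules.

```javascript
// Hypothetical helper: derives the overall status and HTTP code from check results.
// `checks` maps a dependency name to an object with a `status` of
// 'healthy', 'degraded', or 'unhealthy'.
function buildHealthResponse(checks, { version = 'unknown', now = new Date() } = {}) {
  // Assumption: only an 'unhealthy' check fails the instance.
  const failing = Object.values(checks).some((check) => check.status === 'unhealthy');

  return {
    httpStatus: failing ? 503 : 200,
    body: {
      status: failing ? 'unhealthy' : 'healthy',
      timestamp: now.toISOString(),
      version,
      checks,
    },
  };
}
```

In an Express handler the result could be sent with `res.status(result.httpStatus).json(result.body)`.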
Checking Dependencies
Dependency checking transforms a basic health endpoint into a comprehensive operational tool. Without dependency verification, health checks create false confidence--returning 200 while the application cannot actually serve users. Each external resource your application requires should have a corresponding health check.
Database Connectivity Checks
Database checks must verify more than socket connectivity. A healthy database connection means the connection is alive and the database can execute queries within acceptable time bounds. Implement checks that execute lightweight queries to verify full database health:
async function checkDatabase() {
  const startTime = Date.now();
  try {
    // `db` is the application's database client. The per-query timeout
    // option shown here depends on the driver, so adapt it to yours.
    const result = await db.query('SELECT 1 as health_check', [], {
      timeout: 5000 // 5 second timeout
    });
    const latency = Date.now() - startTime;

    // Verify query returned expected result
    if (result.rows[0]?.health_check !== 1) {
      return {
        healthy: false,
        error: 'Unexpected query result',
        latency_ms: latency
      };
    }

    // Check connection pool health
    const poolStats = {
      totalConnections: db.pool.totalCount,
      idleConnections: db.pool.idleCount,
      waitingRequests: db.pool.waitingCount
    };

    return {
      healthy: true,
      latency_ms: latency,
      details: poolStats
    };
  } catch (error) {
    return {
      healthy: false,
      error: error.message,
      latency_ms: Date.now() - startTime
    };
  }
}
This implementation verifies query execution, enforces timeouts to prevent blocking, and provides connection pool visibility. The timeout is critical--without it, a slow database query blocks the entire health check, potentially causing cascading failures as load balancers mark healthy instances as failed.
Cache and External Service Verification
Cache systems like Redis require specialized health verification. Beyond basic connectivity, check that Redis can actually store and retrieve data:
async function checkRedis() {
  // Record the start time so latency can be reported on success
  const startTime = Date.now();
  try {
    // Verify connectivity with PING
    const pingResult = await redis.ping();
    if (pingResult !== 'PONG') {
      return { healthy: false, error: 'Unexpected PING response' };
    }

    // Verify read/write capability with small test operation
    const testKey = '__health_check_' + Date.now();
    await redis.set(testKey, 'test', 'EX', 10);
    const value = await redis.get(testKey);
    await redis.del(testKey);
    if (value !== 'test') {
      return { healthy: false, error: 'Read/write verification failed' };
    }

    return { healthy: true, latency_ms: Date.now() - startTime };
  } catch (error) {
    return { healthy: false, error: error.message };
  }
}
External service dependencies require careful handling in health checks. You cannot afford to wait for slow or unavailable external services during health verification--timeout enforcement is critical:
async function checkExternalService(serviceUrl, timeoutMs = 3000) {
  const startTime = Date.now();
  try {
    // The built-in fetch has no `timeout` option; an abort signal
    // enforces the deadline instead (AbortSignal.timeout, Node.js 18+)
    const response = await fetch(serviceUrl, {
      method: 'HEAD',
      signal: AbortSignal.timeout(timeoutMs)
    });
    if (!response.ok) {
      return {
        healthy: false,
        error: `HTTP ${response.status}`,
        latency_ms: Date.now() - startTime
      };
    }
    return {
      healthy: true,
      latency_ms: Date.now() - startTime,
      details: { status: response.status }
    };
  } catch (error) {
    return {
      healthy: false,
      error: error.name === 'TimeoutError' ? 'Timeout' : error.message,
      latency_ms: Date.now() - startTime
    };
  }
}
Graceful Shutdown Handling
Graceful shutdown ensures your application handles deployment and scaling events without disrupting user requests. When Kubernetes sends SIGTERM or you manually stop the application, proper handling prevents connection drops, data loss, and inconsistent state. Health checks and shutdown handling work together--readiness should return not-ready immediately upon receiving termination signals.
Signal Handling Fundamentals
On Linux, SIGTERM is the standard termination signal and SIGINT is the interrupt signal (Ctrl+C). Kubernetes sends SIGTERM before terminating containers, and by default your application has 30 seconds (terminationGracePeriodSeconds) to shut down before it receives SIGKILL. Keep signal handlers short: set a flag that other code checks and start the shutdown sequence asynchronously:
let isShuttingDown = false;

function handleSignal(signal) {
  if (isShuttingDown) return; // ignore repeated signals
  console.log(`Received ${signal}, initiating graceful shutdown`);
  isShuttingDown = true;
  // Immediately mark as not ready to stop receiving traffic
  markNotReady();
  // Begin graceful shutdown after a brief delay
  setTimeout(() => gracefulShutdown(), 100);
}

process.on('SIGTERM', () => handleSignal('SIGTERM'));
process.on('SIGINT', () => handleSignal('SIGINT'));

function markNotReady() {
  // The readiness endpoint must read this flag and return 503 when it is false
  app.set('ready', false);
}
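The key insight is marking the application as not-ready immediately upon receiving termination signals. Once the load balancer observes the failed readiness check, it stops sending new requests while shutdown proceeds. The 100 ms delay shown above is only a placeholder; in practice the delay should be long enough for the load balancer to notice the change before the server stops accepting connections.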
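A readiness flag only helps if the readiness endpoint actually consults it. The following sketch shows one way to wire that up; `createReadinessGate` and the `shutting_down` status value are hypothetical names chosen for illustration, and the handlers follow the Express `(req, res, next)` convention.

```javascript
// Hypothetical helper: a shared flag plus a wrapper for the readiness handler.
function createReadinessGate() {
  let ready = true;

  return {
    markNotReady: () => { ready = false; },
    isReady: () => ready,
    // Wraps a readiness handler so it fails fast once shutdown has begun,
    // without running any dependency checks.
    guard: (handler) => (req, res, next) => {
      if (!ready) {
        res.status(503).json({ status: 'shutting_down' });
        return undefined;
      }
      return handler(req, res, next);
    },
  };
}
```

With a gate created at startup, the signal handler would call `gate.markNotReady()`, and the readiness route would be registered as `gate.guard(readinessHandler)`, so the 503 appears on the very next probe.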
Complete Shutdown Sequence
A complete shutdown sequence follows a specific order: stop accepting new requests, finish processing in-flight requests, close database connections, clear caches, and finally exit. Each step should have a timeout to prevent infinite waiting:
async function gracefulShutdown() {
  console.log('Starting graceful shutdown sequence');
  const shutdownTimeout = 25000; // 25 seconds (Kubernetes default is 30)
  const startTime = Date.now();

  // Hard deadline: force exit if any step below hangs
  setTimeout(() => {
    console.error('Shutdown timed out, forcing exit');
    process.exit(1);
  }, shutdownTimeout).unref();

  try {
    // Step 1: Close HTTP server (stops accepting new connections and
    // resolves once existing connections have ended)
    await new Promise((resolve, reject) => {
      server.close((err) => {
        if (err) return reject(err);
        console.log('HTTP server closed');
        resolve();
      });
    });

    // Step 2: Allow in-flight requests time to complete
    await waitForPendingRequests({
      timeout: shutdownTimeout - (Date.now() - startTime)
    });

    // Step 3: Close database connections
    await closeDatabaseConnections({ timeout: 5000 });

    // Step 4: Clear caches and close external connections
    await closeExternalConnections({ timeout: 3000 });

    console.log('Graceful shutdown completed successfully');
    process.exit(0);
  } catch (error) {
    console.error('Error during shutdown:', error);
    process.exit(1);
  }
}
Tracking in-flight requests requires middleware that increments a counter on request start and decrements on completion. The shutdown function waits for this counter to reach zero before proceeding.
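The counter-based tracking described above might look like the sketch below. `createRequestTracker` is a hypothetical helper, not an Express API. It relies on the `finish` and `close` events that Node.js response objects emit, and its `waitForPendingRequests` resolves to `true` when the counter reaches zero or `false` if the timeout elapses first.

```javascript
// Hypothetical helper: counts in-flight requests and lets shutdown wait for them.
function createRequestTracker() {
  let inFlight = 0;
  const waiters = [];

  function notifyIfDrained() {
    if (inFlight === 0) {
      waiters.splice(0).forEach((resolve) => resolve(true));
    }
  }

  return {
    // Express-style middleware: register it before the application routes.
    middleware(req, res, next) {
      inFlight += 1;
      let done = false;
      const finish = () => {
        if (done) return; // 'finish' and 'close' can both fire for one response
        done = true;
        inFlight -= 1;
        notifyIfDrained();
      };
      res.once('finish', finish);
      res.once('close', finish);
      next();
    },

    count: () => inFlight,

    // Resolves true once no requests remain, or false when the timeout expires.
    waitForPendingRequests({ timeout = 10000 } = {}) {
      if (inFlight === 0) return Promise.resolve(true);
      return new Promise((resolve) => {
        const timer = setTimeout(() => resolve(false), timeout);
        waiters.push((value) => {
          clearTimeout(timer);
          resolve(value);
        });
      });
    },
  };
}
```

Returning a boolean instead of rejecting on timeout is a design choice for this sketch: the shutdown sequence can log that requests were abandoned and still continue closing connections.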
Kubernetes Integration
Kubernetes probe configuration determines how aggressively the platform checks health and responds to failures. Proper configuration prevents both false positives (marking healthy instances as failed) and delayed detection (allowing too many requests to failed instances).
Probe Configuration Best Practices
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 5
  failureThreshold: 2
Key Configuration Parameters
- initialDelaySeconds: Allow startup time before checking--set based on actual startup duration
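- periodSeconds: How often to check; shorter periods detect failures sooner but add load (the example above probes readiness every 5 seconds and liveness every 10)
- timeoutSeconds: How long Kubernetes waits for a response; set it longer than your internal dependency check timeouts so the endpoint can answer with a 503 and diagnostic details before the probe gives up
- failureThreshold: Consecutive failures before action; prevents false positives
- terminationGracePeriodSeconds: A Pod-level setting rather than a probe field; the time allowed for graceful shutdown (30s by default)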
The startupProbe ensures slow-starting applications have time to initialize before liveness and readiness probes begin checking. This eliminates the race condition where Kubernetes restarts containers that haven't finished starting:
startupProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 0
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 30 # Allow 150 seconds for startup
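The 150-second figure comes from multiplying periodSeconds by failureThreshold. The small calculator below makes that arithmetic explicit for the probe settings in this section. The function names are hypothetical, and the results are approximations that ignore timeoutSeconds and scheduling jitter.

```javascript
// Kubernetes defaults, used for any field that is omitted.
const PROBE_DEFAULTS = { initialDelaySeconds: 0, periodSeconds: 10, failureThreshold: 3 };

// Approximate time a container has to pass its startup probe before it is killed.
function startupBudgetSeconds(probe) {
  const p = { ...PROBE_DEFAULTS, ...probe };
  return p.initialDelaySeconds + p.periodSeconds * p.failureThreshold;
}

// Approximate time for a running container to be marked failed
// once its checks start failing.
function detectionWindowSeconds(probe) {
  const p = { ...PROBE_DEFAULTS, ...probe };
  return p.periodSeconds * p.failureThreshold;
}

console.log(startupBudgetSeconds({ periodSeconds: 5, failureThreshold: 30 }));    // 150 (startup)
console.log(detectionWindowSeconds({ periodSeconds: 10, failureThreshold: 3 }));  // 30 (liveness)
console.log(detectionWindowSeconds({ periodSeconds: 5, failureThreshold: 2 }));   // 10 (readiness)
```

Read this way, the configuration above removes an unready instance from rotation after roughly 10 seconds, while a restart takes roughly 30 seconds of consecutive liveness failures.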
Performance Optimization
Health checks should not impact application performance. Implement checks that execute quickly and in parallel to minimize response latency. A slow health check can trigger cascading failures if monitoring systems interpret high latency as failure.
Parallel Check Execution
Execute all dependency checks in parallel rather than sequentially. Sequential checking takes the sum of every dependency's latency, while parallel checking takes only as long as the slowest check:
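// Helper to add a timeout to a promise without modifying Promise.prototype
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error('Check timeout')), ms);
  });
  // Clear the timer once either side settles so it does not linger
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Convert an allSettled entry into a plain, JSON-friendly check result
function toResult(settled) {
  return settled.status === 'fulfilled'
    ? settled.value
    : { healthy: false, error: settled.reason.message };
}

async function performHealthChecks() {
  const startTime = Date.now();
  const checks = await Promise.allSettled([
    withTimeout(checkDatabase(), 5000),
    withTimeout(checkCache(), 3000),
    withTimeout(checkExternalServices(), 5000),
    withTimeout(checkMessageQueue(), 3000)
  ]);
  const results = {
    database: toResult(checks[0]),
    cache: toResult(checks[1]),
    external: toResult(checks[2]),
    queue: toResult(checks[3])
  };

  // Determine overall health
  const healthy = Object.values(results).every(result => result.healthy);

  return {
    healthy,
    timestamp: new Date().toISOString(),
    total_latency_ms: Date.now() - startTime,
    checks: results
  };
}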
Promise.allSettled ensures one failed check doesn't prevent others from completing. Individual timeouts prevent slow checks from blocking the entire health endpoint.
Result Caching
For expensive checks, cache results for a short duration while accepting slightly stale status information. This reduces load on checked dependencies while maintaining acceptable accuracy:
const healthCache = {
  database: { value: null, expires: 0 },
  cache: { value: null, expires: 0 }
};

async function getCachedCheck(checkName, checkFn, cacheMs = 10000) {
  const now = Date.now();
  const cached = healthCache[checkName];
  if (cached && cached.expires > now) {
    return cached.value;
  }
  const result = await checkFn();
  healthCache[checkName] = { value: result, expires: now + cacheMs };
  return result;
}
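The cache duration should match your monitoring granularity, because it adds directly to failure detection time: a result cached for 10 seconds can report a dependency as healthy for up to 10 seconds after it fails. If your load balancer checks health every 5 seconds, a 10-second cache roughly halves the load on dependencies at the cost of that staleness.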
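When several probes arrive at once after a cached result expires, each of them can trigger its own dependency check. One possible refinement, sketched here with a hypothetical `createCachedCheck` helper, shares a single in-flight check among concurrent callers. The `now` parameter exists only to make the expiry logic easy to test.

```javascript
// Hypothetical helper: caches a check result and de-duplicates concurrent runs.
function createCachedCheck(checkFn, { ttlMs = 10000, now = Date.now } = {}) {
  let cached = null;   // { value, expires }
  let inFlight = null; // promise for a check that is currently running

  return async function cachedCheck() {
    if (cached && cached.expires > now()) {
      return cached.value;
    }
    if (!inFlight) {
      inFlight = Promise.resolve()
        .then(checkFn)
        .then((value) => {
          cached = { value, expires: now() + ttlMs };
          return value;
        })
        .finally(() => {
          inFlight = null; // allow a fresh check after this one settles
        });
    }
    // Concurrent callers all wait on the same running check.
    return inFlight;
  };
}
```

A failed check is not cached in this sketch, so the next probe retries immediately. Whether failures should also be cached briefly depends on how fragile the dependency is under load.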
Implementation Essentials
Keep liveness checks simple and fast. Make readiness checks comprehensive. Always enforce timeouts. Return structured JSON. Execute checks in parallel.
Kubernetes Configuration
Set initialDelaySeconds based on actual startup time. Configure separate liveness and readiness probes. Use startupProbe for slow initialization.
Operational Integration
Monitor health check latency, not just pass/fail status. Alert on rising latency before complete failure. Include metrics in dashboards.
Common Mistakes to Avoid
Don't perform expensive operations in checks. Don't route health endpoints through application logic that can itself fail, such as authentication or heavy middleware. Don't set timeouts too long.
Frequently Asked Questions
What's the difference between liveness and readiness probes?
Liveness probes determine if a container should be restarted (process crash, infinite loop). Readiness probes determine if an instance should receive traffic (dependency connectivity, capacity). Liveness failures restart containers; readiness failures only remove instances from load balancer rotation.
How long should health check timeouts be?
Health check timeouts should be short, typically 2-5 seconds. Internal dependency check timeouts should be shorter than the probe's timeoutSeconds so the endpoint can report a failure before Kubernetes gives up waiting. Setting timeouts too long delays failure detection and can cause cascading issues.
Should health checks verify database queries or just connectivity?
Health checks should execute lightweight queries to verify full database health, not just connectivity. A database connection can be open but the database unable to execute queries. Use SELECT 1 or similar lightweight queries with appropriate timeouts.
How often should health checks run?
Probe intervals of 5-10 seconds are typical; the configuration in this guide checks readiness every 5 seconds and liveness every 10. More frequent checking provides faster failure detection but increases load on the application and its dependencies.
How do health checks affect zero-downtime deployments?
Health checks enable zero-downtime deployments by verifying new instances are healthy before routing traffic. The sequence is: deploy new instance, health check passes, route traffic to new instance, drain old instance. Without health checks, you cannot safely verify new instances before switching traffic.
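The "health check passes" step in that sequence can be sketched as a polling loop that a deployment script runs before switching traffic. `waitUntilReady` is a hypothetical helper, and requiring two consecutive successes is an assumption meant to avoid acting on a single lucky response.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Polls `probe` (an async function returning true when the instance is ready)
// until it succeeds `requiredSuccesses` times in a row, or attempts run out.
async function waitUntilReady(probe, { attempts = 30, intervalMs = 2000, requiredSuccesses = 2 } = {}) {
  let streak = 0;
  for (let i = 0; i < attempts; i += 1) {
    let ok = false;
    try {
      ok = (await probe()) === true;
    } catch {
      ok = false; // a thrown error counts as a failed check
    }
    streak = ok ? streak + 1 : 0;
    if (streak >= requiredSuccesses) return true;
    if (i < attempts - 1) await sleep(intervalMs);
  }
  return false;
}
```

On Node.js 18 or later, the probe could be something like `() => fetch(readinessUrl).then((r) => r.ok)`, where `readinessUrl` stands for the new instance's /health/ready endpoint. Traffic is switched only if the function resolves to `true`.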
Sources
- LogRocket: How to implement a health check in Node.js - Comprehensive guide covering basic health checks, dependency checks, and timeout handling
- Express.js: Health Checks and Graceful Shutdown - Official documentation covering liveness/readiness probes, SIGTERM handling, and Kubernetes integration
- Kubernetes: Configure Liveness and Readiness Probes - Kubernetes probe types and configuration