Bash by Example: Daemon Heartbeat Monitor
Simulating a heartbeat mechanism to track service liveness by periodically updating timestamp files or sending heartbeat signals, implementing dead man's switch patterns for monitoring, detecting silent failures when heartbeats stop, building distributed system health checks, and validating service responsiveness.
Code
#!/bin/bash

HEARTBEAT_FILE="/tmp/daemon.heartbeat"
TIMEOUT_SECONDS=10

echo "Starting Heartbeat Monitor (Ctrl+C to stop)..."

# Function to simulate the daemon updating its heartbeat
simulate_daemon() {
    while true; do
        # Update timestamp in file
        date +%s > "$HEARTBEAT_FILE"
        # Daemon sleeps for random 1-5 seconds
        sleep $(( ( RANDOM % 5 ) + 1 ))
    done
}

# Start the "daemon" in background
simulate_daemon &
daemon_pid=$!
echo "Daemon started (PID: $daemon_pid). Updating heartbeat..."

# Stop the background daemon and remove the heartbeat file when the monitor
# exits; without this, the daemon keeps running after Ctrl+C
cleanup() {
    kill "$daemon_pid" 2>/dev/null
    rm -f "$HEARTBEAT_FILE"
}
trap cleanup EXIT
trap 'exit' INT TERM

# Monitor loop
while true; do
    sleep 2
    if [ ! -f "$HEARTBEAT_FILE" ]; then
        echo "WARNING: No heartbeat file found!"
        continue
    fi
    current_time=$(date +%s)
    last_beat=$(cat "$HEARTBEAT_FILE")
    # Skip this round if the file was read mid-write (empty or non-numeric)
    case "$last_beat" in
        ''|*[!0-9]*) continue ;;
    esac
    diff=$((current_time - last_beat))
    if [ "$diff" -gt "$TIMEOUT_SECONDS" ]; then
        echo "CRITICAL: Heartbeat lost! Last beat was $diff seconds ago."
        # Action: Restart daemon
        kill "$daemon_pid" 2>/dev/null
        simulate_daemon &
        daemon_pid=$!
        echo "Daemon restarted (PID: $daemon_pid)."
    else
        echo "Status OK. Last beat: $diff seconds ago."
    fi
done
Explanation
In distributed systems and microservices, a "heartbeat" is a periodic signal sent by a service to indicate that it is still alive and functioning correctly. Unlike a simple process check (which only tells you if the PID exists), a heartbeat confirms that the internal logic of the application is not stuck in a deadlock or infinite loop.
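The difference between the two kinds of check can be sketched with a pair of small functions. This is a minimal illustration rather than part of the script above: the names pid_alive and heartbeat_fresh are hypothetical, and the heartbeat file is assumed to contain a single Unix timestamp.

```shell
#!/bin/bash

# Process check: succeeds if a process with this PID exists and can be
# signalled. It says nothing about whether the process is making progress.
pid_alive() {
    kill -0 "$1" 2>/dev/null
}

# Heartbeat check: succeeds only if the timestamp stored in the file ($1)
# is at most $2 seconds old.
heartbeat_fresh() {
    local file=$1 max_age=$2 last_beat now
    [ -f "$file" ] || return 1
    last_beat=$(cat "$file")
    case "$last_beat" in
        ''|*[!0-9]*) return 1 ;;   # empty or non-numeric content
    esac
    now=$(date +%s)
    [ $(( now - last_beat )) -le "$max_age" ]
}

# Demo: a "hung" worker that stays in the process table but stopped beating.
hb_file=$(mktemp)
echo $(( $(date +%s) - 60 )) > "$hb_file"   # pretend the last beat was 60 s ago
sleep 30 &                                   # stands in for a stuck process
worker_pid=$!

pid_alive "$worker_pid" && echo "process check: alive"
heartbeat_fresh "$hb_file" 10 || echo "heartbeat check: stale"

kill "$worker_pid" 2>/dev/null
rm -f "$hb_file"
```

In the demo the process check passes while the heartbeat check fails, which is exactly the silent failure a PID-only check would miss.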
This script demonstrates a file-based heartbeat mechanism. The "daemon" (simulated here as a background function) periodically writes the current Unix timestamp to a file. The monitor script checks this file at regular intervals. If the timestamp in the file gets too old (older than the defined timeout), the monitor assumes the daemon has hung or crashed and takes corrective action.
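One detail worth knowing about file-based heartbeats: a plain > redirection truncates the file before the new timestamp is written, so a monitor that reads at that instant can see an empty file. A common way to avoid this, sketched here under the assumption that the temporary file and the heartbeat file live on the same filesystem, is to write to a temporary file and rename it into place. The function name write_heartbeat is hypothetical.

```shell
#!/bin/bash

# Write the current Unix timestamp to the heartbeat file ($1) atomically.
# The temporary file is created next to the target so that mv is a rename
# on the same filesystem rather than a copy.
write_heartbeat() {
    local file=$1 tmp
    tmp=$(mktemp "${file}.XXXXXX") || return 1
    date +%s > "$tmp" && mv -f "$tmp" "$file"
}

# Demo in a throwaway directory
demo_dir=$(mktemp -d)
write_heartbeat "$demo_dir/daemon.heartbeat"
echo "Heartbeat written: $(cat "$demo_dir/daemon.heartbeat")"
rm -rf "$demo_dir"
```

With this approach a reader sees either the previous timestamp or the new one, never a half-written file.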
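Heartbeat-based failure detection is a well-established pattern. Cluster tools such as Pacemaker and Corosync rely on the same underlying idea, although they exchange liveness messages between nodes over the network rather than through a file. Implementing a simple version in Bash is a good way to check that your custom background workers are actually doing work, not just sitting idle in the process table.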
Code Breakdown
date +%s returns the number of seconds since the Unix Epoch (1970-01-01 00:00:00 UTC). This integer format makes time calculations (subtraction and comparison) trivial in Bash.
$((current_time - last_beat)) uses Bash arithmetic expansion to calculate the "lag". If this lag exceeds our tolerance, we trigger the failure logic.
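The lag calculation can be worked through with fixed numbers. The two timestamps below are made-up values chosen for illustration, and the variable name lag is used here in place of the script's diff.

```shell
#!/bin/bash

TIMEOUT_SECONDS=10

# Fixed example timestamps (seconds since the Unix Epoch), for illustration only
last_beat=1700000000
current_time=1700000012

# Arithmetic expansion: plain integer subtraction, no external tools needed
lag=$(( current_time - last_beat ))
echo "Lag: $lag seconds"

# (( ... )) evaluates an arithmetic condition; 12 > 10, so the beat is stale
if (( lag > TIMEOUT_SECONDS )); then
    echo "Heartbeat is stale"
else
    echo "Heartbeat is fresh"
fi
```

Here the lag is 12 seconds, which exceeds the 10-second timeout, so the stale branch runs.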
