Bash by Example: Auto Restart Crashed Process
Detecting service failure and automatically attempting to restart it using systemctl or service commands, implementing self-healing automation for critical services, adding retry logic with exponential backoff, logging restart attempts and outcomes, and preventing infinite restart loops with maximum attempt limits.
Code
#!/bin/bash
CMD="sleep 10" # The command to keep running
NAME="sleep" # The name to search for
echo "Watchdog started for '$NAME'..."
while true; do
if ! pgrep -f "$NAME" > /dev/null; then
echo "$(date): Process $NAME is down. Restarting..."
# Run in background so watchdog loop continues
$CMD &
# Give it a moment to start up
sleep 1
new_pid=$!
echo "$(date): Restarted with PID $new_pid"
fi
sleep 3
doneExplanation
Beyond simply monitoring and alerting, a self-healing system attempts to fix problems automatically. This "Watchdog" pattern involves detecting that a process has crashed and immediately launching a new instance of it. This is commonly used for worker scripts, data consumers, or simple servers that might crash intermittently due to bugs or memory issues.
The logic is an inversion of the monitor script: if ! pgrep .... If the process is not found, we enter the recovery block. We execute the command again, usually placing it in the background with & so that our watchdog script doesn't get blocked waiting for the worker to finish. It needs to stay free to keep monitoring.
While tools like systemd, Supervisor, or Docker are better suited for managing production services, understanding this Bash logic is invaluable for ad-hoc tasks, user-level services, or environments where you don't have root access to configure system daemons.
Code Breakdown
pgrep -f searches the full command line, not just the process name. This is useful if you are running a script like python worker.py and want to distinguish it from python server.py.$CMD & executes the command string stored in the variable. The & is crucial; without it, the watchdog would pause and become the worker itself, stopping its monitoring duties until the worker died again.
