Samplebadu

Bash by Example: File Sync with Checksum

Bash 5.0+

This example verifies file integrity and synchronizes two directories by comparing content hashes computed with md5sum or sha256sum. Comparing hash values detects which files have changed, which lets you automate verification, catch corruption or unauthorized modification, and keep backups consistent with their source.

Code

#!/bin/bash

SOURCE_DIR="source_data"
DEST_DIR="mirror_data"

mkdir -p "$SOURCE_DIR" "$DEST_DIR"
# Create some dummy files
echo "Content A" > "$SOURCE_DIR/file1.txt"
echo "Content B" > "$SOURCE_DIR/file2.txt"
# Create a stale copy in dest
echo "Old Content" > "$DEST_DIR/file1.txt"

echo "Synchronizing $SOURCE_DIR to $DEST_DIR..."

# Iterate through source files
for file in "$SOURCE_DIR"/*; do
    filename=$(basename "$file")
    dest_file="$DEST_DIR/$filename"
    [ -f "$file" ] || continue  # Skip directories and an unmatched glob
    if [ -f "$dest_file" ]; then
        # Calculate checksums
        # md5sum outputs: "HASH  filename"
        sum1=$(md5sum "$file" | awk '{print $1}')
        sum2=$(md5sum "$dest_file" | awk '{print $1}')

        if [ "$sum1" != "$sum2" ]; then
            echo "Update detected: $filename"
            cp "$file" "$dest_file"
        else
            echo "Skipping: $filename (Identical)"
        fi
    else
        echo "New file: $filename"
        cp "$file" "$dest_file"
    fi
done

echo "Sync complete."
rm -rf "$SOURCE_DIR" "$DEST_DIR"

Explanation

File synchronization is the process of ensuring two or more locations contain the same data. While tools like rsync are the industry standard for this, writing your own sync script is an excellent way to understand the underlying mechanics of file integrity. The simplest form of sync checks file modification times, but this can be unreliable if timestamps are preserved incorrectly or if a file is "touched" without changing its content.
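To illustrate why timestamps alone can mislead, here is a minimal sketch that is not part of the original script. It backdates one of two identical files so that a modification-time test reports a change while a checksum comparison does not. The file names and the workdir variable are invented for the demonstration.

```shell
#!/bin/bash
# Sketch: a newer timestamp does not imply changed content.
workdir=$(mktemp -d)
echo "Content A" > "$workdir/original.txt"
cp "$workdir/original.txt" "$workdir/copy.txt"

# Backdate the original so the copy looks newer (content is untouched)
touch -t 200001010000 "$workdir/original.txt"

# Timestamp-based check: -nt is true when the left file is newer
if [ "$workdir/copy.txt" -nt "$workdir/original.txt" ]; then
    by_time="changed"
else
    by_time="unchanged"
fi

# Content-based check: compare the hash column of md5sum's output
sum1=$(md5sum "$workdir/original.txt" | awk '{print $1}')
sum2=$(md5sum "$workdir/copy.txt" | awk '{print $1}')
if [ "$sum1" != "$sum2" ]; then
    by_content="changed"
else
    by_content="unchanged"
fi

echo "By timestamp: $by_time"
echo "By content:   $by_content"
rm -rf "$workdir"
```

A timestamp-only sync would copy this file again even though nothing in it changed.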

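A more robust method uses checksums (or hashes). A checksum is a digital fingerprint of a file's contents: changing even a single bit produces a completely different value. Common algorithms include MD5, SHA-1, and SHA-256. If the checksums of the source and destination files differ, the files definitely differ, regardless of what their timestamps say. If the checksums match, the files are identical for all practical purposes, because accidental collisions are vanishingly unlikely. MD5 and SHA-1 are no longer collision-resistant against a deliberate attacker, so prefer sha256sum when the goal is to detect unauthorized modification rather than accidental corruption.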
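As a rough illustration of the fingerprint idea, the sketch below hashes three small files with sha256sum. The hash_of helper and the file names are hypothetical and exist only for this example.

```shell
#!/bin/bash
# Sketch: identical content gives identical hashes; a one-character
# difference gives an unrelated hash.
workdir=$(mktemp -d)
printf 'Content A\n' > "$workdir/a.txt"
printf 'Content B\n' > "$workdir/b.txt"
printf 'Content A\n' > "$workdir/a_copy.txt"

# Hypothetical helper: print only the hash column of sha256sum's output
hash_of() {
    sha256sum "$1" | awk '{print $1}'
}

hash_a=$(hash_of "$workdir/a.txt")
hash_b=$(hash_of "$workdir/b.txt")
hash_copy=$(hash_of "$workdir/a_copy.txt")

if [ "$hash_a" = "$hash_copy" ]; then
    echo "a.txt and a_copy.txt: same hash"
fi
if [ "$hash_a" != "$hash_b" ]; then
    echo "a.txt and b.txt: different hash"
fi
rm -rf "$workdir"
```

SHA-256 hashes are printed as 64 hexadecimal characters, compared with 32 for MD5.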

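This script uses md5sum to generate these fingerprints. It iterates through the source directory, checks whether a corresponding file exists in the destination, and compares their hashes. If the hashes differ (or the destination file is missing), it copies the file. After the run, every file at the top level of the source has an identical copy in the destination. The script does not recurse into subdirectories or delete files that exist only in the destination, so the result is a one-way update rather than a full mirror.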
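The same hash comparison can also serve as a verification pass after copying. The sketch below is one possible approach, not something the original script does: it records the source hashes in a manifest and checks the destination against it with sha256sum -c. The manifest name and the simulated tampering are assumptions made for the demonstration.

```shell
#!/bin/bash
# Sketch: verify a copied directory against a manifest of source hashes.
workdir=$(mktemp -d)
mkdir -p "$workdir/source_data" "$workdir/mirror_data"
echo "Content A" > "$workdir/source_data/file1.txt"
echo "Content B" > "$workdir/source_data/file2.txt"
cp "$workdir/source_data/"* "$workdir/mirror_data/"

# Record the source hashes using names relative to the directory
(cd "$workdir/source_data" && sha256sum -- * > "$workdir/manifest.sha256")

# Simulate corruption of one file in the destination
echo "Tampered" > "$workdir/mirror_data/file2.txt"

# sha256sum -c re-hashes each listed file and exits non-zero on any mismatch
if (cd "$workdir/mirror_data" && sha256sum -c --quiet "$workdir/manifest.sha256" >/dev/null 2>&1); then
    verify_status="ok"
else
    verify_status="mismatch"
fi

echo "Verification: $verify_status"
rm -rf "$workdir"
```

Because the manifest stores relative names, the same file can be checked against any directory that is supposed to hold the same content.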

Code Breakdown

Line 17: basename strips the directory path from a filename (e.g., source_data/file1.txt becomes file1.txt). This allows us to construct the corresponding path in the destination directory.
Line 23: md5sum calculates the hash. We pipe it to awk '{print $1}' because the command outputs both the hash and the filename, and we only want the hash string for comparison.
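The two steps above can be tried in isolation. The following sketch shows basename next to the equivalent parameter expansion, and the raw md5sum output next to two ways of extracting the hash. The ${file##*/} form and the cut variant are alternatives offered here for comparison, not something the original script uses.

```shell
#!/bin/bash
# Sketch: extracting the file name and the hash as separate steps.
workdir=$(mktemp -d)
mkdir -p "$workdir/source_data"
file="$workdir/source_data/file1.txt"
echo "Content A" > "$file"

# Two ways to drop the directory part of a path
filename=$(basename "$file")   # external command, as in the script
filename_exp=${file##*/}       # parameter expansion, no extra process

# md5sum prints the hash, then the file name
raw=$(md5sum "$file")

# Two ways to keep only the hash
hash_awk=$(md5sum "$file" | awk '{print $1}')
hash_cut=$(md5sum < "$file" | cut -d' ' -f1)   # reading stdin prints "-" as the name

echo "basename:   $filename"
echo "expansion:  $filename_exp"
echo "raw output: $raw"
echo "awk:        $hash_awk"
echo "cut:        $hash_cut"
rm -rf "$workdir"
```

Both name forms yield file1.txt here, and both hash forms yield the same 32-character MD5 string.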