
Bash by Example: Duplicate File Detector

Bash 5.0+

Finding duplicate files by comparing MD5 hashes to identify identical content regardless of filename, grouping matching hashes with a find | md5sum | sort | uniq pipeline, reclaiming disk space by removing redundant copies, and automating deduplication to keep a file system lean.

Code

#!/bin/bash

SEARCH_DIR="."

echo "Scanning for duplicates in $SEARCH_DIR..."

# 1. Find all files
# 2. Calculate MD5 hash for each
# 3. Sort by hash (so duplicates are adjacent)
# 4. Use uniq to find duplicates based on the hash (first 32 chars)

# Format of md5sum output: "HASH  filename"
find "$SEARCH_DIR" -type f -exec md5sum {} + | \
    sort | \
    uniq -w32 -dD

# Explanation of flags:
# uniq -w32 : Compare only the first 32 characters (the hash)
# uniq -d   : Print only duplicate lines
# uniq -D   : Print ALL copies of the duplicate lines (not just one)
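# Note: -w and -D are GNU uniq options; a stock BSD/macOS uniq may not support them.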

echo "Scan complete."

Explanation

Over time, file systems tend to accumulate duplicate files: downloads saved twice, backups of backups, or copy-pasted project folders. Identifying these duplicates by name alone is unreliable because files get renamed. Comparing sizes is better but not foolproof, since two different files can have exactly the same byte count. The only dependable test is to compare their content hashes.
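To convince yourself that hashing is the right test, here is a minimal sketch, assuming a throwaway directory and arbitrary example filenames: it creates two differently named files with identical content plus a third file of the same size but different content, then hashes all three.

mkdir -p /tmp/dup-demo && cd /tmp/dup-demo
printf 'hello world\n' > report.txt
cp report.txt report_copy_final.txt   # same content, different name
printf 'hello earth\n' > other.txt    # same size (12 bytes), different content
md5sum report.txt report_copy_final.txt other.txt

The first two hashes are identical, while the third differs even though all three files are exactly 12 bytes long. That is why the script compares hashes rather than names or sizes.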

This script hands the heavy lifting to a single pipeline. First, find locates every file and passes the names to md5sum, which produces a list of "HASH  filename" lines. Next, sort orders that list by hash, so identical hashes end up on adjacent lines. Finally, uniq filters the sorted list.
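If you are curious what each stage contributes, you can run the same pipeline incrementally (these are the script's own commands, just truncated with head so the output stays readable):

find . -type f -exec md5sum {} + | head             # raw "HASH  filename" lines, in traversal order
find . -type f -exec md5sum {} + | sort | head      # sorted, so identical hashes sit on adjacent lines
find . -type f -exec md5sum {} + | sort | uniq -w32 -dD   # only the groups of duplicates survive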

The uniq command is typically used to remove duplicates, but with the -d (duplicate) and -D (print all) flags, it becomes a detector. We tell it to only look at the first 32 characters of the line (the MD5 hash) using -w32. If it sees the same hash appear multiple times, it prints all those lines, revealing the groups of duplicate files.
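You can also watch the flags work in isolation by feeding uniq a small, already-sorted sample. The hashes and filenames below are made up purely for illustration:

uniq -w32 -dD <<'EOF'
111aaa111aaa111aaa111aaa111aaa11  photos/cat.jpg
111aaa111aaa111aaa111aaa111aaa11  downloads/cat (1).jpg
222bbb222bbb222bbb222bbb222bbb22  notes.txt
333ccc333ccc333ccc333ccc333ccc33  backup/report.pdf
333ccc333ccc333ccc333ccc333ccc33  report.pdf
EOF

uniq prints both cat image lines and both report.pdf lines because their first 32 characters match; the notes.txt line appears only once, so it is suppressed.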

Code Breakdown

13
find ... -exec md5sum {} +. The + at the end is an optimization. It tells find to pass as many filenames as possible to a single instance of md5sum, rather than running a new process for every single file.
15
uniq -w32 -dD. This is the magic combination. -w32 restricts the comparison to the hash. -dD ensures we see every file that is part of a duplicate set, allowing the user to decide which ones to keep.
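One possible refinement, if your uniq is the GNU version: -D has a long form, --all-repeated, that accepts a method argument, and --all-repeated=separate prints a blank line between duplicate groups. That makes it easier to eyeball each group and decide which copy to keep. A sketch of that variant, as a drop-in replacement for the pipeline in the script above:

find "$SEARCH_DIR" -type f -exec md5sum {} + | \
    sort | \
    uniq -w32 --all-repeated=separate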