How to Build a Simple Plagiarism Checker in Python

Learn how to build a basic plagiarism checker using Python. A step-by-step guide for beginners to check text similarity without complex math.

December 07, 2025
11 min read
#python #plagiarism #automation #text-processing #scikit-learn

Imagine you are a teacher with a stack of fifty essays to grade. You start reading one, and it sounds suspiciously familiar. Did you read this exact paragraph five minutes ago? Or maybe you are a content creator trying to make sure your new article is 100% original. In the digital age, copying and pasting is easier than ever, making plagiarism a common problem.

But here is the good news: catching it is also easier, thanks to programming! Today, we are going to build a simple Plagiarism Checker using Python. You don't need to be a math genius or a computer scientist to understand it. We will use simple tools to compare text and see how similar they are. By the end of this tutorial, you will have a working script that can compare two text files and tell you if they are copies.

Why Use Python?

When it comes to text processing and analysis, Python is the undisputed king. While you could technically build a plagiarism checker in any language—be it C++, Java, or even JavaScript—Python offers a unique combination of readability and power that makes it perfect for this task.

First and foremost is Simplicity. Python code reads almost like English. This means you can focus on the logic of your problem (finding similar text) rather than fighting with complex syntax or memory management. For a beginner project like this, being able to write a prototype in twenty lines of code is invaluable.

Second is the Ecosystem. Python comes with a "batteries-included" standard library. We don't need to download huge external packages to do basic text comparison; the tools we need are already built-in. Furthermore, if you decide to take this project to a professional level later, Python has industry-standard libraries like NLTK (Natural Language Toolkit) and scikit-learn ready to handle complex machine learning tasks.

Finally, it is completely Free and Private. Using online plagiarism checkers often requires paying a subscription fee or, worse, uploading your private documents to a remote server where you lose control over them. By building your own local tool, you ensure your data never leaves your computer.

The Magic Tool: difflib

To compare text, we will use a Python library called difflib. Think of it as a very patient editor who sits down with two pieces of paper and checks them word by word, highlighting the differences.

Specifically, we will use a class called SequenceMatcher. Under the hood, this uses an algorithm similar to the Gestalt Pattern Matching approach. It doesn't just count matching words; it actually looks for the longest contiguous matching subsequence. It tries to align the two texts to find the biggest chunks that are the same.

The result is a similarity ratio, a decimal number between 0 and 1:

  • 0.0 means the texts are completely different (no commonality).
  • 1.0 means the texts are exactly the same (100% match).
  • 0.75 means they are 75% similar.
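To make these anchor values concrete, here are a few quick checks you can run in the Python shell:

```python
from difflib import SequenceMatcher

# Identical strings score 1.0
print(SequenceMatcher(None, "hello", "hello").ratio())  # 1.0

# Strings with no characters in common score 0.0
print(SequenceMatcher(None, "abc", "xyz").ratio())      # 0.0

# "abcd" vs "abXd": three of the four characters align in each string
print(SequenceMatcher(None, "abcd", "abXd").ratio())    # 0.75
```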

Step 1: The Basic Code

Let's write our first script. Open your Python editor (like Thonny, VS Code, or IDLE) and create a new file named checker.py. We will start by importing the necessary module and defining a simple function.

python
from difflib import SequenceMatcher

def check_similarity(text1, text2):
    """
    Calculates the similarity ratio between two strings.
    Returns a float between 0 and 1.
    """
    # usage: SequenceMatcher(isjunk, a, b)
    # We set isjunk to None for now to keep it simple.
    matcher = SequenceMatcher(None, text1, text2)
    return matcher.ratio()

# Let's test it with two simple sentences
sentence_a = "Python is a great programming language."
sentence_b = "Python is a wonderful programming language."

score = check_similarity(sentence_a, sentence_b)
print(f"Similarity Score: {score}")
# Output will be around 0.85 (85%)

Code Breakdown

1. from difflib import SequenceMatcher: imports the specific class needed for comparison from Python's standard library, keeping the memory footprint low.
2. SequenceMatcher(None, text1, text2): creates a new matcher object. The first argument, None, means no characters are treated as ignorable "junk" for this simple check.
3. matcher.ratio(): returns a floating-point number between 0 and 1 representing the similarity. A result of 0.85 indicates 85% similarity.

If you run this code, you will see a similarity score of approximately 0.85 (85%). The technical reason for this high score lies in how SequenceMatcher calculates its ratio. It uses the formula 2 * M / T, where M is the number of matching characters and T is the total number of characters in both strings. In our example, the algorithm aligns the identical chunks "Python is a " and " programming language.", leaving only the words "great" and "wonderful" as differences. Because the matching sections make up the vast majority of the text, the ratio stays high.
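You can verify this arithmetic yourself: SequenceMatcher exposes the matching blocks it found, so a short sketch can recompute the ratio by hand with the 2 * M / T formula:

```python
from difflib import SequenceMatcher

a = "Python is a great programming language."
b = "Python is a wonderful programming language."

matcher = SequenceMatcher(None, a, b)

# Each block is a (start_in_a, start_in_b, size) triple;
# the final block is always a zero-size terminator
for block in matcher.get_matching_blocks():
    print(block, repr(a[block.a:block.a + block.size]))

# Recompute the ratio manually
M = sum(block.size for block in matcher.get_matching_blocks())
T = len(a) + len(b)
print(2 * M / T)        # identical to matcher.ratio()
print(matcher.ratio())
```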

Step 2: Cleaning the Text (Preprocessing)

Computers are famously literal. To a computer, "Hello" and "hello" are completely different strings because the ASCII value of 'H' is different from that of 'h'. Similarly, "Python" (no punctuation) and "Python." (with a period) are different. This can ruin our plagiarism detection.

To make our checker smarter, we must preprocess the text data. This involves two main steps:

  1. Normalization: Converting all text to lowercase so case differences don't matter.
  2. Cleaning: Removing punctuation symbols (periods, commas, quotes) that don't add semantic meaning.

Here is the improved version of our code:

python
import string
from difflib import SequenceMatcher

def clean_text(text):
    """
    Converts text to lowercase and removes punctuation.
    """
    # 1. Convert to lowercase
    text = text.lower()
    
    # 2. Remove punctuation
    # string.punctuation contains symbols like !"#$%&'()*+,-.
    # str.maketrans creates a mapping table that replaces punctuation with None
    translator = str.maketrans('', '', string.punctuation)
    text = text.translate(translator)
    
    return text

def check_similarity(text1, text2):
    # Clean both texts first before comparing
    cleaned1 = clean_text(text1)
    cleaned2 = clean_text(text2)
    
    matcher = SequenceMatcher(None, cleaned1, cleaned2)
    return matcher.ratio()

# Test with messy input
text_a = "I love coding in Python!"
text_b = "i love coding in python"

print(f"Similarity: {check_similarity(text_a, text_b)}")
# Output: 1.0 (100% match)

Code Breakdown

1. import string: imports standard string constants. We specifically need string.punctuation, which contains all standard punctuation marks.
2. text.lower(): converts the entire string to lowercase, so "Python" and "python" are treated as identical.
3. str.maketrans('', '', string.punctuation): creates a translation table. The third argument lists characters to delete, telling Python to map every punctuation character to None.
4. text.translate(translator): applies the translation table to the string. This is extremely fast because it happens at the C level, avoiding the overhead of regular expressions.

Now, even though text_a has capitalization and an exclamation mark, the computer effectively sees "i love coding in python" for both inputs, resulting in a perfect match score of 1.0.

Step 3: Comparing Real Files

In the real world, you aren't checking single hardcoded sentences—you are checking entire documents saved on your disk. Let's create a robust function that reads text from files. We will also add error handling so our program doesn't crash if a file is missing.

python
def check_files(file_path1, file_path2):
    try:
        # Open the first file and read its content
        print(f"Reading {file_path1}...")
        with open(file_path1, 'r', encoding='utf-8') as f1:
            content1 = f1.read()
            
        # Open the second file and read its content
        print(f"Reading {file_path2}...")
        with open(file_path2, 'r', encoding='utf-8') as f2:
            content2 = f2.read()
            
        # Check similarity
        similarity = check_similarity(content1, content2)
        
        # Print the result as a percentage
        percentage = similarity * 100
        print(f"\nResults:")
        print(f"The files are {percentage:.2f}% similar.")
        
        if percentage > 50:
            print("WARNING: High similarity detected!")
        else:
            print("These documents appear different.")
        
    except FileNotFoundError:
        print("Error: One of the files was not found. Please check the filename.")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")

# To run the program:
# Check if you have 'essay1.txt' and 'essay2.txt' in the same folder
# check_files('essay1.txt', 'essay2.txt')

Code Breakdown

1. try...except: sets up an error-handling block. If a file doesn't exist, the program jumps to the except block instead of crashing.
2. with open(...): safe file handling. It opens the file and guarantees it will be closed automatically when the block ends, preventing resource leaks.
3. encoding='utf-8': crucial for modern text. It ensures that special characters, emojis, or non-English letters don't cause encoding errors.
4. percentage > 50: a simple threshold check. You can adjust this number based on how strict you want your plagiarism detection to be.

This final script is much more user-friendly. It tells you what it is doing ("Reading..."), it handles missing files gracefully, and it gives a qualitative warning if the similarity exceeds a certain threshold (like 50%).
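If you'd rather run the checker from a terminal than edit the script each time, one option is a small command-line wrapper around check_files. This is just a sketch: the main function, its messages, and the checker.py filename are our own additions, not part of the script above.

```python
def main(argv):
    """Hypothetical command-line entry point for the checker."""
    if len(argv) != 3:
        print("Usage: python checker.py <file1> <file2>")
        return 1  # non-zero return signals incorrect usage
    file1, file2 = argv[1], argv[2]
    # check_files(file1, file2)  # call the function defined earlier
    print(f"Would compare {file1} against {file2}")
    return 0

# Example invocation (normally you would pass sys.argv instead):
main(["checker.py", "essay1.txt", "essay2.txt"])
```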

Step 4: Checking a Whole Folder

Comparing two files is great, but what if you have a folder with 20 essays? You don't want to run the script 20 times manually.

In a real-world scenario, you rarely compare just one pair of documents. Teachers usually need to check a new student submission against an entire archive of past papers. Similarly, a content manager might want to ensure a new blog post doesn't cannibalize existing content on the site. To handle this efficiently, we need a loop that iterates through every file in a specific directory.

Our new function scan_directory will automate this process. It will take a "target file" (the one being checked) and a path to a folder containing all other documents. It calculates the similarity score for each one and, most importantly, sorts the results so the most similar files appear at the top of the list. This helps you instantly spot potential copies without wading through irrelevant data.

Let's write a function to scan a directory.

python
import os

def scan_directory(target_file, folder_path):
    # Read the target file
    with open(target_file, 'r', encoding='utf-8') as f:
        target_content = clean_text(f.read())
    
    print(f"Scanning files in {folder_path}...")
    
    files = [f for f in os.listdir(folder_path) if f.endswith('.txt')]
    results = []
    
    for filename in files:
        path = os.path.join(folder_path, filename)
        # Skip the target file itself (compare absolute paths to be safe)
        if os.path.abspath(path) == os.path.abspath(target_file):
            continue
        
        with open(path, 'r', encoding='utf-8') as f:
            content = clean_text(f.read())
            
        score = SequenceMatcher(None, target_content, content).ratio()
        results.append((filename, score))
    
    # Sort by highest similarity
    results.sort(key=lambda x: x[1], reverse=True)
    
    for name, score in results:
        print(f"{name}: {score:.1%}")

# Usage:
# scan_directory("my_essay.txt", "./class_essays")

Code Breakdown

1. os.listdir(folder_path): retrieves a list of the names of the entries in the directory given by folder_path.
2. if f.endswith('.txt'): a list comprehension filter. It ensures we only process text files, ignoring system files or images.
3. The self-check: skips the comparison when the file being scanned is the target file itself, which would trivially score a 100% match.
4. key=lambda x: x[1]: sorting logic. It tells sort to look at the second item in each tuple (the score), and reverse=True puts the highest scores first.

Step 5: Advanced Detection with TF-IDF

The SequenceMatcher is excellent for finding structural similarity (copy-paste), but it can be tricked by reordered words or slight paraphrasing. For a smarter checker, data scientists use TF-IDF (Term Frequency-Inverse Document Frequency) and Cosine Similarity.
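To see the problem concretely, here is what SequenceMatcher makes of a reworded sentence. The character-level score drops sharply even though almost every word is shared, which is exactly the gap TF-IDF is meant to close:

```python
from difflib import SequenceMatcher

original = "The quick brown fox jumps over the lazy dog."
reworded = "The lazy dog is jumped over by the quick brown fox."

# Character-level matching penalizes the reordering heavily,
# because only one long block can be aligned at a time
score = SequenceMatcher(None, original.lower(), reworded.lower()).ratio()
print(f"{score:.1%}")  # well below 100%
```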

This approach converts text into mathematical vectors, where each word has a weight. We can use the popular machine learning library scikit-learn to implement this easily.

python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "The quick brown fox jumps over the lazy dog.",
    "The lazy dog is jumped over by the quick brown fox.",
    "Python is a great programming language."
]

# 1. Vectorize the text
# TF-IDF converts words to numbers, weighing rare words more heavily
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

# 2. Calculate Cosine Similarity
# This measures the angle between vectors (0 to 1)
similarity_matrix = cosine_similarity(tfidf_matrix)

print("Similarity Matrix:")
print(similarity_matrix)

Code Breakdown

1. TfidfVectorizer(): initializes the converter. TF-IDF stands for Term Frequency-Inverse Document Frequency, which statistically determines how important a word is to a document.
2. fit_transform(documents): performs two steps: 'fit' learns the vocabulary (all unique words), and 'transform' converts the documents into a matrix of numbers based on that vocabulary.
3. cosine_similarity(tfidf_matrix): calculates the cosine of the angle between the document vectors. A result of 1 means they point in exactly the same direction (identical), while 0 means they are unrelated.

If you run this, you will see that the first two sentences score far higher against each other than against the third, because they share most of the same words even though the order is different. (They don't score a perfect 1.0, since tokens like "jumps" and "jumped" still count as different words.) This is the power of the Bag-of-Words model.
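To build intuition for why word order stops mattering, here is a minimal pure-Python sketch of cosine similarity over raw word counts—a bare bag-of-words without scikit-learn's IDF weighting (the word_cosine helper is our own, for illustration only):

```python
import math
from collections import Counter

def word_cosine(text1, text2):
    """Cosine similarity between raw word-count vectors (order-blind)."""
    c1 = Counter(text1.lower().split())
    c2 = Counter(text2.lower().split())
    dot = sum(c1[w] * c2[w] for w in c1.keys() & c2.keys())
    norm_sq1 = sum(v * v for v in c1.values())
    norm_sq2 = sum(v * v for v in c2.values())
    return dot / math.sqrt(norm_sq1 * norm_sq2)

# Same words, different order: a perfect score
print(word_cosine("the cat chased the dog", "the dog chased the cat"))  # 1.0
```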

Limitations and Conclusion

Congratulations! You have built a functional plagiarism checker and explored advanced techniques. While difflib is perfect for simple scripts, TF-IDF opens the door to professional text analysis.

To take this further, you could build a web interface using Flask or Streamlit to let users upload files directly. You could also explore N-grams to preserve some word ordering in your TF-IDF model. The possibilities are endless when you have the power of Python at your fingertips!
