Text Deduplication — Near-Duplicate Detection

Also known as: near-duplicate detection, document deduplication, corpus deduplication, Metin Tekilleştirme (Turkish)

Text deduplication is a corpus-quality pipeline that identifies and removes exact and near-duplicate documents from large text collections. Grounded in Andrei Broder's 1997 resemblance theory, it is widely used to improve dataset quality for machine learning model training, search engine indexing, and any downstream NLP task that assumes a non-redundant corpus.

When to use it

Text deduplication applies whenever you are working with a large text collection (crawled web data, scraped corpora, academic archives, or user-generated content) where redundant or near-identical documents are expected. It is a common preprocessing step before training language models, building search indices, or running any analysis that assumes document independence. Two conditions apply: the researcher must define a similarity threshold before processing, and for large-scale collections (tens of thousands of documents or more) an approximate method such as MinHash–LSH is generally needed, because exhaustive pairwise comparison grows quadratically with corpus size and quickly becomes impractical. A corpus of around 50 documents is a reasonable lower bound for a deduplication run; below that, manual inspection usually suffices.
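To make the scaling argument concrete, the short sketch below counts how many unordered document pairs an exhaustive comparison would have to score. The function name `pairwise_comparisons` is illustrative only and the corpus sizes are arbitrary examples.

```python
def pairwise_comparisons(n_docs: int) -> int:
    """Number of unordered document pairs an exhaustive comparison must score."""
    return n_docs * (n_docs - 1) // 2


# Arbitrary example sizes: a tiny corpus, a mid-sized one, a web-scale one.
for n in (50, 10_000, 1_000_000):
    print(f"{n:>9,} docs -> {pairwise_comparisons(n):,} pairs")
# →        50 docs -> 1,225 pairs
# →    10,000 docs -> 49,995,000 pairs
# → 1,000,000 docs -> 499,999,500,000 pairs
```

At 50 documents the pair count is small enough to review by hand; at a million documents it is roughly half a trillion comparisons, which is why an approximate index becomes attractive.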

Strengths & limitations

Strengths

Scales to very large corpora: MinHash–LSH reduces the computational cost from quadratic to near-linear, making it practical for millions of documents.
Catches near-duplicates as well as exact copies, handling minor edits, reformatted content, and mirrored pages that byte-level hashing would miss.
Directly improves downstream model quality: Lee et al. (2022) demonstrated that deduplicating training data produces better language models with less memorisation of repeated content.
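The scaling claim rests on LSH banding: each MinHash signature is cut into bands, and only documents that share at least one identical band are compared further. The following is a minimal standard-library sketch of that idea, not a production implementation. The helper names, the 3-token shingles, the 64 hash functions, and the 16 bands are all assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Set of k-token shingles after lowercasing and whitespace splitting."""
    tokens = text.lower().split()
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def _hash(seed, item):
    """64-bit hash of item; the seed selects one of many hash functions."""
    digest = hashlib.blake2b(
        item.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(shingle_set, num_perm=64):
    """One minimum per seeded hash function."""
    return tuple(min(_hash(seed, s) for s in shingle_set) for seed in range(num_perm))


def lsh_candidate_pairs(signatures, bands=16):
    """Pairs of document ids that share at least one identical band."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical toy corpus: doc2 is a reformatted copy of doc1, doc3 is unrelated.
docs = {
    "doc1": "Breaking news: the river flooded the old town overnight",
    "doc2": "breaking news:   the river flooded the old town overnight",
    "doc3": "A recipe for lentil soup with cumin and lemon",
}
signatures = {doc_id: minhash_signature(shingles(text)) for doc_id, text in docs.items()}
print(sorted(lsh_candidate_pairs(signatures)))
# → [('doc1', 'doc2')]
```

With b bands of r rows each, the similarity at which a pair becomes more likely than not to collide is roughly (1/b)^(1/r); for the 16 bands of 4 rows assumed here that is about 0.5. Candidate pairs are normally re-checked against the real threshold before anything is removed.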

Limitations

Requires a user-defined similarity threshold; the right value depends on the corpus and the downstream task, and a poor choice either over-removes distinct documents or leaves too many near-duplicates.
MinHash operates on sets of shingles, so it ignores how often a shingle occurs, captures word order only within the shingle width, and does not capture semantic similarity. With single-token shingles, two documents with the same words in different arrangements receive the same signature.
For very small corpora (under ~50 documents) the overhead of building an LSH index is unnecessary; simpler exact matching or manual review is more appropriate.
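The word-order limitation can be seen on a two-sentence toy example. This sketch compares exact Jaccard similarity on single tokens with Jaccard on 2-token shingles; the sentences and the function names are made up for illustration.

```python
def token_set(text):
    """Set of lowercase tokens; order and repetition are discarded."""
    return set(text.lower().split())


def shingle_set(text, k=2):
    """Set of k-token shingles; order is kept only inside each shingle."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


doc_a = "the dog bit the man"
doc_b = "the man bit the dog"

print(jaccard(token_set(doc_a), token_set(doc_b)))      # → 1.0
print(jaccard(shingle_set(doc_a), shingle_set(doc_b)))  # → 0.6
```

On single tokens the two sentences are indistinguishable; with 2-token shingles they share three of five distinct shingles. Wider shingles recover more word order, but no shingle width makes the comparison semantic.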

Frequently asked

What is Jaccard similarity and why does it matter for deduplication?

Jaccard similarity measures the overlap between two sets — in this context, the sets of token shingles of two documents. It equals the size of the intersection divided by the size of the union. Two documents with Jaccard similarity above the chosen threshold are treated as near-duplicates. MinHash provides an unbiased estimator of Jaccard similarity without comparing the full token sets, which is what makes large-scale deduplication tractable.
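A minimal sketch of the two quantities described above, using only the standard library: exact Jaccard similarity on shingle sets, and a MinHash estimate computed from fixed-length signatures. The 3-token shingles, the 256 hash functions, and the seeded BLAKE2b hashing scheme are assumptions made for this example, not part of the method's definition.

```python
import hashlib


def shingles(text, k=3):
    """Set of k-token shingles after lowercasing and whitespace splitting."""
    tokens = text.lower().split()
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity: |A intersect B| / |A union B|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def minhash_signature(items, num_perm=256):
    """For each seeded hash function, keep the smallest hash over the set."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in items
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions where the two minima agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


a = shingles("the quick brown fox jumps over the lazy dog")
b = shingles("the quick brown fox leaps over the lazy dog")

print("exact:", jaccard(a, b))  # → exact: 0.4
print("estimate:", estimate_jaccard(minhash_signature(a), minhash_signature(b)))
```

The estimate varies with the hash functions used and tightens as the signature grows; the point is that the signatures have a fixed length regardless of how long the documents are.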

How do I choose the similarity threshold?

There is no universally correct value. A common starting point for web text is a Jaccard similarity of 0.5 (the shared shingles make up half of the combined shingle set), but the right threshold depends on your corpus and task. Inspect a sample of document pairs at candidate threshold values: pairs that a human judge considers duplicates should fall above the threshold, and genuinely distinct documents should fall below it. Adjust until the false-positive and false-negative rates are acceptable for your use case.
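One way to run that inspection is to hand-label a sample of pairs and count the errors each candidate threshold would produce. The sketch below assumes you already have a similarity score and a human judgement for each sampled pair; the scores and labels shown are invented purely to illustrate the bookkeeping.

```python
def error_counts(labelled_pairs, threshold):
    """Return (false_positives, false_negatives) at the given threshold.

    labelled_pairs: iterable of (similarity, is_duplicate) tuples, where
    is_duplicate is the human judgement for that pair.
    """
    false_pos = sum(1 for sim, dup in labelled_pairs if sim >= threshold and not dup)
    false_neg = sum(1 for sim, dup in labelled_pairs if sim < threshold and dup)
    return false_pos, false_neg


# Hypothetical hand-labelled sample: (estimated similarity, judged duplicate?)
sample = [
    (0.95, True), (0.82, True), (0.61, True), (0.55, False),
    (0.47, True), (0.30, False), (0.12, False),
]

for threshold in (0.4, 0.5, 0.6, 0.7, 0.8):
    fp, fn = error_counts(sample, threshold)
    print(f"threshold {threshold:.1f}: {fp} false positives, {fn} false negatives")
```

Raising the threshold trades false positives (distinct documents removed) for false negatives (duplicates left in); which side matters more depends on the downstream task.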

Does deduplication remove all copies or keep one?

Deduplication identifies clusters of near-duplicate documents and retains one canonical representative per cluster, discarding or tagging the rest. Which copy is kept — the longest, the earliest, the highest-quality — depends on the resolution policy you define. The pipeline flags duplicates; the retention decision is yours.
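The clustering and retention steps can be sketched as follows, assuming the duplicate pairs have already been found. Pairs are merged into clusters with a small union-find structure, and a "keep the longest" policy picks the representative. Both the transitive merging and the policy are assumptions made for this example; other resolution policies plug in the same way.

```python
def cluster_duplicates(doc_ids, duplicate_pairs):
    """Group documents into clusters by merging every flagged pair (union-find)."""
    parent = {doc_id: doc_id for doc_id in doc_ids}

    def find(x):
        # Walk to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in duplicate_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for doc_id in doc_ids:
        clusters.setdefault(find(doc_id), []).append(doc_id)
    return list(clusters.values())


def keep_longest(cluster, docs):
    """Example resolution policy: retain the longest document in the cluster."""
    return max(cluster, key=lambda doc_id: len(docs[doc_id]))


# Hypothetical corpus and flagged pairs.
docs = {
    "a": "short copy",
    "b": "a somewhat longer copy",
    "c": "the longest copy of them all, with extras",
    "d": "an unrelated document",
}
pairs = [("a", "b"), ("b", "c")]

clusters = cluster_duplicates(list(docs), pairs)
kept = [keep_longest(cluster, docs) for cluster in clusters]
print(clusters)  # → [['a', 'b', 'c'], ['d']]
print(kept)      # → ['c', 'd']
```

Note that merging pairs transitively can chain documents together: "a" and "c" end up in one cluster even though only a–b and b–c were flagged, which may or may not suit your corpus.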

Is deduplication only relevant for language model training?

No. Deduplication improves data quality for any downstream NLP task that assumes document independence — sentiment analysis, topic modelling, information retrieval, and corpus statistics all benefit from a non-redundant corpus. The importance for language model training was highlighted by Lee et al. (2022), but the problem predates that work by decades.

Sources

Broder, A.Z. (1997). On the Resemblance and Containment of Documents. Compression and Complexity of SEQUENCES.
Lee, K. et al. (2022). Deduplicating Training Data Makes Language Models Better. ACL 2022.

How to cite this page

ScholarGate. (2026, June 1). Text Deduplication (Near-Duplicate Detection). ScholarGate. https://scholargate.app/en/text-mining/text-deduplication

Related reference concepts

Text Clustering, Document Representation and Weighting, Corpus Linguistics and Web Corpora, Text Classification, Corpus Building and Curation, Latent Semantic and Topic Models
