Process / pipeline

Text Deduplication — Near-Duplicate Detection

Text deduplication is a corpus-quality pipeline that identifies and removes exact and near-duplicate documents from large text collections. It is grounded in Andrei Broder's 1997 work on the resemblance and containment of documents, and is widely used to improve dataset quality for training machine learning models, for search engine indexing, and for any downstream NLP task that assumes a non-redundant corpus.
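As a rough illustration of the idea behind resemblance, the sketch below computes the Jaccard overlap of word-level shingle sets and greedily drops any document that is too similar to one already kept. The function names, the shingle width `w=3`, and the thresholds are assumptions chosen for this example and are not taken from the source. The pairwise comparison is quadratic in the number of documents, so this is only suitable for small collections.

```python
def shingles(text, w=3):
    """Return the set of contiguous w-word shingles in a text (lowercased)."""
    tokens = text.lower().split()
    if len(tokens) < w:
        # Short texts: treat the whole token sequence as a single shingle.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(a, b, w=3):
    """Jaccard similarity of the two documents' shingle sets."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0  # two empty documents are treated as identical
    return len(sa & sb) / len(sa | sb)


def deduplicate(docs, threshold=0.8, w=3):
    """Keep a document only if it is below the threshold against all kept ones."""
    kept = []
    for doc in docs:
        if all(resemblance(doc, k, w) < threshold for k in kept):
            kept.append(doc)
    return kept


docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog",   # exact duplicate
    "the quick brown fox jumps over the lazy cat",   # near duplicate
    "an entirely different sentence about corpus quality",
]
unique_docs = deduplicate(docs, threshold=0.7)
```

With these hypothetical inputs, the first and third documents share 6 of 8 distinct shingles, so a threshold of 0.7 treats them as near-duplicates while a threshold of 0.8 would keep both.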

Sources

  1. Broder, A.Z. (1997). On the Resemblance and Containment of Documents. Compression and Complexity of SEQUENCES.
  2. Lee, K. et al. (2022). Deduplicating Training Data Makes Language Models Better. ACL 2022.

ScholarGate. Text Deduplication (Near-Duplicate Detection). Retrieved 2026-06-04 from https://scholargate.app/en/text-mining/text-deduplication