Process / pipeline

Text Deduplication (Near-Duplicate Detection)

Text deduplication is a corpus-quality pipeline that identifies and removes exact and near-duplicate documents from large text collections. Building on Andrei Broder's 1997 work on document resemblance, it is widely used to improve data quality for training machine learning models, for search engine indexing, and for any downstream NLP task that assumes a non-redundant corpus.
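To make the idea concrete, here is a minimal sketch of near-duplicate detection using word shingles and Jaccard resemblance, the set-overlap measure associated with Broder (1997). The function names, the shingle width `w = 3`, and the threshold `0.8` are illustrative assumptions, not values taken from this page. The pairwise loop is quadratic and only suitable for small collections; pipelines working on large corpora typically estimate the same resemblance with MinHash-style sketches instead.

```python
import re


def shingles(text: str, w: int = 3) -> set[tuple[str, ...]]:
    """Return the set of contiguous w-word shingles of a document."""
    # Lowercase and keep word characters only, so case and punctuation
    # differences do not count as differences (an assumed normalization).
    tokens = re.findall(r"\w+", text.lower())
    if len(tokens) < w:
        # Short documents become a single shingle (or none if empty).
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(a: str, b: str, w: int = 3) -> float:
    """Jaccard similarity of the two shingle sets: |A & B| / |A | B|."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0  # two empty documents are treated as identical
    return len(sa & sb) / len(sa | sb)


def deduplicate(docs: list[str], threshold: float = 0.8, w: int = 3) -> list[str]:
    """Keep a document only if it stays below the threshold
    against every document already kept (hypothetical helper)."""
    kept: list[str] = []
    for doc in docs:
        if all(resemblance(doc, k, w) < threshold for k in kept):
            kept.append(doc)
    return kept
```

Exact duplicates have resemblance 1.0 and are therefore removed by the same threshold test as near-duplicates; the first occurrence of each cluster is the one that survives.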


Related methods

BERT embeddings, Sentiment analysis, Text classification, TF-IDF, Topic modeling

Sources

Broder, A. Z. (1997). On the Resemblance and Containment of Documents. Compression and Complexity of SEQUENCES.
Lee, K. et al. (2022). Deduplicating Training Data Makes Language Models Better. ACL 2022.

How to cite this page

ScholarGate. (2026, June 1). Text Deduplication (Near-Duplicate Detection). ScholarGate. https://scholargate.app/sv/text-mining/text-deduplication
