ScholarGate

Compare methods

View the selected methods side by side; rows that differ are highlighted.

Field | Text deduplication | Topic modeling
Domain | Text mining | Deep learning
Family | Process / pipeline | Machine learning
Origin year | 1997 | 1999–2003
Author | Andrei Z. Broder (MinHash / Resemblance theory, 1997) | Hofmann, T. (pLSA, 1999); Blei, D. M., Ng, A. Y., & Jordan, M. I. (LDA, 2003)
Type | Text preprocessing / corpus quality pipeline | Unsupervised generative probabilistic model
Primary source | Broder, A. Z. (1997). On the Resemblance and Containment of Documents. Compression and Complexity of SEQUENCES. | Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993–1022.
Other names | near-duplicate detection, document deduplication, corpus deduplication, Metin Tekilleştirme (Near-Duplicate Detection) | Latent Semantic Analysis, probabilistic topic modeling, topic discovery, thematic modeling
Related | 5 | 5
Summary | Text deduplication is a corpus-quality pipeline that identifies and removes exact and near-duplicate documents from large text collections. Grounded in Andrei Broder's 1997 resemblance theory, it is widely used to improve dataset quality for machine learning model training, search engine indexing, and any downstream NLP task that assumes a non-redundant corpus. | Topic modeling is a family of unsupervised probabilistic techniques for discovering latent thematic structure in large text collections. By learning which words tend to co-occur, models such as Latent Dirichlet Allocation (LDA) automatically surface coherent topics, each represented as a distribution over the vocabulary, without requiring labelled data.
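
The resemblance idea behind near-duplicate detection can be illustrated with a minimal MinHash sketch. This is only an illustration under stated assumptions, not the pipeline the page describes: the function names (`shingles`, `jaccard`, `minhash_signature`, `estimate_resemblance`), the choice of 3-word shingles, the 128 hash functions, and the use of salted BLAKE2b hashes in place of random permutations are all hypothetical choices made for this example.

```python
import hashlib
import re


def shingles(text, k=3):
    """Return the set of k-word shingles (contiguous word windows) in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Exact resemblance of two shingle sets: |A & B| / |A | B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _seeded_hash(shingle, seed):
    # Assumption: a salted 64-bit BLAKE2b digest stands in for one random
    # permutation of the shingle universe.
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(shingle_set, num_hashes=128):
    """For each seeded hash function, keep the minimum hash over the set."""
    if not shingle_set:
        raise ValueError("cannot build a signature for an empty shingle set")
    return [
        min(_seeded_hash(s, seed) for s in shingle_set)
        for seed in range(num_hashes)
    ]


def estimate_resemblance(sig_a, sig_b):
    """The fraction of agreeing signature positions estimates the Jaccard value."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


if __name__ == "__main__":
    doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
    doc_b = "the quick brown fox jumps over the lazy dog near the river"
    set_a, set_b = shingles(doc_a), shingles(doc_b)
    sig_a, sig_b = minhash_signature(set_a), minhash_signature(set_b)
    print("exact:", jaccard(set_a, set_b))
    print("estimated:", estimate_resemblance(sig_a, sig_b))
```

In a deduplication setting, pairs whose estimated resemblance exceeds a chosen threshold would be treated as near-duplicates; the threshold itself is a tuning decision that this sketch does not settle.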
ScholarGate dataset
Text deduplication: v1, 2 sources, PUBLISHED
Topic modeling: v1, 2 sources, PUBLISHED


ScholarGate. Compare methods: Text Deduplication · Topic Modeling. Retrieved 2026-06-15 from https://scholargate.app/lv/compare