Multilingual Doc2Vec

Multilingual Paragraph Vector (Doc2Vec) Model · Also known as: multilingual paragraph vector, cross-lingual Doc2Vec, multilingual PV-DM, multilingual PV-DBOW

Multilingual Doc2Vec extends the Paragraph Vector framework of Le and Mikolov (2014) to two or more languages, training document-level embeddings in a shared or aligned vector space so that semantically similar documents — regardless of their language — end up close together. It enables cross-lingual document retrieval, classification, and clustering without requiring parallel corpora or translation.

When to use it

Choose Multilingual Doc2Vec when you have document-level tasks (topic classification, clustering, cross-lingual retrieval) spanning two or more languages and lack the compute or data for large pretrained multilingual transformers. It works well with moderate-sized corpora (thousands to hundreds of thousands of documents) and produces compact, fixed-size vectors. Avoid it when sentence-level alignment matters more than document-level semantics; use Multilingual Sentence Embeddings instead. Also avoid it when labelled data is abundant and a fine-tuned multilingual BERT or XLM-R would fit within your compute budget, as those models consistently outperform Doc2Vec-based baselines on classification.

Strengths & limitations

Strengths

Produces fixed-length document vectors usable directly by any downstream classifier or clustering algorithm.
No parallel corpus required — monolingual text from each language suffices when alignment is done via a small lexicon.
Memory-efficient compared with large transformer-based models; inference is fast at document scale.
Unsupervised training means no labelled data is needed to learn the embeddings.
Cross-lingual transfer is straightforward: train a classifier on one language and apply to others.
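
The cross-lingual transfer mentioned in the last point can be illustrated with a minimal sketch. It assumes the document vectors of both languages already sit in one aligned space; the toy vectors, the language names, and the small NumPy logistic regression are illustrative stand-ins, not part of the method itself.

```python
import numpy as np

def train_logreg(X, y, lr=0.5, steps=500):
    """Fit a binary logistic regression by batch gradient descent."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(class 1)
        grad = p - y                             # gradient of the log loss
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def predict(X, w, b):
    """Return 0/1 labels from the sign of the decision function."""
    return (np.asarray(X, dtype=float) @ w + b > 0).astype(int)

# Hypothetical aligned document vectors: two topics, 20 documents each.
rng = np.random.default_rng(0)
centres = np.array([[-2.0, 0.0, 0.0, 0.0], [2.0, 0.0, 0.0, 0.0]])
labels = np.repeat([0, 1], 20)

def toy_vectors():
    """Sample noisy vectors around the topic centres (stand-in for real embeddings)."""
    return centres[labels] + 0.3 * rng.normal(size=(40, 4))

X_en, X_fr = toy_vectors(), toy_vectors()

# Train on "English" vectors only, then score "French" vectors unchanged.
w, b = train_logreg(X_en, labels)
accuracy = float((predict(X_fr, w, b) == labels).mean())
```

With real embeddings the transfer accuracy depends mostly on how well the two spaces are aligned, so it is worth comparing it against a classifier trained and tested within the target language.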

Limitations

Representation quality falls well below fine-tuned multilingual transformers (e.g., mBERT, XLM-R) on most benchmarks.
The underlying word vectors are not contextualised: each word has a single representation regardless of the surrounding words.
Alignment quality degrades for typologically distant language pairs or when the anchor lexicon is small.
Training is sensitive to hyperparameters (vector size, window, epochs) and results vary across runs unless a fixed seed is set.

Frequently asked

Do I need parallel documents to train Multilingual Doc2Vec?

No. The core training uses monolingual text from each language independently. A small parallel lexicon (a few hundred word pairs) is sufficient for post-hoc alignment. Full parallel corpora improve alignment quality but are not required.
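
The answer above does not say how the post-hoc alignment is computed. One common choice is orthogonal Procrustes, sketched below as an assumption rather than as this page's prescribed method. The anchor matrices are hypothetical: row i of each holds the vector of the i-th lexicon word in the source and target model, and the sketch assumes word and document vectors of a model share one space (as in PV-DM).

```python
import numpy as np

def procrustes_map(src_anchor, tgt_anchor):
    """Return the orthogonal W minimising ||src_anchor @ W - tgt_anchor||_F."""
    u, _, vt = np.linalg.svd(src_anchor.T @ tgt_anchor)
    return u @ vt

# Hypothetical anchors: 300 lexicon pairs in an 8-dimensional space.
rng = np.random.default_rng(0)
src_anchor = rng.normal(size=(300, 8))
true_rotation, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # a random orthogonal matrix
tgt_anchor = src_anchor @ true_rotation

W = procrustes_map(src_anchor, tgt_anchor)

# Apply the same map to source-language document vectors.
src_docs = rng.normal(size=(5, 8))
aligned_docs = src_docs @ W
```

Restricting the map to be orthogonal preserves distances and angles inside the source space, so monolingual neighbourhoods are unchanged by the alignment.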

How does Multilingual Doc2Vec compare with multilingual BERT?

Multilingual BERT and XLM-R produce contextualised, subword-level representations and consistently outperform Doc2Vec on classification and retrieval benchmarks. Doc2Vec is faster to train and deploy, uses less memory, and remains a practical choice when compute is limited or when a simple fixed-vector representation is needed.

What vector dimensionality should I use?

Common choices are 100–300 dimensions. Larger vectors capture more nuance but require more data and compute. For small corpora (fewer than 10 000 documents) start with 100 and increase only if evaluation metrics improve.

PV-DM or PV-DBOW — which should I pick?

PV-DBOW is faster and often produces more consistent cross-lingual vectors because it trains the document vector on its own to predict words sampled from the document, with no context-word vectors as input. PV-DM can give richer representations for long documents. Combining both (concatenating their vectors) frequently gives the best downstream results.
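
Concatenating the two variants is a one-line operation; a minimal sketch follows. The L2 normalisation of each part before concatenation is an added design choice (so neither model dominates by scale), not something this page specifies, and the input vectors are made up.

```python
import numpy as np

def combine(dm_vec, dbow_vec):
    """L2-normalise each part, then concatenate PV-DM and PV-DBOW vectors."""
    parts = []
    for vec in (dm_vec, dbow_vec):
        vec = np.asarray(vec, dtype=float)
        norm = np.linalg.norm(vec)
        parts.append(vec / norm if norm > 0 else vec)  # leave zero vectors as-is
    return np.concatenate(parts)

# Hypothetical vectors for one document from the two models.
combined = combine([3.0, 4.0], [0.0, 2.0, 0.0])
```

The combined vector has the sum of the two dimensionalities, so downstream classifiers see a larger input than either model alone would give.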

How do I evaluate the quality of the multilingual embedding space?

Take a held-out set of known parallel document pairs and, for each source-language document, retrieve its nearest neighbours among the target-language documents; report precision@1 and precision@5. Additionally, run a cross-lingual classification experiment: train a logistic regression on one language's labelled data, test it on another language, and compare against a monolingual baseline.
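
The retrieval metric can be sketched as follows. The sketch assumes cosine similarity as the nearest-neighbour measure and that row i of the source matrix is the known translation of row i of the target matrix; both matrices here are toy values.

```python
import numpy as np

def precision_at_k(src, tgt, k):
    """Fraction of source documents whose true pair is among the k nearest targets."""
    src = np.asarray(src, dtype=float)
    tgt = np.asarray(tgt, dtype=float)
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    sims = src @ tgt.T                            # cosine similarity matrix
    top_k = np.argsort(-sims, axis=1)[:, :k]      # indices of the k most similar targets
    gold = np.arange(len(src))[:, None]           # true pair of row i is row i
    return float((top_k == gold).any(axis=1).mean())

# Hypothetical aligned vectors for three parallel document pairs.
src_docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
tgt_docs = np.array([[1.0, 0.1], [0.1, 1.0], [1.0, -0.2]])

p_at_1 = precision_at_k(src_docs, tgt_docs, k=1)
p_at_3 = precision_at_k(src_docs, tgt_docs, k=3)
```

In this toy case the third pair is deliberately misaligned, so it only counts as a hit once k covers all three candidates.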

Sources

Le, Q., & Mikolov, T. (2014). Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning (ICML), PMLR 32(2), 1188–1196.
Multilingualism. Wikipedia.

How to cite this page

ScholarGate. (2026, June 3). Multilingual Paragraph Vector (Doc2Vec) Model. ScholarGate. https://scholargate.app/en/deep-learning/multilingual-doc2vec

Related reference concepts

Neural Language Models and Word Embeddings
Text Classification
Text Clustering
Text Classification and Sentiment Analysis
Machine Translation
Text Representation and Classification
