FastText

FastText: Subword-Level Word Embeddings and Efficient Text Classification · Also known as: fastText, fast text, subword embedding, character n-gram embedding, bag of tricks text classification

FastText is a word embedding and text classification framework developed by Facebook AI Research (Joulin, Bojanowski, Grave, and Mikolov, 2016–2017). It represents each word as the sum of its character n-gram vectors, which lets it build meaningful representations for unseen and morphologically rich words. Its linear text classifier reaches near state-of-the-art accuracy while training orders of magnitude faster than deep neural network alternatives.

When to use it

FastText is appropriate when you need efficient word representations that generalise to morphologically rich languages or domains with many rare and unseen words, and when training speed is a constraint. For text classification it excels on medium-to-large corpora (tens of thousands of examples or more) with moderate to high numbers of categories. It suits sentiment analysis, language identification, topic labelling, and multi-label tagging. Assumptions: input is tokenised text; the embedding dimension and n-gram range are treated as hyperparameters. For very short texts or extremely small datasets, simpler bag-of-words baselines may perform equally well with less overhead. When contextual representations are critical — e.g., disambiguating polysemous words — consider Transformer-based models such as BERT.
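As a rough illustration of the "tokenised text" assumption, the sketch below prepares labelled examples in the one-example-per-line layout that the fastText library reads for supervised training, with each label carrying the library's default `__label__` prefix. The helper names (`to_fasttext_line`, `write_training_file`, `train_classifier`) are hypothetical, the preprocessing is deliberately crude (lowercasing and whitespace splitting only), and the hyperparameter values are placeholders to tune rather than recommendations. Labels are assumed to contain no whitespace.

```python
def to_fasttext_line(labels, text, prefix="__label__"):
    """Format one example as '<prefix>label ... token token ...' on a single line."""
    tags = " ".join(f"{prefix}{label}" for label in labels)
    tokens = " ".join(text.lower().split())  # crude tokenisation: lowercase, split on whitespace
    return f"{tags} {tokens}"


def write_training_file(examples, path):
    """Write (labels, text) pairs to `path`, one example per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for labels, text in examples:
            handle.write(to_fasttext_line(labels, text) + "\n")


def train_classifier(path):
    """Train a supervised model; requires the optional `fasttext` package."""
    import fasttext  # imported lazily so the helpers above work without it
    # Illustrative settings only: wordNgrams=2 adds word bigram features.
    return fasttext.train_supervised(input=path, epoch=25, lr=0.5, wordNgrams=2)


# A multi-label example: two tags on one document.
print(to_fasttext_line(["sports", "news"], "Cup final  tonight"))
# prints: __label__sports __label__news cup final tonight
```

Keeping the formatting step separate from training makes it easy to check the training file by eye before any model is fitted.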

Strengths & limitations

Strengths

Handles out-of-vocabulary words naturally by composing subword fragment vectors, making it robust for morphologically rich languages (Finnish, Turkish, Arabic, etc.).
Training is extremely fast: millions of words per second on a standard CPU, enabling iteration on large corpora without GPU infrastructure.
Text classification accuracy is competitive with deep convolutional networks at a fraction of the computational cost.
Pre-trained vectors are publicly available for 157 languages, enabling immediate transfer without local training.
The linear classifier is easy to inspect and its decision boundary is transparent compared to deep architectures.

Limitations

Word representations are context-free: a word receives the same vector regardless of its meaning in a given sentence, unlike contextual models such as BERT or GPT.
The mean-pooling document representation loses word order and syntactic structure, which can hurt performance on tasks requiring compositional understanding.
Character n-gram vocabulary can become very large for morphologically rich languages, increasing memory footprint.
Performance on very small datasets (fewer than a few hundred documents) may not surpass simple tf-idf plus logistic regression baselines.
The model is less suited to tasks requiring fine-grained contextual disambiguation, such as named entity recognition or coreference resolution.
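The word-order limitation can be seen in a small sketch. It assumes toy random vectors in place of trained embeddings, and the helper names (`vector`, `mean_pool`, `with_bigrams`) are hypothetical. Two sentences with the same words in a different order pool to the same document vector; adding word bigram features, as the classifier can optionally do, separates them again.

```python
import math
import random

DIM = 4                  # toy embedding size, chosen for illustration
rng = random.Random(1)   # fixed seed so the sketch is repeatable
table = {}               # hypothetical feature -> vector lookup


def vector(feature):
    """Return the feature's vector, creating a random one on first use."""
    if feature not in table:
        table[feature] = [rng.uniform(-1.0, 1.0) for _ in range(DIM)]
    return table[feature]


def mean_pool(features):
    """Average the feature vectors component by component."""
    vecs = [vector(f) for f in features]
    return [sum(column) / len(vecs) for column in zip(*vecs)]


def with_bigrams(tokens):
    """Append word bigrams such as 'dog_bites' to the unigram features."""
    return tokens + [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]


def same(u, v):
    """Compare two vectors up to floating-point rounding."""
    return all(math.isclose(x, y, abs_tol=1e-12) for x, y in zip(u, v))


a = "dog bites man".split()
b = "man bites dog".split()

print(same(mean_pool(a), mean_pool(b)))                              # unigrams only
print(same(mean_pool(with_bigrams(a)), mean_pool(with_bigrams(b))))  # with bigrams
```

Bigrams recover only local order, so longer-range structure is still lost after pooling.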

Frequently asked

How does FastText differ from Word2Vec?

Word2Vec assigns one vector per whole-word token and cannot represent words absent from training. FastText decomposes every word into character n-grams and sums their vectors, so it produces an embedding for any word — including misspellings and neologisms — by composing fragment representations. This makes FastText substantially more robust on morphologically rich languages and noisy text.
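The composition step can be sketched in a few lines of Python. This is a toy under stated assumptions, not the library's implementation: vectors are random rather than trained, n-grams are stored in a plain dictionary, unseen n-grams are simply skipped, and the names `register` and `embed` are hypothetical.

```python
import random

DIM = 8                  # toy embedding size, chosen for illustration
rng = random.Random(0)   # fixed seed so the sketch is repeatable
ngram_vectors = {}       # hypothetical n-gram -> vector table


def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of `word`, wrapped in the boundary markers < and >."""
    token = f"<{word}>"
    return [token[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(token) - n + 1)]


def register(words):
    """Stand-in for training: give every n-gram of the known words a random vector."""
    for word in words:
        for gram in char_ngrams(word):
            if gram not in ngram_vectors:
                ngram_vectors[gram] = [rng.uniform(-1.0, 1.0) for _ in range(DIM)]


def embed(word):
    """Sum the vectors of the word's known n-grams; return (vector, number matched)."""
    total = [0.0] * DIM
    matched = 0
    for gram in char_ngrams(word):
        vec = ngram_vectors.get(gram)
        if vec is not None:
            total = [t + v for t, v in zip(total, vec)]
            matched += 1
    return total, matched


register(["walking", "talked"])
vec, matched = embed("walked")   # never seen as a whole word
print(matched, "known n-grams contribute to the vector for 'walked'")
```

Because "walked" shares fragments such as "<wal" and "ked>" with the registered words, it still receives a non-zero vector, whereas a whole-word lookup table would have no entry for it.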

Should I use FastText or BERT for my NLP task?

FastText is preferable when training speed, low memory, and out-of-vocabulary robustness are priorities, or when GPU resources are unavailable. BERT and Transformer-based models are preferable when the task requires contextual word-sense disambiguation, when fine-grained linguistic structure matters, or when a pre-trained model can be fine-tuned on a moderately sized labelled dataset and inference latency is acceptable.

What n-gram range should I use?

The original papers use character n-grams of length 3–6 for word embeddings and word n-grams up to length 2 (bigrams) for text classification. For agglutinative languages with long morphemes, raising the maximum beyond 6 (for example to 7 or 8) may improve coverage. In practice, treat the minimum and maximum n-gram lengths as hyperparameters and tune them on a held-out validation set.
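To get a feel for what the character n-gram range controls, the sketch below enumerates the n-grams of a word for a few (minimum, maximum) settings. The function name and the example words are illustrative assumptions; the boundary markers `<` and `>` follow the convention in the subword paper.

```python
def char_ngrams(word, min_n, max_n):
    """All character n-grams of '<word>' with lengths from min_n to max_n."""
    token = f"<{word}>"
    return [token[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(token) - n + 1)]


# Trigrams only, for a short word.
print(char_ngrams("where", 3, 3))  # ['<wh', 'whe', 'her', 'ere', 're>']

# A long inflected word, chosen for illustration, under different ranges.
word = "talossanikin"
for min_n, max_n in [(3, 6), (3, 8), (2, 5)]:
    grams = char_ngrams(word, min_n, max_n)
    longest = max(grams, key=len)
    print(f"min={min_n} max={max_n}: {len(grams)} n-grams, longest {longest!r}")
```

A wider range yields more and longer fragments per word, which is why the range is worth validating rather than fixing in advance.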

Can FastText handle multiple languages in one model?

Yes. Because the model operates at the character n-gram level, a single model trained on multilingual text can share subword representations across languages that use the same script. Facebook has released pre-trained FastText vectors for 157 languages, trained on Common Crawl and Wikipedia, and a separate set of aligned vectors for 44 languages, trained on Wikipedia, that places the languages in a common vector space for cross-lingual transfer learning.

Sources

Joulin, A., Grave, E., Bojanowski, P. & Mikolov, T. (2017). Bag of Tricks for Efficient Text Classification. In Proceedings of EACL 2017, Short Papers, pp. 427–431. ACL. DOI: 10.18653/v1/e17-2068 ↗
Bojanowski, P., Grave, E., Joulin, A. & Mikolov, T. (2017). Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics, 5, 135–146. DOI: 10.1162/tacl_a_00051 ↗
Goldberg, Y. (2017). Neural Network Methods for Natural Language Processing. Morgan & Claypool Publishers. ISBN: 978-1-62705-298-6

How to cite this page

ScholarGate. (2026, June 3). FastText: Subword-Level Word Embeddings and Efficient Text Classification. ScholarGate. https://scholargate.app/en/deep-learning/fasttext
