Text Classification
Text classification automatically assigns documents to one or more predefined categories using models learned from labeled examples.
Definition
Text classification is the task of assigning a document to one or more categories from a predefined set, performed by a model trained on documents whose category labels are known, using the document's term-based representation as input features.
Scope
This topic covers supervised categorization of text: the problem formulation as single-label, multi-label, or hierarchical classification; representative learners applied to text such as naive Bayes, the Rocchio centroid method, k-nearest-neighbors, and support vector machines; feature selection for high-dimensional text; and the evaluation of classifiers. It treats classification as used in retrieval contexts such as filtering and routing, drawing on machine learning but focusing on text-specific considerations rather than general classifier theory.
Core questions
- How is text categorization formulated as single-label, multi-label, or hierarchical classification?
- Which learning algorithms work well on high-dimensional, sparse text features?
- How are informative features selected from a large vocabulary?
- Why are support vector machines particularly well suited to text?
- How are text classifiers evaluated, and how is class imbalance handled?
Key concepts
- supervised categorization
- single-label vs. multi-label classification
- naive Bayes
- Rocchio / centroid classification
- k-nearest-neighbors
- support vector machines
- feature selection
- classifier evaluation (precision, recall, F1)
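As a minimal sketch of the evaluation measures named in the last item, the following computes precision, recall, and F1 for one category treated as the positive class. The function name `precision_recall_f1` and the toy label lists are illustrative assumptions, not part of any particular library.

```python
def precision_recall_f1(gold, predicted, positive):
    """Compute precision, recall, and F1 for one category treated as positive."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if p == positive and g == positive)  # true positives
    fp = sum(1 for g, p in pairs if p == positive and g != positive)  # false positives
    fn = sum(1 for g, p in pairs if p != positive and g == positive)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical gold labels and classifier output for six documents.
gold = ["spam", "ham", "spam", "ham", "ham", "spam"]
predicted = ["spam", "spam", "ham", "ham", "ham", "spam"]
p, r, f = precision_recall_f1(gold, predicted, "spam")
```

With these toy labels the positive class has two true positives, one false positive, and one false negative, so precision, recall, and F1 all equal 2/3. Under class imbalance these per-category figures are usually more informative than plain accuracy.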
Key theories
- Naive Bayes text classification
- Modeling each document's terms as conditionally independent given the class yields a simple, fast probabilistic classifier that, despite its strong independence assumption, performs competitively on many text tasks.
- Support vector machines for text
- Because text yields a high-dimensional feature space in which document vectors are sparse, few features are irrelevant, and most categorization problems are linearly separable, large-margin support vector machines achieve strong text-categorization accuracy with little feature engineering.
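The naive Bayes idea described above can be sketched in a few lines. This is one possible reading, assuming a multinomial event model with add-one (Laplace) smoothing and pre-tokenized documents. The function names `train_naive_bayes` and `classify` and the toy training set are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs):
    """Collect class and per-class term counts from (tokens, label) pairs."""
    class_counts = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, term_counts, vocab


def classify(model, tokens):
    """Return the class with the highest log prior plus summed log term likelihoods."""
    class_counts, term_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)
        total_terms = sum(term_counts[label].values())
        for term in tokens:
            if term in vocab:  # terms unseen in training are ignored
                # Add-one smoothing avoids zero probabilities for unseen class/term pairs.
                smoothed = (term_counts[label][term] + 1) / (total_terms + len(vocab))
                score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical four-document training set.
training = [
    (["cheap", "pills", "buy"], "spam"),
    (["buy", "cheap", "watches"], "spam"),
    (["meeting", "agenda", "monday"], "ham"),
    (["project", "meeting", "notes"], "ham"),
]
model = train_naive_bayes(training)
label_a = classify(model, ["buy", "cheap", "pills"])
label_b = classify(model, ["meeting", "notes"])
```

The independence assumption appears as the plain sum of per-term log likelihoods: each term contributes to the score without reference to the other terms in the document. Working in log space also avoids numerical underflow on long documents.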
Practical relevance
Text classification powers email spam filtering, content moderation, topic routing and tagging, sentiment analysis, and the categorization that supports faceted search and filtering. Within retrieval it underlies document filtering and routing systems that deliver documents matching standing information needs.
History
Through the 1980s, automatic text categorization relied mainly on hand-built rule systems; in the 1990s the field shifted to machine learning. Joachims's 1998 demonstration that support vector machines excel on text and Sebastiani's 2002 survey together established the modern supervised paradigm. The same task now serves as a standard benchmark for representation-learning and neural text models.
Key figures
- Fabrizio Sebastiani
- Thorsten Joachims
- Yiming Yang
Related topics
Seminal works
- sebastiani2002
- joachims1998
- manning2008
Frequently asked questions
- Why does naive Bayes work well despite its unrealistic independence assumption?
- Even though terms are not truly independent, the naive Bayes decision often lands on the correct class because the assumption mainly distorts probability estimates rather than the relative ordering of classes. It is also fast and robust with limited data, making it a strong baseline.
- What is the difference between single-label and multi-label classification?
- Single-label classification assigns each document to exactly one category, whereas multi-label classification allows a document to belong to several categories at once, as when an article is tagged with multiple topics. Multi-label tasks need methods and metrics that handle overlapping labels.
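The single-label and multi-label decisions can be contrasted with a short sketch. It assumes a classifier has already produced one score per category. The scores, the 0.5 threshold, and both function names are hypothetical choices made for illustration.

```python
def assign_single_label(scores):
    """Single-label: return exactly one category, the highest-scoring one."""
    return max(scores, key=scores.get)


def assign_multi_label(scores, threshold=0.5):
    """Multi-label: return every category whose score reaches the threshold."""
    return sorted(c for c, s in scores.items() if s >= threshold)


# Hypothetical per-category scores for one news article.
scores = {"politics": 0.81, "economy": 0.64, "sports": 0.05}
single = assign_single_label(scores)  # "politics"
multi = assign_multi_label(scores)    # ["economy", "politics"]
```

The single-label rule forces the categories to compete, while the multi-label rule makes an independent yes/no decision per category, so a document may receive several labels or none.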