Information Extraction
Turning unstructured text into structured data: detecting named entities, the relations between them, and the events they participate in, so documents can be queried and aggregated.
Definition
Information extraction is the automatic identification of structured facts (entities, relations, and events) in unstructured natural-language text.
Scope
Covers the extraction of structured information from text: named-entity recognition, relation extraction, event extraction, temporal extraction, and template filling. It addresses both rule-based and learned approaches, along with the evaluation traditions established by shared tasks. The underlying sequence-labeling models are covered in the parsing area.
Core questions
- How are named entities detected and classified in text?
- How are relations and events between entities extracted?
- How did shared evaluations shape the task and its metrics?
- How do rule-based and learned extraction methods compare?
Key concepts
- named-entity recognition
- relation extraction
- event extraction
- template filling
- conditional random field
- distant supervision
- ontology population
- evaluation campaign
Key theories
- Template-filling information extraction
- Framing extraction as filling structured templates with entities and relations found in text, the formulation developed in the Message Understanding Conferences.
- Sequence-labeling extraction
- Casting entity and span extraction as sequence labeling with models such as conditional random fields and neural taggers over tokens.
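As an illustration of the sequence-labeling view, the sketch below assumes the common BIO tagging scheme (B- opens an entity, I- continues it, O is outside any entity) and converts a tag sequence into typed spans. The tags are written by hand here; in practice they would come from a tagger such as a conditional random field or a neural model. The function name `bio_to_spans`, the example sentence, and the lenient handling of a stray I- tag are assumptions made for this example.

```python
from typing import List, Optional, Tuple

# (start token index, end token index exclusive, entity type)
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Convert a BIO tag sequence into a list of typed token spans."""
    spans: List[Span] = []
    start: Optional[int] = None
    label: Optional[str] = None
    for i, tag in enumerate(tags):
        # An I- tag continues the open span only if its type matches.
        if tag.startswith("I-") and label == tag[2:]:
            continue
        # Any other tag closes the open span, if there is one.
        if label is not None:
            spans.append((start, i, label))
            start, label = None, None
        # B- opens a new span; a stray I- is treated like B- (a lenient choice).
        if tag.startswith("B-") or tag.startswith("I-"):
            start, label = i, tag[2:]
    # Close a span that runs to the end of the sentence.
    if label is not None:
        spans.append((start, len(tags), label))
    return spans


# Invented example sentence with hand-written tags.
tokens = ["Ada", "Lovelace", "joined", "Acme", "Corp", "in", "London", "."]
tags = ["B-PER", "I-PER", "O", "B-ORG", "I-ORG", "O", "B-LOC", "O"]

for start, end, label in bio_to_spans(tags):
    print(label, " ".join(tokens[start:end]))
```

Decoding tags into spans is also what makes entity-level evaluation possible, since predicted spans can then be compared with gold spans directly.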
History
Information extraction was shaped by the Message Understanding Conferences (MUC) of the late 1980s and 1990s, which defined template-filling and, later, named-entity tasks together with their evaluation. The field then moved from hand-built patterns to statistical sequence models such as conditional random fields, and later to neural and distantly supervised extraction at scale.
Debates
- Supervised versus distantly supervised extraction
- Whether to rely on costly hand-labeled data or to bootstrap from knowledge bases via distant supervision, which scales but introduces noisy labels.
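The trade-off can be seen in a minimal sketch of distant supervision, under the usual assumption that any sentence mentioning both entities of a known fact expresses that fact. The knowledge base `KB`, the function `distant_label`, and the sentences are all invented for illustration, and plain substring matching stands in for real entity linking.

```python
from typing import Dict, List, Tuple

# Hypothetical toy knowledge base: (subject, object) -> relation.
KB: Dict[Tuple[str, str], str] = {
    ("Marie Curie", "Warsaw"): "born_in",
    ("Acme Corp", "Springfield"): "headquartered_in",
}

# (sentence, subject, object, relation)
Example = Tuple[str, str, str, str]


def distant_label(sentences: List[str], kb: Dict[Tuple[str, str], str]) -> List[Example]:
    """Label every sentence that mentions both arguments of a known fact."""
    labeled: List[Example] = []
    for sent in sentences:
        for (subj, obj), rel in kb.items():
            # Substring matching is a simplification of entity linking.
            if subj in sent and obj in sent:
                labeled.append((sent, subj, obj, rel))
    return labeled


# Invented sentences for the example.
sentences = [
    "Marie Curie was born in Warsaw.",
    "Marie Curie gave a lecture in Warsaw.",  # mentions the pair, not the relation
    "Acme Corp opened a new store.",          # only one argument, so no label
]

labeled = distant_label(sentences, KB)
for sent, subj, obj, rel in labeled:
    print(f"{rel}({subj}, {obj}) <- {sent}")
```

Both sentences about Warsaw receive the `born_in` label although only the first states it. The second is the kind of noisy label that distant supervision trades for scale.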
Key figures
- Ralph Grishman
- Beth Sundheim
- Andrew McCallum
Related topics
Seminal works
- grishman1996
- lafferty2001
Frequently asked questions
- What is named-entity recognition?
- Named-entity recognition finds spans of text that name entities, such as people, organizations, and locations, and assigns each span a type. It is usually the first step in extracting relations and events from documents.
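To make "finds and classifies" concrete, here is a minimal rule-based sketch that looks names up in a small gazetteer and returns character offsets with a type. The gazetteer entries, the type names, and the function `find_entities` are invented for this example. A learned tagger would replace the lookup, but the output shape (span plus type) would be similar.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical gazetteer: surface form -> entity type.
GAZETTEER: Dict[str, str] = {
    "Ada Lovelace": "PERSON",
    "Acme Corp": "ORGANIZATION",
    "London": "LOCATION",
}

# (start char offset, end char offset exclusive, surface text, entity type)
Entity = Tuple[int, int, str, str]


def find_entities(text: str, gazetteer: Dict[str, str]) -> List[Entity]:
    """Return gazetteer matches in text, in order of appearance."""
    if not gazetteer:
        return []
    # Longest names first, so a longer name wins over a name it contains.
    names = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b")
    return [
        (m.start(), m.end(), m.group(0), gazetteer[m.group(0)])
        for m in pattern.finditer(text)
    ]


text = "Ada Lovelace joined Acme Corp in London."
for start, end, surface, label in find_entities(text, GAZETTEER):
    print(start, end, label, surface)
```

Exact lookup of this kind misses unseen names and cannot resolve ambiguous ones, which is one reason learned sequence models came to replace hand-built lists and patterns.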