ScholarGate

Information Extraction

Information extraction is the task of automatically identifying structured information—entities, relations, and events—within unstructured natural-language text.

Definition

Information extraction converts unstructured text into structured representations by detecting and classifying mentions of entities, the relations among them, and the events they participate in, often to populate a database or knowledge base.
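
As a rough illustration of what a "structured representation" can look like, the sketch below stores entity mentions and one relation as plain records. The class names (`Entity`, `Relation`), the helper `mention`, the type labels, and the example sentence are all assumptions chosen for illustration; they are not taken from any particular toolkit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    text: str   # surface form of the mention
    type: str   # illustrative type label, e.g. "PERSON"
    start: int  # character offset where the mention begins (inclusive)
    end: int    # character offset where the mention ends (exclusive)


@dataclass(frozen=True)
class Relation:
    head: Entity
    label: str  # illustrative relation name
    tail: Entity


def mention(sentence: str, surface: str, type_: str) -> Entity:
    """Locate the first occurrence of `surface` and wrap it as an Entity."""
    start = sentence.index(surface)
    return Entity(surface, type_, start, start + len(surface))


sentence = "Ada Lovelace worked with Charles Babbage in London."
ada = mention(sentence, "Ada Lovelace", "PERSON")
babbage = mention(sentence, "Charles Babbage", "PERSON")
london = mention(sentence, "London", "LOCATION")

# One structured fact that could be written to a database row.
record = Relation(head=ada, label="worked_with", tail=babbage)
```

Keeping character offsets alongside the surface text lets each extracted fact be traced back to the passage it came from.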

Scope

This topic covers the extraction of structured facts from text: named entity recognition, relation extraction, event extraction, coreference resolution, and the population of templates or knowledge bases. It addresses rule-based methods, statistical sequence labeling, and supervised and distantly supervised learning, as well as the evaluation of extraction by precision and recall. The general machine-learning methods used to train extractors belong to the machine-learning subfield; here the emphasis is on the extraction tasks and their linguistic challenges.
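
Distant supervision, mentioned above, can be sketched in a few lines: facts already in a knowledge base are matched against sentences, and any sentence containing both arguments of a fact is treated as a training example for that fact's relation. The function name `distant_labels`, the relation name `born_in`, and the toy sentences are assumptions for illustration, and plain substring matching stands in for real entity matching.

```python
def distant_labels(sentences, kb_facts):
    """Label sentences with relations from (head, relation, tail) facts.

    A sentence is labeled with a relation whenever it mentions both the
    head and the tail of a known fact. The labels are noisy by design.
    """
    examples = []
    for sentence in sentences:
        for head, relation, tail in kb_facts:
            if head in sentence and tail in sentence:
                examples.append((sentence, head, relation, tail))
    return examples


kb_facts = [("Marie Curie", "born_in", "Warsaw")]
sentences = [
    "Marie Curie was born in Warsaw.",
    "Marie Curie later returned to Warsaw.",  # mentions both, different meaning
    "Warsaw is the capital of Poland.",       # mentions only one argument
]

examples = distant_labels(sentences, kb_facts)
```

The second sentence receives the `born_in` label even though it does not express that relation, which shows the label noise that distantly supervised methods have to tolerate.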

Core questions

  • How are mentions of entities such as people, organizations, and locations detected and classified in text?
  • How are relations between entities identified and extracted?
  • How are events and their participants recognized, and how is coreference resolved?
  • How is extraction performance evaluated, and what trade-offs arise between precision and recall?

Key concepts

  • named entity recognition
  • relation extraction
  • event extraction
  • coreference resolution
  • BIO sequence labeling
  • template filling
  • knowledge base population
  • precision and recall

Key theories

Named entity recognition as sequence labeling
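Identifying entity mentions is commonly framed as labeling each token with a tag. In the BIO scheme, for example, B marks the first token of a mention, I marks a token inside a mention, and O marks a token outside any mention. Sequence models then exploit context to mark spans and their types.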
Relation and event extraction
Beyond entities, information extraction identifies how entities relate and what events occur, filling structured templates; this task-driven framing was crystallized by the Message Understanding Conferences.
Knowledge base population
Extracted entities and relations can be aggregated to build or extend a knowledge base, linking mentions to canonical entities and accumulating facts from large text collections.
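
The BIO framing of named entity recognition described above can be made concrete with a small decoder that turns per-token tags back into typed spans. This is a minimal sketch: the function name `bio_to_spans`, the tag names `PER` and `ORG`, and the lenient handling of a stray `I-` tag are assumptions for illustration, not a standard fixed by the scheme.

```python
def bio_to_spans(tokens, tags):
    """Convert parallel token and BIO-tag lists into entity spans.

    Returns a list of (type, start, end, text) tuples, where start and end
    are token indices and end is exclusive.
    """
    spans = []
    start, label = None, None
    # A trailing "O" sentinel flushes a span that runs to the last token.
    for i, tag in enumerate(list(tags) + ["O"]):
        # An I- tag of the same type continues the open span.
        if tag.startswith("I-") and label == tag[2:]:
            continue
        # Anything else closes the open span, if there is one.
        if label is not None:
            spans.append((label, start, i, " ".join(tokens[start:i])))
            start, label = None, None
        # B- opens a new span; a stray I- is leniently treated like B-.
        if tag.startswith("B-") or tag.startswith("I-"):
            start, label = i, tag[2:]
    return spans


tokens = ["Ralph", "Grishman", "works", "at", "New", "York", "University", "."]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG", "I-ORG", "O"]
spans = bio_to_spans(tokens, tags)
# spans == [("PER", 0, 2, "Ralph Grishman"), ("ORG", 4, 7, "New York University")]
```

A sequence model would predict the `tags` list; the decoder only shows how spans and types are read off once the tags exist.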

Practical relevance

Information extraction turns text into queryable data for applications such as biomedical literature mining, financial and news analytics, building knowledge graphs, and populating databases from documents, making large volumes of unstructured text usable by downstream systems.

History

Information extraction was shaped by the Message Understanding Conferences (MUC) of the late 1980s and 1990s, which defined tasks such as named entity recognition and template filling and introduced standardized evaluation. The field moved from hand-built rules to statistical sequence models and later neural methods, while keeping its task structure.

Key figures

  • Ralph Grishman
  • Beth Sundheim
  • Christopher D. Manning
  • Daniel Jurafsky

Seminal works

  • grishman1996
  • jurafsky2023

Frequently asked questions

What is named entity recognition?
Named entity recognition is the task of finding and classifying spans of text that name real-world entities, such as people, organizations, locations, and dates. It is usually a first step in information extraction, since many relations and events are stated in terms of these entities.
How is information extraction evaluated?
Extraction is typically evaluated with precision (what fraction of extracted items are correct) and recall (what fraction of the correct items were extracted), often combined into an F-measure. This reflects the trade-off between extracting too little and extracting incorrect information.
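
The measures described above can be computed directly from sets of predicted and gold-standard items. The sketch below assumes exact-match scoring, where an extracted span counts as correct only if its type and boundaries both match; the function name `precision_recall_f1` and the toy spans are illustrative, and the F-measure shown is the balanced F1.

```python
def precision_recall_f1(predicted, gold):
    """Exact-match precision, recall, and F1 over two collections of items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Items are (type, start, end) spans; the ORG prediction has a wrong boundary.
gold = {("PER", 0, 2), ("ORG", 4, 7), ("LOC", 9, 10)}
predicted = {("PER", 0, 2), ("ORG", 4, 6)}

precision, recall, f1 = precision_recall_f1(predicted, gold)
# precision = 1/2, recall = 1/3, F1 = 0.4
```

Here one of two predictions is correct, so precision is 0.5, while only one of three gold items is found, so recall is about 0.33. An extractor that emitted more spans could raise recall at the cost of precision.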
