Automatic Speech Recognition

Transcribing spoken language into text by combining acoustic models of the speech signal with language models of word sequences, from hidden Markov model systems to end-to-end neural recognizers.

Definition

Automatic speech recognition is the computational task of converting an acoustic speech signal into a sequence of words.

Scope

Covers the conversion of audio to text: acoustic feature extraction, acoustic and pronunciation modeling, the role of the language model, decoding, and the shift from hidden Markov model systems to end-to-end neural recognition. It addresses evaluation by word error rate and the importance of shared corpora. Speech synthesis and downstream understanding are covered in sibling topics.

Core questions

How does the acoustic signal get mapped to candidate words?
How do acoustic and language models combine in recognition?
Why did neural and end-to-end models displace HMM-based systems?
How is recognition accuracy measured by word error rate?

Key concepts

acoustic model
language model
feature extraction
hidden Markov model
decoding
end-to-end recognition
word error rate
pronunciation model

Key theories

Acoustic and language model combination: Recognition selects the word sequence maximizing the product of an acoustic model's likelihood and a language model's prior, the noisy-channel formulation of speech recognition.
Neural sequence modeling for speech: Recurrent and attention-based networks model the temporal structure of speech directly, enabling end-to-end recognition that learns acoustic and linguistic patterns jointly.
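
The noisy-channel combination described above can be illustrated with a small rescoring sketch. Recognizers usually work in log space, so the product of the acoustic likelihood P(X | W) and the language model prior P(W) becomes a sum of log scores. The candidate sentences, their probabilities, and the `lm_weight` scaling factor below are hypothetical values chosen for illustration; they do not come from any real system or from the source.

```python
import math

# Hypothetical toy scores, for illustration only.
# "acoustic" stands in for P(X | W) and "lm" for the prior P(W).
CANDIDATES = {
    "recognize speech": {"acoustic": 0.20, "lm": 0.010},
    "wreck a nice beach": {"acoustic": 0.25, "lm": 0.0001},
}


def combined_log_score(acoustic, lm, lm_weight=1.0):
    """Return log P(X | W) + lm_weight * log P(W).

    lm_weight is an assumed tuning knob; with lm_weight=1.0 this is
    exactly the log of the noisy-channel product.
    """
    return math.log(acoustic) + lm_weight * math.log(lm)


def best_hypothesis(candidates, lm_weight=1.0):
    """Pick the word sequence with the highest combined log score."""
    return max(
        candidates,
        key=lambda words: combined_log_score(
            candidates[words]["acoustic"],
            candidates[words]["lm"],
            lm_weight,
        ),
    )


print(best_hypothesis(CANDIDATES))                 # language model included
print(best_hypothesis(CANDIDATES, lm_weight=0.0))  # acoustic score only
```

In this toy setup the acoustically stronger candidate wins only when the language model is ignored. Once the prior is included, the more plausible word sequence is selected, which is the point of combining the two models.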

History

Speech recognition was a major driver of statistical methods, with IBM's HMM-based systems and shared corpora such as the Wall Street Journal collection (1992) enabling steady, measurable progress. Deep neural acoustic models around 2010 and subsequent end-to-end architectures sharply reduced error rates and brought recognition into everyday devices.

Debates

Modular versus end-to-end recognition: Whether to keep separate acoustic, pronunciation, and language models or to train a single end-to-end network. End-to-end systems now lead when enough training data is available, but they can be harder to adapt.

Key figures

Frederick Jelinek
Janet Baker
Daniel Jurafsky
James H. Martin

Seminal works

paul1992
jurafsky2025

Frequently asked questions

What is word error rate?: Word error rate measures recognition quality as the number of substituted, deleted, and inserted words divided by the number of words in the reference transcript, so lower values indicate more accurate transcription. Because insertions count as errors, the rate can exceed 100%.
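
As a minimal sketch of how word error rate might be computed, the function below aligns the two transcripts with word-level edit distance and divides the error count by the reference length. It assumes whitespace tokenization and applies no text normalization (casing, punctuation). Real scoring tools typically normalize first, so treat this as an illustration and not a reference implementation. The name `word_error_rate` is chosen here for the example.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

In the usage line, the hypothesis drops one word from a six-word reference, so the result is one error over six words. A hypothesis with many extra words can push the value above 1.0, since insertions add to the numerator while the denominator stays fixed at the reference length.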