Automatic Speech Recognition
Transcribing spoken language into text by combining acoustic models of the speech signal with language models of word sequences, from hidden Markov model systems to end-to-end neural recognizers.
Definition
Automatic speech recognition is the computational task of converting an acoustic speech signal into a sequence of words.
Scope
Covers the conversion of audio to text: acoustic feature extraction, acoustic and pronunciation modeling, the role of the language model, decoding, and the shift from hidden Markov model systems to end-to-end neural recognition. It addresses evaluation by word error rate and the importance of shared corpora. Speech synthesis and downstream understanding are covered in sibling topics.
Core questions
- How does the acoustic signal get mapped to candidate words?
- How do acoustic and language models combine in recognition?
- Why did neural and end-to-end models displace HMM-based systems?
- How does word error rate measure recognition accuracy?
Key concepts
- acoustic model
- language model
- feature extraction
- hidden Markov model
- decoding
- end-to-end recognition
- word error rate
- pronunciation model
Key theories
- Acoustic and language model combination
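- Recognition selects the word sequence that maximizes the product of the acoustic model's likelihood (the probability of the audio given the words) and the language model's prior (the probability of the words themselves). This is the noisy-channel formulation of speech recognition.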
- Neural sequence modeling for speech
- Recurrent and attention-based networks model the temporal structure of speech directly, enabling end-to-end recognition that learns acoustic and linguistic patterns jointly.
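The noisy-channel combination rule in the first theory above can be written as a short derivation. The symbols are assumptions introduced here for illustration: X stands for the sequence of acoustic feature vectors and W for a candidate word sequence.

```latex
\hat{W}
  = \arg\max_{W} P(W \mid X)
  = \arg\max_{W} \frac{p(X \mid W)\, P(W)}{p(X)}
  = \arg\max_{W} \underbrace{p(X \mid W)}_{\text{acoustic model}} \; \underbrace{P(W)}_{\text{language model}}
```

The denominator p(X) can be dropped because it does not depend on W. An end-to-end recognizer, by contrast, models P(W | X) directly with a single network rather than factoring it this way.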
History
Speech recognition was a major driver of statistical methods in language processing. IBM's HMM-based systems of the 1970s and 1980s, together with shared corpora such as the Wall Street Journal collection (1992), enabled steady, measurable progress. Deep neural acoustic models, adopted around 2010, and subsequent end-to-end architectures sharply reduced error rates and brought recognition into everyday devices.
Debates
- Modular versus end-to-end recognition
- Whether to keep separate acoustic, pronunciation, and language models or to train a single end-to-end network. End-to-end systems now lead when enough training data is available, but they can be harder to adapt to new domains or vocabularies.
Key figures
- Frederick Jelinek
- Janet Baker
- Daniel Jurafsky
- James H. Martin
Related topics
Seminal works
- paul1992
- jurafsky2025
Frequently asked questions
- What is word error rate?
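- Word error rate is the minimum number of word substitutions, deletions, and insertions in the recognizer's output relative to a reference transcript, divided by the number of words in the reference. Lower values indicate more accurate transcription, and because insertions count as errors, the rate can exceed 100%.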
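The word error rate described above can be sketched as a word-level edit distance divided by the reference length. This is a minimal illustration, not a standard scoring tool: the function name `word_error_rate` and the example sentences are hypothetical, and real evaluations usually normalize text (case, punctuation) before scoring.

```python
from typing import Sequence


def word_error_rate(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """Return (substitutions + deletions + insertions) / len(reference)."""
    if not reference:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (initially none) and the first j hypothesis words.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i] + [0] * len(hypothesis)
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)  # 0 cost if words match
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(reference)


# Hypothetical example: one substitution ("sat" -> "sit") and one deletion ("the").
ref = "the cat sat on the mat".split()
hyp = "the cat sit on mat".split()
print(f"WER = {word_error_rate(ref, hyp):.3f}")  # prints "WER = 0.333"
```

Dividing by the reference length rather than the hypothesis length is what allows the value to exceed 1.0 when the recognizer inserts many extra words.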