ScholarGate

Automatic Speech Recognition

Transcribing spoken language into text by combining acoustic models of the speech signal with language models of word sequences, from hidden Markov model systems to end-to-end neural recognizers.


Definition

Automatic speech recognition is the computational task of converting an acoustic speech signal into a sequence of words.

Scope

Covers the conversion of audio to text: acoustic feature extraction, acoustic and pronunciation modeling, the role of the language model, decoding, and the shift from hidden Markov model systems to end-to-end neural recognition. It addresses evaluation by word error rate and the importance of shared corpora. Speech synthesis and downstream understanding are covered in sibling topics.

Core questions

  • How does the acoustic signal get mapped to candidate words?
  • How do acoustic and language models combine in recognition?
  • Why did neural and end-to-end models displace HMM-based systems?
  • How is recognition accuracy measured by word error rate?

Key concepts

  • acoustic model
  • language model
  • feature extraction
  • hidden Markov model
  • decoding
  • end-to-end recognition
  • word error rate
  • pronunciation model

Key theories

Acoustic and language model combination
Recognition selects the word sequence W that maximizes P(O | W) · P(W), the product of the acoustic model's likelihood of the observed audio O given W and the language model's prior probability of W. This is the noisy-channel formulation of speech recognition.
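As a minimal sketch of this combination, the Python below picks the best of a few candidate transcripts by adding log-probabilities, which is equivalent to multiplying the probabilities. The candidate strings, their scores, and the names `combined_score`, `best_hypothesis`, and `lm_weight` are all illustrative assumptions, not values or APIs from any real recognizer. The `lm_weight` scaling factor is an extra knob often used in practice and is not part of the basic formulation; setting it to 1.0 recovers the plain product.

```python
# Hypothetical candidates with made-up log-probabilities (illustrative only).
# "acoustic" stands for log P(O | W), "lm" stands for log P(W).
CANDIDATES = {
    "recognize speech": {"acoustic": -12.0, "lm": -4.0},
    "wreck a nice beach": {"acoustic": -11.5, "lm": -9.0},
}


def combined_score(acoustic_logp, lm_logp, lm_weight=1.0):
    """Return log P(O | W) + lm_weight * log P(W).

    With lm_weight == 1.0 this is the log of the noisy-channel product.
    """
    return acoustic_logp + lm_weight * lm_logp


def best_hypothesis(candidates, lm_weight=1.0):
    """Return the candidate word sequence with the highest combined score."""
    return max(
        candidates,
        key=lambda words: combined_score(
            candidates[words]["acoustic"], candidates[words]["lm"], lm_weight
        ),
    )


if __name__ == "__main__":
    print(best_hypothesis(CANDIDATES))                 # acoustic and language model together
    print(best_hypothesis(CANDIDATES, lm_weight=0.0))  # acoustic score alone
```

In this toy setup the acoustically closer candidate loses once the language model prior is included, which is the point of combining the two scores.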
Neural sequence modeling for speech
Recurrent and attention-based networks model the temporal structure of speech directly, enabling end-to-end recognition that learns acoustic and linguistic patterns jointly.

History

Speech recognition was a major driver of statistical methods in language processing. IBM's HMM-based systems and shared corpora such as the Wall Street Journal collection (1992) made progress steady and measurable. Deep neural acoustic models, introduced around 2010, and the end-to-end architectures that followed sharply reduced error rates and brought recognition into everyday devices.

Debates

Modular versus end-to-end recognition
Whether to keep separate acoustic, pronunciation, and language models or to train a single end-to-end network. End-to-end systems now lead on accuracy when enough training data is available, but they can be harder to adapt to new domains or vocabularies.

Key figures

  • Frederick Jelinek
  • Janet Baker
  • Daniel Jurafsky
  • James H. Martin

Seminal works

  • paul1992
  • jurafsky2025

Frequently asked questions

What is word error rate?
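Word error rate measures recognition quality as the number of word substitutions (S), deletions (D), and insertions (I) in the system output relative to a reference transcript, divided by the number of words in the reference (N): WER = (S + D + I) / N. Lower values indicate more accurate transcription. Because insertions are counted, the rate can exceed 100%.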
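One way to compute this is as a word-level edit distance between the reference and the recognizer output, divided by the reference length. The sketch below assumes whitespace tokenization and exact string matching, with no case or punctuation normalization; the function name `word_error_rate` is a hypothetical choice, and real scoring tools typically normalize text first.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / reference word count.

    Uses dynamic-programming edit distance over words, one row at a time.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

The example call has two errors against a six-word reference, so its word error rate is 2/6.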
