ScholarGate

Speech Synthesis

Generating natural-sounding speech from text by combining linguistic front-end analysis (normalization, pronunciation, and prosody) with waveform generation, using methods that range from concatenative to neural.

Definition

Speech synthesis, or text-to-speech, is the computational generation of an intelligible and natural speech signal from input text.

Scope

Covers text-to-speech synthesis: the front-end that normalizes text and predicts pronunciation and prosody, and the back-end that produces the waveform, spanning concatenative, parametric, and neural approaches. It addresses grapheme-to-phoneme conversion and prosodic modeling. Speech recognition is covered in a sibling topic.

Core questions

  • How is written text normalized and converted to pronunciations?
  • How is prosody — rhythm, stress, and intonation — predicted and rendered?
  • How do concatenative, parametric, and neural synthesis differ?
  • How is synthesized speech evaluated for intelligibility and naturalness?

Key concepts

  • text normalization
  • grapheme-to-phoneme conversion
  • prosody
  • concatenative synthesis
  • parametric synthesis
  • neural vocoder
  • intelligibility
  • naturalness

Key theories

Front-end linguistic processing
Converting raw text into a linguistic specification through normalization, grapheme-to-phoneme conversion, and prosody prediction before any waveform is generated.
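The normalization step can be illustrated with a minimal sketch. It assumes a tiny hand-written abbreviation table and only handles integers from 0 to 999; the names `ABBREVIATIONS`, `number_to_words`, and `normalize` are hypothetical and chosen for illustration. Real front-ends use far larger rule sets and context to resolve ambiguity (for example, "St." as "street" or "saint").

```python
import re

# Hypothetical, tiny abbreviation table (an assumption for this sketch).
# Real systems disambiguate by context, e.g. "St." as "street" vs. "saint".
ABBREVIATIONS = {"dr.": "doctor", "st.": "street", "mr.": "mister"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999 as English words."""
    if not 0 <= n <= 999:
        raise ValueError("this sketch only handles 0..999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"
    hundreds, rest = divmod(n, 100)
    head = f"{ONES[hundreds]} hundred"
    return head if rest == 0 else f"{head} {number_to_words(rest)}"


def normalize(text: str) -> str:
    """Lowercase, expand known abbreviations, and spell out small integers."""
    tokens = []
    for token in text.lower().split():
        if token in ABBREVIATIONS:
            tokens.append(ABBREVIATIONS[token])
        elif re.fullmatch(r"\d{1,3}", token):
            tokens.append(number_to_words(int(token)))
        else:
            # Strip surrounding punctuation and keep the word itself.
            tokens.append(token.strip(".,;:!?"))
    return " ".join(t for t in tokens if t)


print(normalize("Dr. Smith lives at 221 Baker St."))
```

The output of this stage is plain words, which a later stage would convert to phonemes and annotate with prosody.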
Waveform generation paradigms
Producing audio by concatenating recorded units, by generating it from statistical parametric models, or by using neural networks that generate the waveform directly, which yields high naturalness.
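As a rough illustration of the concatenative idea only, the sketch below joins short sample sequences with a linear crossfade at each boundary so the joins are less abrupt. The function names, the use of sine tones as stand-ins for recorded units, and the linear fade are all assumptions made for this example; production systems also select units by cost and smooth pitch and spectrum at the joins.

```python
import math


def sine_unit(freq_hz: float, duration_s: float, sample_rate: int = 16000) -> list[float]:
    """Stand-in for a recorded unit: a plain sine tone (assumption for the sketch)."""
    n = int(duration_s * sample_rate)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


def concatenate_units(units: list[list[float]], overlap: int) -> list[float]:
    """Join units, blending `overlap` samples linearly at each boundary."""
    if not units:
        return []
    output = list(units[0])
    for unit in units[1:]:
        if overlap == 0:
            output.extend(unit)
            continue
        if overlap > min(len(output), len(unit)):
            raise ValueError("overlap is longer than one of the units")
        tail_start = len(output) - overlap
        for i in range(overlap):
            # Weight moves from the previous unit toward the next unit.
            fade_in = (i + 1) / (overlap + 1)
            output[tail_start + i] = (
                output[tail_start + i] * (1 - fade_in) + unit[i] * fade_in
            )
        # Append the part of the next unit that was not blended.
        output.extend(unit[overlap:])
    return output


units = [sine_unit(220.0, 0.5), sine_unit(330.0, 0.5)]
audio = concatenate_units(units, overlap=160)
print(len(audio))
```

Each boundary shortens the result by `overlap` samples, since the overlapping regions are merged rather than played twice.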

History

Early systems used rule-based formant synthesis, followed by concatenative methods that stitched together recorded units; Taylor surveys both thoroughly. Statistical parametric synthesis improved flexibility in the 2000s, and neural waveform models in the late 2010s produced speech approaching human naturalness.

Debates

Naturalness versus controllability
Neural synthesis is highly natural but can be harder to control for specific prosody or speaker traits than earlier parametric methods, posing a trade-off for expressive applications.

Key figures

  • Paul Taylor
  • Daniel Jurafsky
  • James H. Martin

Seminal works

  • taylor2009
  • jurafsky2025

Frequently asked questions

What is grapheme-to-phoneme conversion?
It is the step that predicts how written words are pronounced, mapping letters to phonetic symbols. It is essential because spelling is an imperfect guide to pronunciation, especially for names and unfamiliar words.
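A minimal sketch of one common arrangement, assuming a pronunciation lexicon with a letter-by-letter fallback for unknown words, is shown below. The lexicon entries, the ARPAbet-style symbols, the one-letter rules, and the function names are hypothetical and kept deliberately naive; practical systems typically use large dictionaries plus a trained model for out-of-vocabulary words.

```python
# Hypothetical mini-lexicon with ARPAbet-style symbols (illustrative entries).
LEXICON = {
    "speech": ["S", "P", "IY", "CH"],
    "though": ["DH", "OW"],
    "cat": ["K", "AE", "T"],
}

# Naive one-letter-at-a-time rules (an assumption for this sketch only).
LETTER_RULES = {
    "a": "AE", "b": "B", "c": "K", "d": "D", "e": "EH", "f": "F", "g": "G",
    "h": "HH", "i": "IH", "j": "JH", "k": "K", "l": "L", "m": "M", "n": "N",
    "o": "AA", "p": "P", "q": "K", "r": "R", "s": "S", "t": "T", "u": "AH",
    "v": "V", "w": "W", "x": "K S", "y": "Y", "z": "Z",
}


def letter_to_sound(word: str) -> list[str]:
    """Fallback: map each letter independently, ignoring context."""
    phonemes = []
    for letter in word.lower():
        # Letters without a rule (digits, hyphens) are skipped.
        phonemes.extend(LETTER_RULES.get(letter, "").split())
    return phonemes


def g2p(word: str) -> list[str]:
    """Look the word up in the lexicon first, then fall back to letter rules."""
    key = word.lower()
    if key in LEXICON:
        return list(LEXICON[key])
    return letter_to_sound(key)


print(g2p("though"))              # lexicon pronunciation
print(letter_to_sound("though"))  # what spelling alone would suggest
```

Comparing the two printed results for "though" shows why spelling alone is an imperfect guide: the letter rules produce six symbols where the lexicon entry has two.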
