Speech Synthesis
Generating natural-sounding speech from text by combining linguistic front-end analysis (normalization, pronunciation, and prosody) with waveform generation, whether concatenative, parametric, or neural.
Definition
Speech synthesis, or text-to-speech, is the computational generation of an intelligible and natural speech signal from input text.
Scope
Covers text-to-speech synthesis: the front-end, which normalizes text and predicts pronunciation (grapheme-to-phoneme conversion) and prosody, and the back-end, which produces the waveform using concatenative, parametric, or neural approaches. Speech recognition is covered in a sibling topic.
Core questions
- How is written text normalized and converted to pronunciations?
- How is prosody — rhythm, stress, and intonation — predicted and rendered?
- How do concatenative, parametric, and neural synthesis differ?
- How is synthesized speech evaluated for intelligibility and naturalness?
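For the evaluation question above, naturalness is commonly summarized with a mean opinion score (MOS) computed from listener ratings, and intelligibility is often estimated from the word error rate of listeners' transcriptions of the synthesized audio. The following is a minimal sketch of both calculations. The function names, the 1 to 5 rating scale, and the normal-approximation interval are assumptions made for illustration, not a prescribed protocol.

```python
import statistics


def mean_opinion_score(ratings):
    """Mean of listener ratings plus a normal-approximation 95% half-width."""
    mean = statistics.mean(ratings)
    half_width = 1.96 * statistics.stdev(ratings) / len(ratings) ** 0.5
    return mean, half_width


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # previous[j] is the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Hypothetical listener ratings and one hypothetical transcription.
mos, ci = mean_opinion_score([4, 5, 4, 3, 4])
wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
print(f"MOS {mos:.2f} +/- {ci:.2f}, WER {wer:.2f}")
```

A lower word error rate suggests higher intelligibility, while a higher MOS suggests higher perceived naturalness; the two can diverge, which is why they are usually reported separately.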
Key concepts
- text normalization
- grapheme-to-phoneme conversion
- prosody
- concatenative synthesis
- parametric synthesis
- neural vocoder
- intelligibility
- naturalness
Key theories
- Front-end linguistic processing
- Converting raw text into a linguistic specification through normalization, grapheme-to-phoneme conversion, and prosody prediction before any waveform is generated.
- Waveform generation paradigms
- Producing audio by concatenating recorded speech units, by rendering acoustic parameters predicted by a statistical model, or by using neural networks that generate the waveform samples directly, which typically yields the most natural-sounding output.
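The division of labor between the two stages above can be sketched in a few lines. This is a toy illustration under stated assumptions: the lexicon, the expansion table, the pitch and duration numbers, and the function names are all hypothetical, and the back-end is a placeholder that renders each phoneme's pitch target as a sine tone rather than an instance of any of the three real paradigms.

```python
import math
import re

SAMPLE_RATE = 16000  # Hz; arbitrary choice for this toy example

# Hypothetical mini-lexicon and expansion table (illustrative entries only).
LEXICON = {
    "doctor": ["D", "AA", "K", "T", "ER"],
    "lee": ["L", "IY"],
    "has": ["HH", "AE", "Z"],
    "two": ["T", "UW"],
    "cats": ["K", "AE", "T", "S"],
}
EXPANSIONS = {"dr.": "doctor", "2": "two"}


def normalize(text):
    """Front-end step 1: expand abbreviations and digits, drop punctuation."""
    words = []
    for token in text.lower().split():
        token = EXPANSIONS.get(token, token)
        token = re.sub(r"[^a-z]", "", token)
        if token:
            words.append(token)
    return words


def front_end(text):
    """Text -> linguistic specification: (phoneme, duration_s, f0_hz) tuples."""
    # Step 2: pronunciation by lexicon lookup (raises KeyError if a word is missing).
    phonemes = [p for word in normalize(text) for p in LEXICON[word]]
    spec = []
    for i, phoneme in enumerate(phonemes):
        # Step 3: toy prosody. Pitch falls from 140 Hz to 100 Hz across the
        # utterance, and the final phoneme is lengthened.
        position = i / max(len(phonemes) - 1, 1)
        f0 = 140.0 - 40.0 * position
        duration = 0.16 if i == len(phonemes) - 1 else 0.08
        spec.append((phoneme, duration, f0))
    return spec


def back_end(spec):
    """Placeholder back-end: one sine tone per phoneme, joined end to end."""
    samples = []
    for _phoneme, duration, f0 in spec:
        n = int(round(duration * SAMPLE_RATE))
        samples.extend(math.sin(2 * math.pi * f0 * t / SAMPLE_RATE) for t in range(n))
    return samples


spec = front_end("Dr. Lee has 2 cats.")
audio = back_end(spec)
print(len(spec), "phonemes,", len(audio) / SAMPLE_RATE, "seconds")
```

The point of the sketch is the interface: the back-end sees only the linguistic specification, never the raw text, so either stage could in principle be replaced without touching the other.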
History
Early systems used rule-based formant synthesis, followed by concatenative methods that stitched together recorded speech units; Taylor surveys both thoroughly. Statistical parametric synthesis improved flexibility in the 2000s, and neural waveform models in the late 2010s produced speech approaching human naturalness.
Debates
- Naturalness versus controllability
- Neural synthesis is highly natural, but steering it toward a specific prosody or speaker trait can be harder than with earlier parametric methods, which poses a trade-off for expressive applications.
Key figures
- Paul Taylor
- Daniel Jurafsky
- James H. Martin
Related topics
Seminal works
- taylor2009
- jurafsky2025
Frequently asked questions
- What is grapheme-to-phoneme conversion?
- It is the step that predicts how written words are pronounced, mapping letters to phonetic symbols. It is essential because spelling is an imperfect guide to pronunciation, especially for names and unfamiliar words.
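A minimal sketch of this step, assuming a dictionary lookup with a crude letter-by-letter fallback for words the dictionary lacks. The lexicon entries, the ARPAbet-style symbols, the rule table, and the name `g2p` are illustrative assumptions; practical systems typically pair a large pronunciation lexicon with a trained model rather than one-letter rules.

```python
# Hypothetical pronunciation lexicon (ARPAbet-style symbols, illustrative entries).
LEXICON = {
    "though": ["DH", "OW"],
    "through": ["TH", "R", "UW"],
    "cough": ["K", "AO", "F"],
}

# Naive one-letter-to-sound fallback rules; deliberately crude.
LETTER_TO_SOUND = {
    "a": "AE", "b": "B", "c": "K", "d": "D", "e": "EH", "f": "F", "g": "G",
    "h": "HH", "i": "IH", "j": "JH", "k": "K", "l": "L", "m": "M", "n": "N",
    "o": "AA", "p": "P", "q": "K", "r": "R", "s": "S", "t": "T", "u": "AH",
    "v": "V", "w": "W", "x": "K S", "y": "Y", "z": "Z",
}


def g2p(word):
    """Return the lexicon pronunciation, or fall back to letter rules."""
    word = word.lower()
    if word in LEXICON:
        return LEXICON[word]
    phonemes = []
    for letter in word:
        # Characters without a rule (digits, hyphens) are skipped.
        phonemes.extend(LETTER_TO_SOUND.get(letter, "").split())
    return phonemes


for word in ["though", "through", "cough", "tough"]:
    print(word, g2p(word))
```

The three lexicon words share the spelling "ough" yet receive three different pronunciations, while "tough", which is absent from the toy lexicon, falls through to the letter rules and comes out wrong. That gap between spelling and sound is the reason the step needs more than simple rules.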