Speech Synthesis

Generating natural-sounding speech from text by combining linguistic front-end analysis (normalization, pronunciation, and prosody) with waveform generation, using methods that range from concatenative to neural.

Definition

Speech synthesis, or text-to-speech, is the computational generation of an intelligible and natural speech signal from input text.

Scope

Covers text-to-speech synthesis: the front-end that normalizes text and predicts pronunciation and prosody, and the back-end that produces the waveform, spanning concatenative, parametric, and neural approaches. It addresses grapheme-to-phoneme conversion and prosodic modeling. Speech recognition is covered in a sibling topic.
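
As a rough illustration of the normalization step in the front-end, the sketch below expands a few abbreviations and small integers into spoken-form words. The tables and the function names `number_to_words` and `normalize` are hypothetical and chosen only for this example. Real normalizers also handle dates, currency, decimals, and context-dependent abbreviations, none of which is attempted here.

```python
import re

# Hypothetical, deliberately tiny tables; real systems use far larger ones.
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
# Context-free expansions (assumption): "St." is always read as "Street".
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street"}


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + number_to_words(rest) if rest else "")


def normalize(text: str) -> str:
    """Expand known abbreviations and standalone 1-3 digit numbers."""
    for abbr, expansion in ABBREVIATIONS.items():
        text = text.replace(abbr, expansion)
    # Longer digit strings are left untouched by this pattern.
    return re.sub(r"\b\d{1,3}\b",
                  lambda m: number_to_words(int(m.group())), text)
```

For example, `normalize("Dr. Smith lives at 42 Oak St.")` yields "Doctor Smith lives at forty-two Oak Street". The context-free abbreviation table is the weakest part of the sketch, since "St." can also mean "Saint" and "Dr." can also mean "Drive".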

Core questions

How is written text normalized and converted to pronunciations?
How is prosody — rhythm, stress, and intonation — predicted and rendered?
How do concatenative, parametric, and neural synthesis differ?
How is synthesized speech evaluated for intelligibility and naturalness?

Key concepts

text normalization
grapheme-to-phoneme conversion
prosody
concatenative synthesis
parametric synthesis
neural vocoder
intelligibility
naturalness

Key theories

Front-end linguistic processing: Converting raw text into a linguistic specification through normalization, grapheme-to-phoneme conversion, and prosody prediction before any waveform is generated.
Waveform generation paradigms: Producing audio by concatenating recorded units, by statistical parametric models, or by neural networks that generate the waveform directly for high naturalness.
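
Of the waveform generation paradigms listed above, concatenation is the easiest to sketch in a few lines. The toy example below joins pre-made units end to end and applies a linear crossfade at each joint. It makes several assumptions: sine tones stand in for recorded speech units, `SAMPLE_RATE`, `tone`, and `crossfade_concat` are hypothetical names, and no unit selection is performed. Real concatenative systems choose units from a recorded database and smooth the joins with more sophisticated signal processing.

```python
import math

SAMPLE_RATE = 16000  # Hz; an assumed rate for this toy example


def tone(freq_hz: float, duration_s: float) -> list[float]:
    """Stand-in for a recorded unit: a plain sine tone."""
    n = int(SAMPLE_RATE * duration_s)
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)
            for i in range(n)]


def crossfade_concat(units: list[list[float]], overlap: int) -> list[float]:
    """Join units in order, linearly crossfading `overlap` samples per joint."""
    if not units:
        return []
    if overlap < 0 or any(len(u) < overlap for u in units):
        raise ValueError("overlap must be between 0 and the shortest unit length")
    out = list(units[0])
    for unit in units[1:]:
        for i in range(overlap):
            # Weight ramps up for the incoming unit and down for the outgoing one.
            w = (i + 1) / (overlap + 1)
            out[-overlap + i] = (1 - w) * out[-overlap + i] + w * unit[i]
        out.extend(unit[overlap:])
    return out


units = [tone(220.0, 0.1), tone(330.0, 0.1), tone(440.0, 0.1)]
signal = crossfade_concat(units, overlap=160)
```

Each joint shortens the result by `overlap` samples, so the output length is the total unit length minus `overlap` times the number of joints. Parametric and neural back-ends replace this stitching with a model that generates the signal, which is not shown here.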

History

Early synthesis used rule-based formant methods, followed by concatenative methods that stitched together recorded units; both are surveyed thoroughly by Taylor. Statistical parametric synthesis improved flexibility in the 2000s, and neural waveform models in the late 2010s produced speech approaching human naturalness.

Debates

Naturalness versus controllability: Neural synthesis is highly natural but can be harder to control for specific prosody or speaker traits than earlier parametric methods, posing a trade-off for expressive applications.

Key figures

Paul Taylor
Daniel Jurafsky
James H. Martin

Seminal works

taylor2009
jurafsky2025

Frequently asked questions

What is grapheme-to-phoneme conversion? It is the step that predicts how written words are pronounced, mapping letters to phonetic symbols. It is essential because spelling is an imperfect guide to pronunciation, especially for names and unfamiliar words.
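
A minimal sketch of one common arrangement for this step is a pronunciation lexicon with a letter-to-sound fallback for words the lexicon does not contain. Everything in it is an illustrative assumption: the three-word `LEXICON`, the ARPAbet-style symbols, the one-letter-at-a-time `LETTER_RULES`, and the function names. Practical systems use large dictionaries and trained models for the fallback rather than a fixed letter table.

```python
# Hypothetical mini-lexicon with ARPAbet-style symbols (stress marks omitted).
LEXICON = {
    "cat": ["K", "AE", "T"],
    "though": ["DH", "OW"],
    "through": ["TH", "R", "UW"],
}

# Naive fallback (assumption): each letter maps to a fixed sound,
# ignoring context entirely.
LETTER_RULES = {
    "a": ["AE"], "b": ["B"], "c": ["K"], "d": ["D"], "e": ["EH"],
    "f": ["F"], "g": ["G"], "h": ["HH"], "i": ["IH"], "j": ["JH"],
    "k": ["K"], "l": ["L"], "m": ["M"], "n": ["N"], "o": ["AA"],
    "p": ["P"], "q": ["K"], "r": ["R"], "s": ["S"], "t": ["T"],
    "u": ["AH"], "v": ["V"], "w": ["W"], "x": ["K", "S"], "y": ["Y"],
    "z": ["Z"],
}


def letter_to_sound(word: str) -> list[str]:
    """Spell-it-out fallback: concatenate the fixed sound of each letter."""
    phones: list[str] = []
    for letter in word.lower():
        phones.extend(LETTER_RULES.get(letter, []))  # skip unknown characters
    return phones


def g2p(word: str) -> list[str]:
    """Look the word up in the lexicon, else fall back to letter rules."""
    entry = LEXICON.get(word.lower())
    return list(entry) if entry is not None else letter_to_sound(word)
```

Comparing `g2p("though")` with `letter_to_sound("though")` shows why spelling alone is unreliable: the lexicon gives two phonemes, while the letter table produces six, one per letter. For an out-of-lexicon word such as "zap", the fallback happens to produce a plausible result, but this is not guaranteed in general.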