Figure 1. Unified schematic covering current text-to-speech (TTS) system designs. Colors highlight components for different types of TTS systems. Green components are shared by many types of TTS systems. See Figures 2 (green and blue) and 5 (green and yellow) for specific pathways.
Each component of the figure represents a logical element of the TTS process as it is usually conceived. I start with a description of a generic rule-based formant synthesizer like DECtalk (Figure 1, green). I focus on this pipeline to establish a baseline against which to show the types of changes that have been made over time to improve the technology.
Formant Synthesis from Rules
Formant synthesis systems (and virtually all other TTS systems I discuss) require some form of initial text processing (Figure 1, green). Typically, this involves tokenizing the input text stream into distinct words or tokens and text normalization to convert nonword tokens such as numbers and abbreviations into the words one would speak when reading the tokens aloud. Thus, consider the text input “Dr. Smith lives at 1702 S. Park Drive and can be reached by phone at 555-456-7890.” The first instance of “Dr.” must be converted to the word “doctor,” while the second instance should be replaced with the word “drive.” Given that 1702 S. Park Drive appears to be an address, a likely rendering would be “seventeen oh two south park drive.” The final phone number would be replaced with the words “five five five, four five six, seven eight nine oh,” with commas or other textual markers to indicate the appropriate phrasing for a phone number. Of course, the challenge for text normalization is to derive enough information from the textual input to make accurate guesses about things like phone numbers or addresses.
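The phone-number rendering described above can be sketched in a few lines of Python. This is a hypothetical toy, not any production normalizer: the function name, the digit-word table, and the assumption of an NNN-NNN-NNNN input format are my own.

```python
# Toy sketch of one text-normalization rule: reading a phone number
# digit by digit, with commas marking the group phrasing.
# (Hypothetical code; real TTS front ends use much richer context models.)
DIGIT_WORDS = {"0": "oh", "1": "one", "2": "two", "3": "three", "4": "four",
               "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def normalize_phone(number: str) -> str:
    """Spell out a hyphen-grouped phone number, one word per digit."""
    groups = number.split("-")
    return ", ".join(" ".join(DIGIT_WORDS[d] for d in group)
                     for group in groups)

print(normalize_phone("555-456-7890"))
# -> five five five, four five six, seven eight nine oh
```

Note that “0” is rendered as “oh,” matching how people usually read phone numbers aloud; a rule for years or street addresses would need different conventions, which is exactly why normalization requires guessing the token’s category first.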
A related problem for text normalization is disambiguating the pronunciation of homographs, words spelled alike but pronounced differently. Often, context can provide helpful clues; if someone is “playing a bass,” they are more likely to be a musician than an actor impersonating a fish. But sometimes disambiguation requires much deeper semantic/pragmatic knowledge that cannot easily be guessed from context alone. Is a shiny white bow a holiday decoration or the front of a boat?
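The “bass” example can be sketched as a simple cue-word heuristic. Everything here is illustrative: the cue lists, the function name, and the ARPAbet-style pronunciation strings are my own inventions; real systems rely on part-of-speech taggers and statistical context models rather than hand-written word lists.

```python
# Toy homograph disambiguation for "bass" (hypothetical cue lists;
# real front ends use POS tagging and statistical context models).
MUSIC_CUES = {"playing", "guitar", "band", "strings"}
FISH_CUES = {"caught", "lake", "fishing", "fried"}

def pronounce_bass(context_words):
    """Pick a pronunciation for 'bass' from nearby words."""
    words = {w.lower() for w in context_words}
    if words & MUSIC_CUES:
        return "B EY1 S"   # rhymes with "face": the instrument
    if words & FISH_CUES:
        return "B AE1 S"   # rhymes with "pass": the fish
    return "B AE1 S"       # no reliable cue: fall back to the fish

print(pronounce_bass(["playing", "a"]))   # -> B EY1 S
print(pronounce_bass(["caught", "a"]))    # -> B AE1 S
```

The fallback line is where the approach breaks down: for cases like the “shiny white bow,” no nearby word list settles the question, which is why such examples demand world knowledge beyond local context.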
The tokenized and normalized input text, along with any additional meta information related to prosodic properties (the intonation and timing properties) derived from the initial text processing, is next passed to the text-to-phonetics component (Figure 1, green), which produces a symbolic phonetic representation. In the original rule-based formant synthesis systems like DECtalk, this representation consisted of little more than a string of phoneme symbols along with some formal boundary and intonation symbols. Boundary symbols indicate the degree of acoustic/phonetic separation between two adjacent phonemes. For example, the boundaries between words are often marked by distinct acoustic features; consider the distinction between “gray day” and “grade A.” Moreover, the boundaries between phrases of different types are also marked by phonetic duration differences, pauses, and intonational features such as the rising pitch at the end of many questions or the falling pitch at the end of a declarative sentence.
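The “gray day”/“grade A” contrast can be made concrete with a small sketch. The phoneme strings below use ARPAbet-style symbols with “#” standing in for a word-boundary symbol; this is purely illustrative, as DECtalk’s actual symbol inventory and boundary notation differ.

```python
# Illustrative phoneme strings (ARPAbet-style), with "#" standing in for
# a word-boundary symbol. DECtalk's actual notation differs; this only
# illustrates the idea that boundary placement carries information.
gray_day = ["G", "R", "EY", "#", "D", "EY"]   # "gray day"
grade_a  = ["G", "R", "EY", "D", "#", "EY"]   # "grade A"

def strip_boundaries(phones):
    """Remove boundary symbols, leaving the bare phoneme sequence."""
    return [p for p in phones if p != "#"]

# The two phrases share an identical phoneme sequence; only the boundary
# placement differs, and that placement cues the word-edge acoustics.
print(strip_boundaries(gray_day) == strip_boundaries(grade_a))  # True
print(gray_day == grade_a)                                      # False
```

This is why a bare phoneme string is not enough for a synthesizer: without the boundary symbols, the two phrases would be rendered identically.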
The intonation symbols express the relative locations and types of pitch accents or “tones” relative to the phonetic symbols. Over time, a standardized system has developed based on the concepts of “tones and break indices” (ToBI; e.g., Silverman et al., 1992) that describes the intonational structure of English and other languages in terms of a discrete set of tones corresponding to a relative maximum or minimum in fundamental frequency
Spring 2022 • Acoustics Today 15