(perceived as voice pitch) that aligns to a specific syllable within an utterance. Similarly, break indices are single-digit integers that indicate the relative separation between two elements in an utterance. ToBI-like symbol sets are often used for the boundary and intonation symbols in current TTS systems.
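To make the symbolic representation concrete, here is a minimal sketch (my own illustration; the class and field names are hypothetical and not drawn from any particular TTS system) of how a ToBI-like annotation might be held in code: a pitch accent such as H* attaches to a particular syllable, and a break index from 0 to 4 marks the strength of the juncture after each word.

```python
# Hypothetical data structure for a ToBI-like prosodic annotation:
# pitch accents anchor to syllables; break indices (0-4) mark the
# separation between adjacent words.
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    syllables: list[str]
    pitch_accent: str | None = None     # e.g., "H*" (high pitch accent)
    accent_syllable: int | None = None  # index of the accented syllable
    break_after: int = 1                # break index before the next word

# "Marianna made the marmalade" with high accents on the stressed
# syllables and a full intonation-phrase break (4) at the end.
utterance = [
    Word("Marianna", ["ma", "ri", "an", "na"], "H*", 2, break_after=1),
    Word("made", ["made"], break_after=1),
    Word("the", ["the"], break_after=0),
    Word("marmalade", ["mar", "ma", "lade"], "H*", 0, break_after=4),
]
```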
Next, the phonetics to parameters component (Figure 1, green) maps the symbolic phonetic description of the input text to a numerical representation suitable for input to a vocoder or parametric synthesizer to generate a speech waveform from the numerical parameter values. Whereas the phonetic symbols imply a sequence of related acoustic events, there are no time units at the symbolic level. In a rule-based formant synthesizer like DECtalk, the phonetics to parameters component is responsible for laying out the parameters as a dynamic time-varying sequence with defined temporal coordinates. Typically, parameters are updated every few milliseconds at a constant prespecified rate, for example, every five milliseconds.
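As a concrete illustration of this layout step, the sketch below (my own; the phoneme durations and first-formant targets are illustrative, and the linear interpolation scheme is an assumption rather than DECtalk's actual rules) expands a short segment sequence into parameter frames at a constant 5-millisecond update rate.

```python
# Sketch: expand (phoneme, duration, target) triples into parameter
# frames updated every 5 ms, gliding linearly between targets.
import numpy as np

FRAME_RATE_MS = 5  # constant, prespecified parameter-update interval

# Hypothetical targets: (phoneme, duration in ms, first formant in Hz)
segments = [("ah", 120, 730), ("ee", 150, 270), ("oo", 140, 300)]

frames = []  # one first-formant value per 5-ms frame
for i, (phone, dur_ms, f1) in enumerate(segments):
    n_frames = dur_ms // FRAME_RATE_MS
    next_f1 = segments[i + 1][2] if i + 1 < len(segments) else f1
    # Interpolate from this segment's target toward the next one
    frames.extend(np.linspace(f1, next_f1, n_frames, endpoint=False))

print(f"{len(frames)} frames spanning {len(frames) * FRAME_RATE_MS} ms")
```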
Finally, the parameters to sound component (Figure 1, green), often referred to as a “vocoder,” accepts the parametric representation of speech and generates audio output. In many parametric systems, a source/filter model of speech is adopted wherein a source signal consisting of either a periodic impulse train or white noise is passed through a digital filter representing the human vocal tract.
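The sketch below illustrates that source/filter idea in a few lines of Python (my own illustration, not taken from any particular vocoder; the formant frequencies and bandwidths are textbook-style values for an /ah/-like vowel): a periodic impulse train is passed through a cascade of second-order resonators standing in for the vocal tract.

```python
# Source/filter sketch: impulse-train source -> resonant vocal-tract filter.
import numpy as np
from scipy.signal import lfilter

fs = 16000   # sampling rate (Hz)
f0 = 100     # fundamental frequency (Hz), perceived as voice pitch
dur = 0.5    # duration (s)

# Voiced source: unit impulses once per pitch period. For unvoiced
# sounds, substitute white noise, e.g., source = np.random.randn(n).
n = int(fs * dur)
source = np.zeros(n)
source[:: fs // f0] = 1.0

# Vocal-tract filter: cascade of two-pole resonators, one per formant,
# using illustrative (frequency, bandwidth) pairs for an /ah/-like vowel.
speech = source
for freq, bw in [(730, 80), (1090, 90), (2440, 120)]:
    r = np.exp(-np.pi * bw / fs)            # pole radius from bandwidth
    theta = 2 * np.pi * freq / fs           # pole angle from frequency
    a = [1, -2 * r * np.cos(theta), r * r]  # resonator denominator
    speech = lfilter([1 - r], a, speech)    # gain roughly normalized

speech /= np.max(np.abs(speech))            # scale to [-1, 1] for playback
```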
Application of Text to Speech to Speech-Generating Devices
Formant-based TTS systems were intelligible enough to become widely adopted by assisted communicators in the late 1980s and 1990s, with DECtalk being the most commonly used system in the SGDs of the time (see https://bit.ly/31E9A54). Perfect Paul, which was demonstrably the most intelligible of the DECtalk voices (Green et al., 1986), was the voice of choice for many AAC users. Even women would often choose the male Perfect Paul voice because it was more easily understood by others. Imagine attending a meeting in a conference room with multiple people using SGDs all tuned to Perfect Paul and not being entirely certain whose device had just emitted an important comment! So, although many nonvocal persons now had a voice, they did not have their own voice for communication.
In addition to not providing every AAC user with a unique voice, the formant synthesis systems of the time did not sound particularly human. As I discuss in Diphone Synthesis, a technique called diphone synthesis emerged as one possible way to generate more human-sounding and identity-bearing synthetic speech. But neither formant synthesis nor diphone synthesis addressed another shortcoming, a lack of expressiveness. Attempts were made to create a more expressive output for DECtalk by modifying the synthesis parameters to convey emotional states such as boredom or sadness (Murray and Arnott, 1993), but they were not widely implemented.
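As a hedged illustration of how such parameter modification works (my own sketch, not Murray and Arnott's actual rule set; the parameter names and scale factors are hypothetical), an emotional state can be conveyed by biasing a synthesizer's global settings, for example lowering the mean pitch, narrowing the pitch range, and slowing the speaking rate to suggest sadness.

```python
# Hypothetical global synthesis parameters and emotion adjustments.
baseline = {"f0_mean_hz": 120, "f0_range_hz": 60, "rate_wpm": 180}

emotion_scales = {
    # Multiplicative adjustments applied to the baseline parameters
    "sad":   {"f0_mean_hz": 0.85, "f0_range_hz": 0.6, "rate_wpm": 0.8},
    "bored": {"f0_mean_hz": 0.90, "f0_range_hz": 0.5, "rate_wpm": 0.9},
}

def apply_emotion(params: dict, emotion: str) -> dict:
    """Return a copy of params scaled to suggest the given emotion."""
    scales = emotion_scales[emotion]
    return {k: v * scales.get(k, 1.0) for k, v in params.items()}

print(apply_emotion(baseline, "sad"))
```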
Diphone Synthesis
Diphone systems represented an important bifurcation in TTS technology: the distinction between knowledge-based systems and data-based systems. This distinction can also be described as between rule-based systems where a human expert must design the rules and corpus-based systems where a corpus of speech data provides the
Figure 2. Component pipeline for diphone and other concatenative synthesis methods from Figure 1.