
 niques, at around the same time, to the stuttering malady of King George VI of England.
Semi-intelligible speech was synthesized for the first time at the 1939 New York World’s Fair using the Voder, a customized filter bank controlled by highly trained human operators. Up to this time, it had been assumed that individual sounds were produced individually and strung together like beads on a chain. Then, during World War II, the speech spectrograph was developed for use in clandestine analyses of speech and speakers (cf. Solzhenitsyn’s 1968 classic novel The First Circle for its development and use in speaker recognition in the USSR).
Observations of the acoustic spectrum for continuous speech, seen in Potter et al.’s classic 1947 book Visible Speech, turned this notion on its head. The spectrograms showed three remarkable features of speech, which are demonstrated in Figure 1, bottom. (1) There are no pauses between words (e.g., “we’re shop” and “for sun”). (2) Instead, apparent pauses are due to moments of silence inherent in stop consonants, such as the “p” in “shopping.” (3) Speech sounds, or phonemes, are not independent but overlap in time with the production of their neighbors, producing coarticulation and facilitating the production of rapid speech. A good example is the first word, “we’re,” in which two energy bands converge (Figure 1, arrows). Without pauses and with overlapping sounds, it may seem a wonder that the listener can hear word boundaries. In fact, word and phrase breaks are signaled by modification of the speech sounds themselves, such as the phrase-final lengthening of what would otherwise be a short unstressed vowel in “dresses.”
Myth 2: When Synthesizing Speech, Female Speakers Can Be Treated as Small Male Speakers
In 1952, Peterson and Barney published a study of the vowels of 76 speakers, showing that 10 American English vowels formed separate clusters on a graph of the second versus the first formant frequencies (F2 vs. F1). Researchers then used formant synthesizers to test how closely they could mimic their own speech while tweaking the structure and parameter set. As a result, early examples of synthetic speech sounded very much like their creators (listen to audio clips at http://acousticstoday.org/klattsspeechsynthesis/: examples 4 and 6, Gunnar Fant; examples 7 and 8, John Holmes; example 9, Dennis Klatt, as described in Klatt, 1987). To match women’s and children’s speech, the parameters were simply scaled. However, those synthesized voices were not nearly as
Figure 1. Speech spectrogram of the speech signal (top) and the wideband spectrogram (bottom) for the sentence “We’re shopping for sundresses” spoken by an adult female. The spectrogram shows the energy from 0 to 20 kHz. Red line: more typically used frequency range, 0-5 kHz; blue arrows: regions where the formant transitions indicate coarticulation. See the text in Myths 1 and 4.
natural sounding as the male voices (e.g., example 9, "DEC Talk scaled.") What had gone wrong?
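The separability that Peterson and Barney observed can be illustrated with a toy nearest-centroid classifier in the F1-F2 plane. The centroid values and vowel labels below are rough illustrative approximations of published adult-male averages, not their actual data:

```python
import math

# Approximate adult-male vowel centroids (F1, F2) in Hz. These numbers are
# illustrative stand-ins, loosely in the range of published averages.
CENTROIDS = {
    "i (heed)": (270, 2290),
    "ae (had)": (660, 1720),
    "a (hod)": (730, 1090),
    "u (who'd)": (300, 870),
}

def classify_vowel(f1, f2):
    """Assign a measured (F1, F2) pair to the nearest centroid (Euclidean distance)."""
    return min(CENTROIDS,
               key=lambda v: math.hypot(f1 - CENTROIDS[v][0], f2 - CENTROIDS[v][1]))

print(classify_vowel(290, 2200))  # falls in the /i/ cluster
```

Because the clusters are well separated for a given speaker group, even this crude distance rule sorts measured vowels correctly; the trouble, as the next paragraph explains, begins when one tries to map between speaker groups by simple scaling.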
The assumption, based on Peterson and Barney’s vowel clusters, had been that since men had the lowest F1-F2 frequencies, women next lowest, and children the highest, scalar changes were sufficient to convert synthesizers to sound like women’s and children’s voices. As source-filter models of speech production were developed in the 1950s and 1960s (Fant, 1970; Flanagan, 1972), this experimental generalization made theoretical sense. After all, the formants were the resonances of the vocal tract that filtered the source produced by vibration of the vocal folds. Women’s and children’s vocal tracts were shorter than men’s, so their formants were higher. Their larynges and thus vocal folds were smaller, and so their ranges of phonation frequencies were higher as well.
The missing piece was that not only the phonation frequency but also the details of the voice source differed, both the spectral properties and the cycle-to-cycle variations. Methods were developed to inverse filter speech and derive the glottal waveform, which when used as the source made synthesized speech sound more natural (Holmes, 1973). Defining source parameters typical of men’s or of women’s speech was also helpful (Fant et al., 1985; other studies are summarized in Klatt, 1987). Women were found to be breathier and their glottal waveforms had a longer open quotient. However, as Klatt wrote, “rules for dynamic control of these [voice-source] variables are quite primitive. The limited naturalness of synthetic speech from this and all other
Winter 2016 | Acoustics Today | 49