Figure 4. Unit selection search process for the word “two.” Two phonemes are required: /t/ (HT) and /u/ (UW), along with initial and final silence pseudo-phonemes (0B and 0E). Multiple instances of each phoneme (numbers in boxes) are selected, each of which has two subphonemic “units” (e.g., HTL and HTR). Each unit receives a target cost based on its linguistic appropriateness, and join costs are assigned between units based on their acoustic continuity (gray arrows). The search locates the specific candidate units that minimize the combined target and join costs over the utterance (paths shown with blue arrows).
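To make the search in Figure 4 concrete, here is a minimal sketch of the standard dynamic programming (Viterbi-style) formulation of unit selection. The unit labels, cost functions, and data structures are illustrative assumptions, not the actual ModelTalker implementation.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Find the candidate sequence minimizing total target + join cost.

    targets     -- required unit labels in order, e.g.
                   ["0B", "HTL", "HTR", "UWL", "UWR", "0E"]
    candidates  -- dict mapping each label to its recorded instances
    target_cost -- target_cost(label, cand): linguistic-appropriateness cost
    join_cost   -- join_cost(prev_cand, cand): acoustic-continuity cost
    """
    # First layer: only target costs apply; no predecessor yet.
    layer = {c: (target_cost(targets[0], c), None)
             for c in candidates[targets[0]]}
    history = [layer]
    for label in targets[1:]:
        new_layer = {}
        for c in candidates[label]:
            # Cheapest predecessor once the join cost (gray arrows) is added.
            prev = min(layer, key=lambda p: layer[p][0] + join_cost(p, c))
            new_layer[c] = (layer[prev][0] + join_cost(prev, c)
                            + target_cost(label, c), prev)
        layer = new_layer
        history.append(layer)
    # Trace the minimal-cost path backward (blue arrows in the figure).
    cur = min(layer, key=lambda c: layer[c][0])
    path = []
    for lay in reversed(history):
        path.append(cur)
        cur = lay[cur][1]
    return list(reversed(path))
```

With N candidates per unit, this search costs O(N² × utterance length), which is why practical systems prune the candidate lists to keep the search tractable.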
most SGDs transitioned from proprietary hardware to software running on embedded Microsoft Windows systems. Because of this, most SGDs were also able to include voices provided by Microsoft or third-party voices written to published Microsoft standards.
My laboratory moved to a full unit selection system for voice bankers based on 1,600 utterances of various lengths and composition, comprising roughly one hour of running speech at normal speaking rates. With funding from the National Institute on Disability and Rehabilitation Research and later from the National Institutes of Health (NIH), I was able to offer a free experimental voice-banking service and provided a small number of voices to participants throughout most of the 2000s. Voices built in the laboratory could be incorporated into any Windows-based SGD. I formally began referring to the service as the ModelTalker project (Bunnell et al., 2005). Although the ModelTalker service was the first such service regularly used by ALS patients for voice banking, there are now excellent voice-banking services offered by a variety of commercial TTS companies, notably Acapela.com and Cereproc.com, which also offer voices for languages other than English. I have posted live example voices on the ModelTalker.org website (see https://bit.ly/3C57WpT; it might be slow when the website is busy).
By the late 2000s, unit selection was considered the best available TTS technology. The major voices for services like Siri and Alexa were built on unit selection technology, as were enterprise-grade voices for large business call centers. However, the amount of recorded speech from voice talent needed to create the highest quality general-use voices exceeded tens of hours of running speech and many more hours of studio time. Even then, it was fairly easy to find examples of words that did not sound entirely natural within some specific context. There is no way to anticipate and record all of the possible acoustic phonetic variation within any language, even if factors like vocal effort, voice quality (breathy, hoarse, modal, fry, pressed), speaking rate, articulatory precision, and so forth are held constant. Moreover, for a truly natural-sounding and expressive TTS voice, one would not want to hold those factors constant!
The massive increase in memory density and decrease in memory cost over several decades made it feasible to work with unit selection voices despite their rapidly growing data footprint. But no amount of memory can overcome the combinatorial ceiling that unit selection voices ultimately must hit. This prompted much interest in the possibility of returning to parametric synthesis, but rather than describing dynamic parameter variation with expertly crafted rules, statistical machine-learning techniques could be used to capture the temporal patterning in synthesis parameters automatically. The improvements this effort brought to synthesized speech are discussed next.
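As a toy illustration of that statistical idea, the sketch below fits a least-squares mapping from per-frame linguistic context features to a single stand-in synthesis parameter. Every feature and number here is fabricated for illustration; real statistical parametric systems (HMM- or neural-network-based) learn far richer mappings, but the principle is the same: the parameter trajectories are learned from data rather than written as expert rules.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fabricated per-frame linguistic context features (stand-ins for phone
# identity, position within the phone, neighboring-phone flags, ...):
# 500 frames x 8 features.
X = rng.random((500, 8))

# Fabricated target acoustic parameter per frame (a stand-in for, e.g.,
# log F0 or a spectral coefficient), generated from a hidden linear rule.
true_w = rng.normal(size=8)
y = X @ true_w + 0.05 * rng.normal(size=500)

# "Training": least squares recovers the feature-to-parameter mapping
# from data instead of relying on hand-crafted rules.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# "Synthesis": predict the parameter trajectory for unseen frames.
X_new = rng.random((10, 8))
trajectory = X_new @ w
print(trajectory)
```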