Figure 5. Deep neural network (DNN) TTS pipelines emerging in current research efforts from Figure 1.
A large number of current end-to-end neural TTS systems follow the path from initial text processing through text to parameters and thereafter to a parameters to sound component. In some cases, "text" is taken somewhat broadly, referring either to literal words or characters or to a form in which standard word spellings are replaced with something like International Phonetic Alphabet (IPA) characters to resolve letter-to-sound ambiguity. This is particularly helpful for languages like English that have borrowed words from many other languages, and it also helps when building multitalker and multilanguage systems. Most systems on this path generate Mel-scaled spectrograms as the output of the text to parameters component, relying on classical vocoder methods (e.g., Griffin and Lim, 1984) or DNN-based vocoders to generate audio output from the Mel-scaled spectrograms without explicitly applying a source/filter model. (Note: the Mel scale is a perceptually motivated transformation of linear frequency to a scale with approximately equal pitch steps; see Stevens et al., 1937.) However, a few systems may also generate parameters for alternative vocoders such as the WORLD vocoder (Morise et al., 2016). Although no systems are presently doing this, output in terms of formant synthesis parameters is also conceivable, with the final parameters to sound component being a formant synthesis vocoder.
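As a concrete illustration of this parameters to sound step, the sketch below computes an 80-band Mel spectrogram and then inverts it with Griffin-Lim phase estimation, assuming the open-source librosa library is available; the file name "speech.wav" and the frame settings are illustrative choices, not values from any particular TTS system.

```python
# A minimal sketch of the "parameters to sound" step, assuming librosa;
# the input file and frame settings are illustrative only.
import numpy as np
import librosa

def hz_to_mel(f_hz):
    """Common Mel-scale mapping: approximately equal steps in perceived pitch."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

print(hz_to_mel(1000.0))  # ~1,000 mels at 1 kHz, the scale's reference point

# Compute an 80-band Mel spectrogram, the typical output of the
# "text to parameters" stage in the systems discussed above.
y, sr = librosa.load("speech.wav", sr=22050)  # hypothetical input file
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80
)

# Invert the Mel spectrogram back to audio; mel_to_audio maps the Mel bands
# to a linear-frequency spectrogram and then estimates the missing phase with
# the Griffin and Lim (1984) algorithm.
y_hat = librosa.feature.inverse.mel_to_audio(
    mel, sr=sr, n_fft=1024, hop_length=256
)
```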
Finally, as the ultimate end-to-end DNN TTS approach, there is the path from initial text processing through TTS directly to audio output. An example is the system referred to as end-to-end adversarial TTS (EATS) by Donahue et al. (2020; see https://bit.ly/3wpQBGR for audio examples). There is nothing before the audio generation but a light text-processing stage to handle tokenization and text normalization, perhaps with an additional substitution of IPA word spellings for standard word spellings. The system is complex and requires a very large data corpus and much computer time to train, but the published examples illustrate output that is virtually indistinguishable from human speech. Unfortunately, expressiveness remains a challenge for this technology. Neural TTS systems can learn to express anything that is present in their training data, but generalizing beyond seen expressive modes is an area of active ongoing research (e.g., Skerry-Ryan et al., 2018; see examples at https://bit.ly/30epgeW).
Neural TTS systems come at substantial expense, both in the amount of data needed and in the computational resources required to train the models. Many are currently so resource heavy that they are only usable by well-equipped industry or university laboratories. However, there are elements of this work that are already having an impact, notably the neural vocoder programs, which produce highly natural-sounding speech output given the correct input. It may take a very large amount of data and heavy server load to train these vocoders, but once trained, they can be used with Mel spectrograms generated by many other applications and are able to run in real time on desktop-grade computers.
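As a sketch of that reuse pattern, the following assumes PyTorch and a hypothetical exported model file "vocoder.pt" whose forward pass maps a Mel spectrogram to a waveform; the interface shown is illustrative and is not any specific vocoder's actual API.

```python
# A minimal sketch of reusing a trained neural vocoder, assuming PyTorch and a
# hypothetical TorchScript model "vocoder.pt" that maps an (n_mels, n_frames)
# Mel spectrogram to a waveform.
import torch

vocoder = torch.jit.load("vocoder.pt")  # hypothetical pretrained vocoder
vocoder.eval()

def mel_to_waveform(mel):
    """Run the frozen vocoder without gradients; inference at this scale is
    typically light enough for real-time use on a desktop-grade machine."""
    with torch.no_grad():
        return vocoder(mel.unsqueeze(0)).squeeze(0)  # add/remove batch dim

# The Mel spectrogram can come from any application, not only the TTS front
# end the vocoder was trained alongside.
mel = torch.randn(80, 200)  # stand-in for a real 80-band Mel spectrogram
audio = mel_to_waveform(mel)
```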
Conclusions
The path from rule-based formant synthesis in the 1980s to the DNN voices being studied in research laboratories today represents significant growth in TTS technology. This growth has been followed through the lens of how the improvements impact one of the most exciting potential applications of TTS technology: providing unique personal voices for people who are unable to communicate vocally without assistance. A notable subset of the potential users of TTS technology are those whose