FEATURED ARTICLE
Speech Synthesis: Toward a “Voice” for All
H. Timothy Bunnell
Text to speech (TTS) has become so much a part of our everyday lives, thanks to Alexa, Google, Siri, and many other voices we have come to know (if not always love), that it is difficult to recall a time when it was not so. Synthetic voices like these fill multiple roles today. They deliver announcements of important information over public address systems in noisy places like airports, where high intelligibility of the speech in noise is crucial to ensure the information they carry is heard correctly. A synthetic voice may be the first entity a customer interacts with when contacting a company, and it is important for that voice, as a representative of the company, to present a natural and pleasing quality that reflects the company's image. Synthetic voices serve as the only voice for individuals whose own voice has been lost to injury or to a progressive neurological disease like amyotrophic lateral sclerosis (ALS; also called Lou Gehrig's disease or motor neuron disease [MND]) or who have a congenital dysarthria due to a condition such as cerebral palsy. And TTS voices allow blind or nonliterate users to read content from news stories, books, and computer screens while giving busy people an opportunity to "read" email even when driving their car.
A Framework and Baseline for Text to Speech
These current use cases for TTS voices provide insight into the successes of the underlying technology and also highlight areas where work remains. The need for intelligibility, naturalness, and the ability to convey an individual's vocal identity is obvious from these examples. Less obvious but no less important is the expressiveness of the synthetic speech: the ability to express through intonation or "tone of voice" (Pullin and Hennig, 2015) the intent underlying the words of an utterance.
In this article, I trace how we arrived at the current state of the science for TTS, showing how the technology advanced with the adoption of newer approaches and improved numerical techniques. A natural start is with the work of Klatt (1980), who provided Fortran software for implementing a cascade/parallel formant synthesizer. Klatt (1987) provided a history of TTS conversion, which was remarkable for the inclusion of a collection of audio examples for many of the synthesizers he discussed (see Ramsay, 2019, for an interesting review of early mechanical synthesizers).
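To make the idea of a cascade formant synthesizer concrete, the short sketch below (my own illustration, not Klatt's software) passes an impulse-train "glottal" source through a cascade of second-order resonators, one per formant. The sampling rate, fundamental frequency, and formant values are assumptions chosen to approximate a steady vowel-like sound.

```python
# Illustrative sketch only: a bare-bones cascade formant synthesizer,
# not Klatt's (1980) Fortran implementation. An impulse-train source is
# filtered through second-order resonators, one per formant.
import numpy as np
from scipy.signal import lfilter

FS = 16000                # sampling rate (Hz), assumed
F0 = 110                  # fundamental frequency (Hz), assumed
DUR = 0.5                 # duration (s)
FORMANTS = [(730, 90), (1090, 110), (2440, 170)]  # (frequency, bandwidth) in Hz

def resonator(freq, bw, fs):
    """Second-order resonator coefficients with unity gain at 0 Hz."""
    c = -np.exp(-2.0 * np.pi * bw / fs)
    b = 2.0 * np.exp(-np.pi * bw / fs) * np.cos(2.0 * np.pi * freq / fs)
    a = 1.0 - b - c
    return [a], [1.0, -b, -c]        # numerator, denominator for lfilter

# Impulse train as a crude stand-in for the glottal source.
n = int(FS * DUR)
source = np.zeros(n)
source[::FS // F0] = 1.0

# Cascade branch: each resonator filters the output of the previous one.
signal = source
for freq, bw in FORMANTS:
    num, den = resonator(freq, bw, FS)
    signal = lfilter(num, den, signal)

signal /= np.max(np.abs(signal))     # normalize to avoid clipping
# e.g., soundfile.write("vowel.wav", signal, FS) to listen to the result
```

Chaining the resonators in series is what distinguishes the cascade branch from the parallel branch, in which each resonator filters the source independently and the outputs are summed with individual gains.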
Crucially, the period around the publication of these two articles by Klatt (1980, 1987) marked an important era in the TTS field. First, from a purely commercial perspective, it was arguably during this time that TTS systems became commercially mainstream, largely through improvements in the intelligibility of the speech that they generated.
Second, during this period, TTS technology started to be adopted by nonvocal persons to enhance their ability to communicate with others. One of Klatt's visions for Digital Equipment Corporation's DECtalk system, which emerged directly from his work at MIT, Cambridge, Massachusetts, was its application in augmentative and alternative communication (AAC) devices for communication by individuals who are nonvocal. Until that time, augmented communicators depended mainly on mechanical communication boards that required communicants to point to words or letters to express themselves. Recently, the field has come to refer to these speech-enabled communication devices as speech-generating devices (SGDs), the term I use in this article.
In this article, I present a framework that captures the structure and function of these TTS advances. Throughout, a goal is to focus on the implications for SGD users' communication.
Figure 1 provides a unified framework for discussing modern TTS systems. Each block or component in the