Statistical Parametric Speech Synthesis
As with unit selection synthesis, statistical parametric speech synthesis (SPSS) (Zen et al., 2009) requires a substantial corpus of speech data to be used in training its parametric phonetic models. Unlike unit selection synthesis, however, once the training process is completed, the original speech waveform data are no longer needed. Instead, the SPSS machine-learning process develops models for the acoustic structure of each phoneme. These models are then able to generate the time-varying parameter values for the parameters to sound component of the TTS system. Thus, fully trained SPSS models replace hand-coded rule systems in the phonetics to parameters component in Figure 1. In practice, the SPSS models are commonly sets of hidden Markov models (HMMs), one model for each phoneme, that describe the acoustic structure of the phoneme as a sequence of acoustic states, allowing the time-varying trajectories of parameters to be regenerated from the properties of the state sequence.
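To make that idea concrete, the sketch below (in Python, with made-up state values rather than a trained model) shows how per-state means in a toy phoneme model might be expanded into a frame-by-frame parameter track. Real SPSS systems use context-dependent states and maximum-likelihood parameter generation with dynamic features; this repeat-and-smooth version is only meant to illustrate the flow from state sequence to trajectory.

```python
# Illustrative sketch only: a toy stand-in for HMM-based parameter generation.
# Real SPSS systems use trained, context-dependent HMM states and dynamic
# (delta) features; the state values below are made up for illustration.
import numpy as np

# Hypothetical three-state model for one phoneme: each state stores a mean
# fundamental frequency (Hz), a mean spectral-tilt value, and a duration
# in 5-ms frames.
phoneme_model = [
    {"f0": 110.0, "tilt": -6.0, "frames": 8},    # onset state
    {"f0": 120.0, "tilt": -9.0, "frames": 20},   # steady state
    {"f0": 105.0, "tilt": -12.0, "frames": 10},  # offset state
]

def generate_trajectory(model, key):
    """Expand the state sequence into a frame-by-frame parameter track."""
    stepwise = np.concatenate(
        [np.full(state["frames"], state[key]) for state in model]
    )
    # Smooth the stepwise values so the parameter varies gradually over time,
    # loosely mimicking what dynamic-feature constraints achieve in real SPSS.
    kernel = np.ones(5) / 5.0
    padded = np.pad(stepwise, 2, mode="edge")
    return np.convolve(padded, kernel, mode="valid")

f0_track = generate_trajectory(phoneme_model, "f0")
tilt_track = generate_trajectory(phoneme_model, "tilt")
print(len(f0_track), "frames,", f"F0 {f0_track.min():.1f}-{f0_track.max():.1f} Hz")
```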
The parameters the SPSS models learn are typically those describing the time-varying speech source function (voicing or friction) and moment-to-moment spectral features. The parameters to sound or vocoder component then uses the source and spectral parameters to regenerate audio data via digital filtering.
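That source-filter step can itself be sketched very simply: a pulse train at the fundamental frequency (the voiced source) is passed through a time-varying digital filter shaped by the spectral parameters. The example below, again in Python with illustrative numbers, uses a single resonance as the filter; practical vocoders used in SPSS are considerably more elaborate.

```python
# Minimal source-filter sketch of the "parameters to sound" (vocoder) step.
# Practical SPSS vocoders are far more elaborate; values here are illustrative.
import numpy as np
from scipy.signal import lfilter

fs = 16000                 # sample rate (Hz)
frame_len = fs // 200      # 5-ms frames

def synthesize(f0_track, formant_track):
    """Excite a simple two-pole resonator with a pulse-train voiced source."""
    audio = []
    phase = 0.0
    for f0, formant in zip(f0_track, formant_track):
        # Source: one impulse per pitch period within this frame.
        frame = np.zeros(frame_len)
        for n in range(frame_len):
            phase += f0 / fs
            if phase >= 1.0:
                frame[n] = 1.0
                phase -= 1.0
        # Filter: a single resonance at this frame's formant frequency.
        r = 0.97
        theta = 2 * np.pi * formant / fs
        a = [1.0, -2 * r * np.cos(theta), r * r]
        audio.append(lfilter([1.0], a, frame))
    return np.concatenate(audio)

# Example: a constant 120-Hz voice with a resonance gliding from 500 to 800 Hz.
f0_track = np.full(100, 120.0)
formant_track = np.linspace(500.0, 800.0, 100)
waveform = synthesize(f0_track, formant_track)
```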
SPSS synthesis has several advantages over both rule-based formant synthesis and unit selection. First, because the SPSS models for parameter generation can be trained on a corpus of speech from a single talker, the output of the SPSS voice sounds recognizably like the talker who recorded the corpus. Moreover, because the training process is largely automatic, building multiple personal voices is not especially difficult or labor intensive. Compared with unit selection based on a similar-size speech corpus, particularly for smaller corpora (those having less than four hours of running speech), SPSS voices are not prone to discontinuities at segment boundaries and tend to have more natural-sounding prosodic structure. And because SPSS voices use parametric synthesis, they have the potential for changing characteristics of the voice quality or introducing expressiveness, although this potential is not yet realized.
There are, however, two main drawbacks to SPSS voices. First, the naturalness of the resulting synthetic voice is limited by the ability of the vocoder to reproduce natural-sounding voice quality. Some vocoder output sounds "buzzy" or "mechanical" when compared with unit selection voice quality. Second, in SPSS, each phonetic model represents an average of the acoustic patterns seen across all instances of the same phonetic segment in similar contexts. This averaging tends to obscure some of the natural variability in human speech, leading to more monotonous-sounding speech. SPSS systems often attempt to compensate for this averaging effect by exaggerating or boosting the variability of parameters over time. However, once the natural variability is lost to averaging, it is not really possible to restore it.
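One simple form of that compensation is to rescale each generated parameter track so that its spread over the utterance better matches the spread observed in natural speech. The short sketch below assumes a target standard deviation measured from natural recordings (the value shown is invented) and re-expands an over-smoothed track around its own mean; as noted above, this can restore variability in a statistical sense but not the specific detail that averaging removed.

```python
# Sketch of variance "boosting": rescale a generated parameter track so its
# spread over the utterance matches a target measured from natural speech.
# The target value here is purely illustrative.
import numpy as np

def boost_variance(track, target_std):
    """Re-expand an over-smoothed parameter track around its own mean."""
    mean = track.mean()
    current_std = track.std()
    if current_std == 0:
        return track
    return mean + (track - mean) * (target_std / current_std)

generated_f0 = np.array([118.0, 119.0, 120.0, 120.5, 120.0, 119.0])  # too flat
natural_f0_std = 8.0   # assumed standard deviation from natural speech
livelier_f0 = boost_variance(generated_f0, natural_f0_std)
```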
Despite these two drawbacks, AAC users of ModelTalker voices have generally had favorable reactions to SPSS voices, and the best of the SPSS laboratory TTS systems have been able to produce speech with audio quality closely approaching that of unit selection systems. Any long-term debate about the relative merits of unit selection versus SPSS voices, however, appears to be rapidly becoming moot, particularly as it applies to large commercial-grade TTS voices. This is due to the emergence of new deep-learning models.
Deep Neural Network Speech Synthesis
In the past decade, deep neural networks (DNNs) and deep learning have revolutionized machine learning, producing large-scale improvements in application areas as diverse as speech recognition, machine translation between languages, natural language processing, text summarization, and speech synthesis. Explaining, even grossly, how DNNs function is beyond the scope of this article, but a few examples and consideration of how some models are changing the flow within the TTS system framework shown in Figure 5 may give a reasonable sense of the emerging changes.
In Figure 5, the path from text to phonetics through phonetics to sound is a good place to start because this is the path used by WaveNet (van den Oord et al., 2016), which was one of the first "end-to-end" neural TTS systems. The authors have created an excellent website that describes their work and provides audio examples (see https://bit.ly/3qtNrkm). Training WaveNet required about 25 hours of speech from a single female speaker and days of CPU and GPU processing on Google's servers.