the selected units to form the desired output utterance. If the storage format permits, there may be additional adjustments to the units during the concatenation process. These could include smoothing potential discontinuities across diphone boundaries, adjusting diphone durations according to a timing model, or even adjusting the F0 according to an intonation model. Once the diphones have been assembled and concatenated to form an utterance, additional processing, if any, is applied to map from the diphone storage format to a digital audio waveform.
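To make the concatenation step concrete, the following is a minimal Python sketch of waveform concatenation with a short cross-fade at each diphone boundary. It is illustrative only: the sample rate, fade length, and inventory lookup are assumptions, not details of any particular synthesizer described in this article.

```python
# Illustrative sketch only: concatenating stored diphone waveforms with a
# short linear cross-fade at each boundary to reduce audible discontinuities.
# The sample rate, fade length, and inventory lookup are assumptions.
import numpy as np

SAMPLE_RATE = 16000   # assumed sampling rate (Hz)
FADE_MS = 5           # assumed cross-fade length at each diphone boundary


def concatenate_diphones(diphone_waves):
    """Join a list of diphone waveforms, cross-fading across each boundary."""
    fade_len = int(SAMPLE_RATE * FADE_MS / 1000)
    ramp = np.linspace(0.0, 1.0, fade_len)

    out = diphone_waves[0].astype(np.float64)
    for wave in diphone_waves[1:]:
        wave = wave.astype(np.float64)
        tail = out[-fade_len:]          # end of what has been built so far
        head = wave[:fade_len]          # start of the next diphone
        blended = tail * (1.0 - ramp) + head * ramp
        out = np.concatenate([out[:-fade_len], blended, wave[fade_len:]])
    return out


# Hypothetical usage, assuming an inventory keyed by diphone labels:
# utterance = concatenate_diphones(
#     [inventory["#-t"], inventory["t-uw"], inventory["uw-#"]])  # "two"
```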
Diphone synthesis held one particularly intriguing possibility for SGD users: the ability to capture an individual’s vocal identity. Because only a small amount of recorded speech is needed to create a diphone inventory, it would be possible to inexpensively mass-produce unique diphone voices as long as the process of selecting diphones from recordings could be automated. People using SGDs could have a unique personal voice by selecting a suitable voice donor to do the recording. Moreover, people diagnosed with a condition such as ALS that threatens the loss of their voice could do the recording themselves and thus “bank” their voice for later use as a synthetic voice in an AAC device. In the mid-1990s, my laboratory at the Nemours Children’s Hospital, Delaware, began experimenting with an extension of diphone synthesis (e.g., Bunnell et al., 1998) that would allow ALS patients to bank their voice in this way, a process referred to as “voice banking.”
Diphone TTS voices, although a promising technology, did not generally gain much traction among AAC device manufacturers or SGD users. The small memory footprint of rule-based formant synthesis was certainly an important factor in favor of formant-based TTS voices for AAC manufacturers. Furthermore, although diphone TTS voices did capture the vocal identity of the person who recorded the diphone inventory, they permitted little expressiveness, particularly in systems that used waveform concatenation, and despite capturing voice quality well, diphone synthesis tended not to flow in a natural manner. Moreover, many of the inexpensive diphone TTS systems available in the 1980s and later were less pleasing to listen to than the DECtalk voices that were provided with most AAC devices (e.g., see #29 at https://bit.ly/30n0V6V). That changed, however, with the emergence of unit selection TTS systems in the 1990s.
Unit Selection Text-to-Speech Voices
One of the greatest difficulties with diphone synthesis was the impossibility of selecting a collection of diphones that did not suffer from sometimes jarring discontinuities at concatenation boundaries. This was less of an issue for diphones stored, as in Dixon and Maxey (1968), in a format amenable to substantial adjustments, which could smooth over or entirely eliminate disjunctions by interpolating smoother parameter trajectories at segment boundaries. However, the highest voice quality obtainable from diphone synthesis was for diphones stored as waveform data or, equivalently, as prewindowed PSOLA epochs. Unfortunately, with waveform concatenation, issues such as jarring differences in spectral features, F0, and amplitude at diphone boundaries were common.
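As a rough illustration of the kind of boundary adjustment a parametric storage format allows, the Python sketch below linearly interpolates a per-frame parameter track (for example, F0 or a formant frequency) across a segment boundary. The frame values and window size are assumptions chosen for demonstration, not details of the Dixon and Maxey system.

```python
# Illustrative sketch only: smoothing a per-frame parameter track (e.g., F0
# in Hz) across a segment boundary by replacing the frames around the join
# with a straight-line ramp. Frame values and window size are assumptions.
import numpy as np


def smooth_boundary(left_track, right_track, frames=4):
    """Join two per-frame parameter tracks, interpolating across the join."""
    joined = np.concatenate([left_track, right_track]).astype(np.float64)
    b = len(left_track)                          # boundary frame index
    lo, hi = max(0, b - frames), min(len(joined), b + frames)
    # Replace the window around the boundary with a linear ramp between the
    # values at its edges, removing any abrupt jump at the join.
    joined[lo:hi] = np.linspace(joined[lo], joined[hi - 1], hi - lo)
    return joined


# Example: a 40-Hz jump in F0 at the boundary becomes a gradual glide.
left = np.full(10, 120.0)    # flat 120-Hz tail of one segment
right = np.full(10, 160.0)   # flat 160-Hz head of the next
print(smooth_boundary(left, right))
```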
These issues with waveform concatenation were largely addressed by an extended approach called “unit selection” (e.g., Zen et al., 2009) wherein a large amount of speech from a single individual is recorded and segmented into units that could be diphone sized or smaller. This approach is illustrated in Figure 4 using the word two as the target utterance and assuming each unit is roughly half of a phoneme. The units are stored along with additional features describing the linguistic details of the phoneme or waveform region from which they were drawn, such as the type of word (function vs. content word), syllable stress, syllable location, phrase location, presence and type of pitch accent on the associated syllable, and boundary level for the associated syllable. Because a unit selection database may contain a large number of candidates for each possible unit, there is a much greater chance of finding one or more units that exactly or nearly match the intended output context along all of the coded linguistic dimensions. Moreover, in the process of selecting units for concatenation, it is possible to select the specific candidates that will also minimize spectral discontinuities or sudden jumps in F0 or other factors that cannot be indexed as specific linguistic features.
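The selection itself can be framed as a least-cost search over candidate units, balancing how well each candidate matches the target’s linguistic features against how smoothly adjacent candidates join. The following Python sketch shows one simple way such a search could be organized; the feature names, cost definitions, and weights are illustrative assumptions rather than those of any specific unit selection system.

```python
# Illustrative sketch only: a dynamic-programming (Viterbi-style) search of
# the kind used in unit selection. Each target position has several candidate
# units; the chosen sequence minimizes a target cost (mismatch of linguistic
# features) plus a join cost (F0 and spectral discontinuity between adjacent
# units). Feature names, cost definitions, and weights are assumptions.
import math


def target_cost(candidate, target):
    # One penalty point per linguistic feature (stress, phrase position, ...)
    # that does not match the intended output context.
    return sum(1.0 for k, v in target.items()
               if candidate["features"].get(k) != v)


def join_cost(prev, nxt):
    # Penalize F0 jumps and spectral distance at the concatenation point.
    f0_jump = abs(prev["f0_end"] - nxt["f0_start"]) / 50.0
    spectral = math.dist(prev["spec_end"], nxt["spec_start"])
    return f0_jump + spectral


def select_units(candidates_per_slot, targets):
    """Pick one candidate per slot, minimizing total target + join cost."""
    # best[i][j] = (cumulative cost, index of best predecessor at slot i - 1)
    best = [[(target_cost(c, targets[0]), -1) for c in candidates_per_slot[0]]]
    for i in range(1, len(targets)):
        row = []
        for c in candidates_per_slot[i]:
            tc = target_cost(c, targets[i])
            cost, back = min(
                (best[i - 1][j][0] + join_cost(p, c) + tc, j)
                for j, p in enumerate(candidates_per_slot[i - 1]))
            row.append((cost, back))
        best.append(row)
    # Trace back the lowest-cost path.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates_per_slot[i][j])
        j = best[i][j][1]
    return list(reversed(path))
```

Real systems weight and normalize these costs carefully and prune the candidate lattice for speed, but the underlying trade-off between matching the target context and joining smoothly is the same one described above.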
Unit selection voices came to dominate the commercial TTS voice market in the late 1990s and 2000s because they are much more natural-sounding and intelligible than other commercially available TTS voices. Sometime in the 2000s, most SGD manufacturers included at least a few unit selection voices in their products. Moreover,