Spring2022


information that would otherwise need to be expanded from rules. Or, as seen in Statistical Parametric Speech Synthesis, the corpus can be used to automatically discover the rules through machine-learning algorithms so that no expert is needed. Thus, the rules needed for the phonetics to parameters component of a formant synthesis system required expert knowledge of acoustic phonetics and a lot of hard work. However, corpus-based systems were able to replace much of that work by simply storing the data that would otherwise need to be developed from rules.
As illustrated in Figure 2, diphone synthesis (and related “concatenative” methods) follows a slightly different path within our overall TTS model.
A diphone is the region of speech spanning roughly the middle of one phoneme to the middle of the next phoneme. Figure 3 illustrates this using the word “bob.” The initial and final /b/ segments are relatively stable, as is the /a/ vowel near its center. However, the acoustic structure changes rapidly around the borders between the consonants and the vowel. As long as the phoneme centers are reasonably similar across different phonetic contexts (they really are not, but we are assuming that they are close enough!), then cutting speech up into diphone-sized units ought to allow one to concatenate the diphones in novel ways to produce nearly any utterance. For example, take the [ba] from [bab] and the [at] from “cot” [kat] to create “bought” [bat]. This was the insight that led Dixon and Maxey (1968) to develop a formant diphone synthesizer (see #18 at https://bit.ly/3qxs3uL) that used stored formant synthesis parameters rather than a rule system to generate the parameters prior to synthesis.
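The recombination in the “bob”/“cot” example can be sketched in a few lines of Python. This is only an illustration of the labeling idea, not the Dixon and Maxey (1968) implementation: the function name to_diphones is hypothetical, phonemes are plain strings, and the transitions to and from silence at word edges are ignored.

```python
def to_diphones(phonemes):
    """Return one diphone label for each adjacent pair of phonemes."""
    return [a + b for a, b in zip(phonemes, phonemes[1:])]


# Diphone labels for the two source words used in the text.
bob = to_diphones(["b", "a", "b"])  # [bab] -> "ba", "ab"
cot = to_diphones(["k", "a", "t"])  # [kat] -> "ka", "at"

# Reuse the first diphone of "bob" and the second diphone of "cot".
new_word = [bob[0], cot[1]]

# The recombined labels match what the target word [bat] requires.
assert new_word == to_diphones(["b", "a", "t"])
```

The join between the two units falls in the middle of the shared vowel, which is exactly where the text assumes the acoustics are stable enough to splice.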
Formant synthesis parameters are an interesting choice for the diphone storage because they have several useful properties. (1) They do not require a large amount of storage (a factor that was especially important in 1968!). (2) They are orthogonal, that is, it is possible to change any one parameter value without impacting the values of other parameters. (3) Interpolation between values for any parameter will yield another valid parameter value.
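Properties (2) and (3) are what make smoothing a join between two stored units straightforward. The sketch below is a minimal illustration under stated assumptions: a parameter frame is a dictionary of named values, the names F1 and F2 and their values in hertz are made up for the example, and plain linear interpolation stands in for whatever smoothing a real synthesizer would use.

```python
def interpolate_frames(left, right, steps):
    """Return `steps` frames spaced evenly between two parameter frames.

    Each parameter is interpolated on its own, which relies on the
    parameters being independent of one another.
    """
    frames = []
    for i in range(1, steps + 1):
        t = i / (steps + 1)  # fraction of the way from left to right
        frames.append(
            {name: (1 - t) * left[name] + t * right[name] for name in left}
        )
    return frames


# Hypothetical formant values (Hz) on either side of a diphone boundary.
left = {"F1": 700.0, "F2": 1200.0}
right = {"F1": 500.0, "F2": 1500.0}

midpoint = interpolate_frames(left, right, 1)[0]
assert midpoint == {"F1": 600.0, "F2": 1350.0}
```

Every intermediate frame is itself a usable set of formant parameters, so the transition can be made as gradual as needed.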
However, formant synthesis parameter values have not been the most common format for storing diphone units. More commonly, diphones have been stored as linear predictive coding (LPC) coefficients (e.g., see #34 at https://bit.ly/30n0V6V) or as waveform data stored in a format amenable to fundamental frequency (F0) and duration modification using an algorithm like Pitch Synchronous OverLap Add (PSOLA; Moulines and Charpentier, 1990).
As is often true with speech processing, the most natural sounding of these formats in terms of voice quality would be waveform data because that is the least processed. LPC coding preserves much of the speaker identity information, but some voice quality may be lost in processing. Formant synthesis generally produces the least natural-sounding audio. Unfortunately, waveform data are the least compact storage format and also the most difficult to work with in that they afford little opportunity to adjust for discontinuities at diphone boundaries.
The phonetics to stored units (Figure 2, blue) is the path taken from the text to phonetics component for diphone synthesis. There are a relatively small number of diphones for any language. For example, Dixon and Maxey (1968) based their inventory on a total of 41 phonemes, giving a theoretical maximum of 41² = 1,681 possible diphones. Consequently, the conversion from phonetics to stored units amounts to simply looking up the needed sequence of diphone units.
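The inventory size and the lookup step can both be checked with a short Python sketch. Everything here is a stand-in: the phoneme labels p0 through p40 are placeholders rather than the actual Dixon and Maxey (1968) inventory, the stored units are dummy strings instead of parameter data, and select_units is a hypothetical name.

```python
from itertools import product

# 41 placeholder phoneme labels (not the real inventory).
phonemes = [f"p{i}" for i in range(41)]

# Every ordered pair of phonemes is a candidate diphone.
all_pairs = list(product(phonemes, repeat=2))
assert len(all_pairs) == 41 ** 2 == 1681

# Dummy inventory: each diphone maps to a string standing in for stored data.
inventory = {(a, b): f"unit_{a}_{b}" for a, b in all_pairs}


def select_units(phoneme_seq, inventory):
    """Look up the stored unit for each adjacent phoneme pair, in order."""
    return [inventory[(a, b)] for a, b in zip(phoneme_seq, phoneme_seq[1:])]


units = select_units(["p0", "p5", "p0"], inventory)
assert units == ["unit_p0_p5", "unit_p5_p0"]
```

Because the table is small and complete, unit selection needs no search or scoring; a sequence of n phonemes simply yields n - 1 dictionary lookups.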
The selected diphone units can then be passed to the concatenate units (Figure 2, blue) component that concatenates
Figure 3. Illustration of phonemes versus diphones. Top, spectrogram of the word bob. Dark bands, regions of high energy, corresponding to formants. Middle, acoustic waveform. Bar below waveform, phoneme locations ([b], [a], and [b]). Bottom bar, locations of the two diphone regions ([ba] and [ab]).
Spring 2022 • Acoustics Today 17
