ciently, and also allow efficient representation of the state of one's vocal tract in the form of a vector with on the order of ten parameters per frame of speech data. Ten is sufficient, as speech typically has one resonance per kHz and each resonance is specified by two complex numbers. Each frame of speech is typically 10 ms in an ASR application, which represents a compromise duration, being long enough to contain a sufficient number of samples, while avoiding averaging out dynamic movements of the vocal tract over long periods. It may be mentioned that LP analysis is frequently used to provide a concise account of the formants for phonetic analysis, which is much less labor-intensive than the spectrum analysis procedures outlined in the previous section, but may also be considerably less accurate.
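As an illustration of how few parameters such frame-based analysis requires, the following is a minimal sketch of linear-prediction analysis for a single frame, using the autocorrelation method and the Levinson-Durbin recursion. The 8 kHz sampling rate, 10 ms frame, Hamming window, and prediction order of 10 (roughly two coefficients per expected resonance) are illustrative assumptions, not settings prescribed by the article.

import numpy as np

def lpc_coefficients(frame, order=10):
    # Window the frame, then form autocorrelation values for lags 0..order
    frame = frame * np.hamming(len(frame))
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:])
                  for k in range(order + 1)])

    # Levinson-Durbin recursion for the prediction-error filter 1 + a1 z^-1 + ...
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for model order i
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        a[1:i + 1] += k * a[i - 1::-1][:i]   # a_j <- a_j + k * a_(i-j)
        err *= (1.0 - k * k)
    return a, err

fs = 8000                                    # assumed sampling rate
frame = np.random.randn(int(0.010 * fs))     # placeholder for 10 ms of real speech
a, residual = lpc_coefficients(frame, order=10)

With an 8 kHz bandwidth of 4 kHz, one expects about four resonances, so ten coefficients comfortably capture the pole pairs plus overall spectral tilt.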
ASR8 is an example of a pattern recognition task, where one maps an auditory object (the speech signal) into a classification (text). This involves data compression, as the initial object is usually represented with an extensive bit sequence, while the output classes are far smaller in number. For speech, typical compressions are many orders of magnitude, from, say, 32 kbits (for a half-second uttered word using the telephone's logarithmic pulse-code modulation: 8000 samples/s at 8 bits/sample for 0.5 s) down to as little as one bit (e.g., a simple vocabulary of yes versus no), a few bits (for a digit recognition application), or perhaps 10-20 bits (roughly the base-2 logarithm of the vocabulary size, to accommodate the hundreds of thousands of possible words in a given language). Ideally, any data compression would eliminate less useful information, while retaining the detail needed to discriminate among the relevant classes of words in an allowed vocabulary (e.g., assuming that a speaker utters one word at a time). Early ASR simply used the Fourier transform spectrum mentioned above, but this achieves little significant compression, only allowing us to discard the phase.
The cepstrum is defined as the inverse Fourier transform of the logarithm of the power spectrum of input speech.9 The phase is discarded as being of little use so far in ASR, as it typically reflects details of three-dimensional air flow in the vocal tract (VT), while ASR is concerned with the shape of the VT, as the latter reflects which sounds and words one is uttering. The VT shape correlates strongly with the positions of peaks in the speech spectra, e.g., one's F1 (first "formant" or resonance) varies inversely with tongue height, while F2 varies with front-back tongue location and lip rounding. The logarithmic amplitude compression relates to the normal nonlinear compression that appears in much of human perception, whether touch, sound or vision. The cepstrum is a kind of "spectrum of the spectrum," and as such it separates out information about the peaks in the power spectrum while providing a set of decorrelated components, called cepstral coefficients, representing the signal. See the above section on reassignment for some new developments on harnessing the phase to improve precision in determining the resonance locations.
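As a concrete sketch of this definition, the following computes the real cepstrum of a single frame: Fourier transform, discarding of phase, logarithmic compression, and an inverse transform. The frame length, window, small offset to avoid taking the log of zero, and the choice of keeping the first 16 coefficients are illustrative assumptions rather than settings from the article.

import numpy as np

def real_cepstrum(frame, n_keep=16):
    # Fourier transform of the windowed frame; the phase is then discarded
    spectrum = np.fft.rfft(frame * np.hamming(len(frame)))
    # Logarithm of the power spectrum (small offset avoids log(0))
    log_power = np.log(np.abs(spectrum) ** 2 + 1e-12)
    # Inverse transform back to the "quefrency" domain
    cepstrum = np.fft.irfft(log_power)
    # The low-quefrency coefficients describe the slowly varying spectral
    # envelope (vocal-tract resonances); higher ones carry the harmonic
    # (glottal) fine structure
    return cepstrum[:n_keep]

fs = 8000
frame = np.random.randn(int(0.010 * fs))     # placeholder for 10 ms of real speech
c = real_cepstrum(frame)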
For modern ASR, the most common data representation is the set of mel-frequency cepstral coefficients (MFCCs). For ASR purposes, the LP and standard cepstral representations have a weakness in their treatment of all frequencies as equally important, as does the Fourier transform with its fixed bandwidth, unlike, say, wavelets. The mel scale is a non-linear mapping of physical frequency to a perceptual scale, following the logarithmic arrangement of frequency "bins" along the basilar membrane in the human cochlea. A priori, it is not obvious that ASR needs to follow such aspects of human hearing, but empirical evidence of superior recognition accuracy with MFCCs10 has led to their acceptance in ASR.
The final step of the MFCC, the inverse Fourier transform, allows capturing the essence of the VT shape in as few as 10-16 parameters, as the cepstrum effectively converts convolution to addition. Speech is often viewed as the filtering of a glottal waveform by the frequency response of the VT; hence its spectrum is the product of that of the glottis and the VT. As the log of a product is the sum of its log components, the cepstrum conveniently consists of a linear combination of the desired VT component and the undesired glottal one. Furthermore, the former is compressed in the low end (as it represents spectral envelope details, i.e., the resonances, which vary slowly in frequency), in terms of the first 10-16 samples, while the latter is at the high end of the cepstrum, as it reflects the harmonics, which demonstrate rapid amplitude variation at multiples of the speaker's vocal cord vibration rate. As ASR seeks VT information and not glottal detail, the cepstrum is a convenient way to discard the glottal effects by simply employing the first few coefficients. Figure 5 shows a word represented with a spectrogram, and a similarly processed image which lays out the mel-frequency cepstral coefficients along the frequency axis. It is striking how
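To make the full MFCC recipe concrete, here is a compact sketch of one common implementation for a single frame: power spectrum, a bank of triangular filters spaced evenly on the mel scale, logarithmic compression, and a discrete cosine transform, which plays the role of the inverse Fourier transform described above. The 26 filters, 13 retained coefficients, 256-point FFT, and the particular mel formula (2595 * log10(1 + f/700)) are common choices assumed here, not specifics given in the article.

import numpy as np

def hz_to_mel(f):
    # One common mel mapping; other constants are also in use
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, fs, n_filters=26, n_ceps=13, n_fft=256):
    # Power spectrum of the windowed frame (phase discarded)
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2

    # Triangular filters with centers equally spaced on the mel scale
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bin_edges = np.floor((n_fft + 1) * mel_to_hz(mel_edges) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        lo, mid, hi = bin_edges[i], bin_edges[i + 1], bin_edges[i + 2]
        fbank[i, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
        fbank[i, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)

    # Logarithmic compression of the filterbank energies
    log_energies = np.log(fbank @ spectrum + 1e-12)

    # DCT-II of the log energies (the "inverse transform" step); keeping
    # only the first n_ceps coefficients retains the vocal-tract envelope
    # and discards the glottal fine structure
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), n + 0.5) / n_filters)
    return basis @ log_energies

fs = 8000
frame = np.random.randn(int(0.010 * fs))     # placeholder for 10 ms of real speech
coeffs = mfcc(frame, fs)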