Page 34 - Acoustics Today Summer 2011
Fig. 5. English word [map] shown in a spectrogram and a corresponding layout of the mel-frequency cepstrum coefficient matrix.
the phonemes in the word are separated by sharp boundaries in the MFCC “spectrogram,” a quality that is no doubt very helpful for the speech recognition pattern-matching which would normally use such a representation.
Other parameters have been used for ASR in the past, e.g., fundamental frequency (F0), energy, zero-crossing rate, and autocorrelation. They are still used in many other speech applications, such as coding and speaker verification. To estimate F0, one looks for cues to the periodic vibration rate of the vocal cords in the corresponding speech signal. Of course, speech is never truly periodic, but only quasi-periodic, as a speaker alters F0 for a myriad of purposes (syntactic structuring, emphasis, tones in a tone language, emotion). The major point of excitation of the VT occurs at vocal cord closure, after which the speech energy decays. So simple peak picking of the speech signal s(n) is a direct way to estimate F0. However, signal processing enhances accuracy; examples include using autocorrelation, which leads to clearer peaks than found in the original s(n). This owes much to the fact that phase is eliminated in the autocorrelation.
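As an illustration of the autocorrelation approach described above, the following is a minimal sketch rather than a production pitch tracker. The function names, the 60-400 Hz search range, and the synthetic three-harmonic test frame are all assumptions made for this example. Note that the harmonics are given arbitrary phases, yet the autocorrelation peak still falls at the pitch period, which reflects the elimination of phase mentioned in the text.

```python
import math

def autocorrelation(frame, max_lag):
    """Return r[k] = sum over n of frame[n] * frame[n + k], for k = 0..max_lag."""
    n = len(frame)
    return [sum(frame[i] * frame[i + k] for i in range(n - k))
            for k in range(max_lag + 1)]

def estimate_f0(frame, sample_rate, f0_min=60.0, f0_max=400.0):
    """Estimate F0 in Hz from the lag of the largest autocorrelation value.

    The search range (f0_min, f0_max) is an assumed, illustrative choice.
    """
    min_lag = int(sample_rate / f0_max)   # shortest period considered
    max_lag = int(sample_rate / f0_min)   # longest period considered
    r = autocorrelation(frame, max_lag)
    best_lag = max(range(min_lag, max_lag + 1), key=lambda k: r[k])
    return sample_rate / best_lag

# Synthetic "voiced" frame: 125 Hz fundamental plus two harmonics
# with arbitrary phase offsets (hypothetical test signal).
sample_rate = 8000
frame = [math.sin(2 * math.pi * 125 * n / sample_rate)
         + 0.6 * math.sin(2 * math.pi * 250 * n / sample_rate + 1.0)
         + 0.4 * math.sin(2 * math.pi * 375 * n / sample_rate + 2.0)
         for n in range(512)]

f0 = estimate_f0(frame, sample_rate)
print(f"Estimated F0: {f0:.1f} Hz")
```

Real speech would typically call for windowing, a voicing decision, and some smoothing of the F0 track across frames, none of which is attempted here.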
Energy, being simply the sum of the squares of a sequence of speech samples over a frame, is a basic useful measure for many speech applications, such as voice activity detection. Use of the square operation, rather than some other power, is justified heuristically and by its use in Parseval’s Theorem (stating that the energy in a signal is completely specified by the squared magnitude of its Fourier transform). Very-low-rate speech coders transmit energy, F0, and the LP multiplier coefficients every frame, at typical rates of 2.4 kbits/s, although modern cell phones also send information about phase, leading to higher quality at rates of 8-11 kbits/s.
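The frame energy and its relation to Parseval's Theorem can be checked numerically. The sketch below uses a direct (slow) discrete Fourier transform so that it needs only the standard library; the function names and the eight-sample test frame are assumptions for illustration. With the unnormalized DFT used here, the sum of squared magnitudes must be divided by the frame length to match the time-domain energy.

```python
import cmath

def frame_energy(frame):
    """Sum of squared samples over one frame."""
    return sum(x * x for x in frame)

def dft(frame):
    """Direct O(N^2) discrete Fourier transform (unnormalized)."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n))
            for k in range(n)]

# Hypothetical eight-sample frame
frame = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]

time_energy = frame_energy(frame)
# Parseval for the unnormalized DFT: sum |X[k]|^2 / N equals sum x[n]^2
freq_energy = sum(abs(X) ** 2 for X in dft(frame)) / len(frame)

print(time_energy, freq_energy)  # the two values agree up to rounding error
```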
The zero-crossing rate is a very simple measure of spectral prominence. Just by counting the number of times the speech signal changes algebraic sign (e.g., as the air pressure in front of the mouth goes from exceeding the ambient atmospheric pressure to being less), one gets a good estimate of what frequency dominates the energy. For example, a sine wave has two crossings per period.
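Since a sine wave has two crossings per period, a crossing count can be turned into a rough frequency estimate by halving the number of crossings per second. The following sketch assumes a pure 200 Hz tone sampled at 8 kHz; the function names and the convention of treating exact zeros as positive are choices made for this example.

```python
import math

def zero_crossings(frame):
    """Count sign changes between consecutive samples (zero counts as positive)."""
    signs = [x >= 0 for x in frame]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)

def dominant_frequency(frame, sample_rate):
    """Rough dominant frequency in Hz: crossings per second divided by two."""
    duration = len(frame) / sample_rate
    return zero_crossings(frame) / (2.0 * duration)

# Hypothetical test tone: 200 Hz sine, 0.1 s long, with a small phase offset
# so that no sample lands exactly on a zero.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 200 * n / sample_rate + 0.3) for n in range(800)]

print(zero_crossings(tone))                   # crossings within the frame
print(dominant_frequency(tone, sample_rate))  # close to 200 Hz for this tone
```

For real speech, which contains many frequency components, the result indicates only where the energy is concentrated, as the text notes, not a precise spectral peak.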
All the above measures have found utility in speech applications for purposes of data compression, converting a very high bit rate signal sequence into more efficient parameter sets. In ASR and coding, further processing is possible by the use of delta parameters, i.e., calculating the difference between successive samples in time. Delta modulation coders exploit the fact that most audio signals are dominated by low frequency energy. One still needs to preserve higher frequencies, but D/A quantization noise is less when using differenced parameters. For ASR, we use the difference of successive frames to represent the velocity of vocal tract movements, placing such information (and, in some cases, a double difference, to model acceleration) into a single vector per frame of data, so as to accommodate the first-order assumption of the standard hidden Markov models which provide the typical ASR search method. Yet another example of differencing is that of Cepstral Mean Subtraction (CMS), in which one may subtract the long-term average spectra from that in each frame, before applying the data to ASR. Just as the similar Dolby processing can suppress tape hiss, CMS attempts to suppress channel and noise characteristics, while preserving the more relevant dynamic aspects of vocal tract movement for speech recognition.
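The frame-level differencing and mean subtraction described above can be sketched as follows. This is an illustrative simplification: the function names are hypothetical, the features are taken to be per-frame cepstral vectors, the delta is a plain difference of successive frames as the text describes (practical systems often fit a regression over several frames instead), and the mean is taken over the whole utterance.

```python
def cepstral_mean_subtraction(frames):
    """Subtract the per-coefficient mean over all frames from each frame."""
    n_frames = len(frames)
    n_coeffs = len(frames[0])
    mean = [sum(f[i] for f in frames) / n_frames for i in range(n_coeffs)]
    return [[f[i] - mean[i] for i in range(n_coeffs)] for f in frames]

def delta(frames):
    """Difference of successive frames; the first frame gets zeros (assumed convention)."""
    out = [[0.0] * len(frames[0])]
    for prev, cur in zip(frames, frames[1:]):
        out.append([c - p for p, c in zip(prev, cur)])
    return out

def stack_features(frames):
    """Concatenate static, delta (velocity), and double-delta (acceleration) per frame."""
    d = delta(frames)
    dd = delta(d)
    return [s + v + a for s, v, a in zip(frames, d, dd)]

# Hypothetical three-frame utterance with two coefficients per frame
frames = [[1.0, 2.0], [2.0, 4.0], [4.0, 3.0]]

# A fixed channel adds the same offset to every frame (in the cepstral domain)
channel = [5.0, -1.0]
distorted = [[x + c for x, c in zip(f, channel)] for f in frames]

clean_cms = cepstral_mean_subtraction(frames)
distorted_cms = cepstral_mean_subtraction(distorted)  # offset is removed

vectors = stack_features(clean_cms)  # one 6-dimensional vector per frame
```

In this toy case the mean-subtracted features are the same with and without the constant channel offset, and the delta features are unaffected by it as well, which is the sense in which differencing suppresses static channel characteristics while keeping the dynamics.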
Signal processing in hearing aids
Compression
One of the symptoms of sensorineural hearing loss is a shift in hearing thresholds that produces an overall reduction in loudness, especially for quiet sounds, and an abnormal growth of loudness that causes loud sounds to be very loud, and quiet sounds to be inaudible. The loss of dynamic range compression provided by a healthy cochlea produces a condition known as recruitment. Recruitment reduces the dynamic range over which hearing-impaired listeners can perceive sound, so listeners with hearing loss may find quiet sounds inaudible and loud sounds painful. Consequently, straightforward linear amplification, making all sounds louder, is not an effective treatment for most patients.
Modern digital hearing aids apply wide dynamic range compression to treat the abnormal growth of loudness due to recruitment. Compression, a form of automatic gain control, amplifies quiet sounds more than loud sounds, reducing the overall dynamic range of the processed sound, and allowing