Acoustics Today Summer 2011


SIGNAL PROCESSING IN SPEECH AND HEARING TECHNOLOGY
Sean A. Fulop
Dept. of Linguistics, California State University, Fresno, Fresno, California 93740
Kelly Fitz
Signal Processing Group, Starkey Laboratories, Eden Prairie, Minnesota 55344
Douglas O’Shaughnessy
National Institute of Scientific Research, University of Quebec, Quebec City, Laval, Montreal, H5A 1K6, Canada
Speech science and technology would scarcely exist today without acoustic signal processing. The same can be said of hearing assistance technology, including hearing aids and cochlear implants. This article will highlight key contributions made by signal processing techniques in the disparate realms of speech analysis, speech recognition, and hearing aids. We can certainly not exhaustively discuss the applications of signal processing in these areas, much less other related fields that are left out entirely, but we hope to provide at the very least a sampling of the wide range of processing techniques that are brought to bear on the various problems in these subfields.
While speech itself is an analog signal (or time sequence) of air pressure variations resulting from puffs of air leaving one’s lungs, modulated by the vibrations of one’s vocal cords and filtered by one’s vocal tract, such a vocal signal is normally digitized in most modern applications, including normal telephone lines. The analog-to-digital (A/D) conversion is needed for computer processing, as the analog speech signal (continuous in both time and amplitude), while suitable for one’s ears, is most efficiently handled as a sequence of digital bits. A/D conversion has two parameters: samples/second and bits/sample. The former is specified by the Nyquist rate, twice the highest audio frequency to be preserved in the speech signal, assuming some analog filter suppresses the weaker energy at relatively high frequencies in speech (e.g., above 3.3 kHz in telephone applications, using 8000 samples/s). Like most audible sounds, speech is dominated by energy in the lowest few kHz, but pertinent energy exists to at least 20 kHz, which is why high-quality recordings, such as CDs, sample at rates up to 44.1 kHz. However, speech can be reasonably intelligible even when low-pass filtered to 4 kHz, as the telephone amply demonstrates. Typical speech applications use 16-bit A/D accuracy, although basic logarithmic coding in the telephone shows that 8-bit accuracy can be adequate in many applications, which include automatic speech recognition, where the objective is a mapping into text, rather than a high-quality audio signal to listen to or analyze in depth.
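As a rough illustration of the two A/D parameters, the sketch below computes a Nyquist rate and applies a logarithmic 8-bit quantizer. It assumes the continuous mu-law formula with mu = 255 as a stand-in for "basic logarithmic coding"; it is not a bit-exact telephone codec, and the function names are hypothetical.

```python
import math

def nyquist_rate(max_freq_hz):
    """Minimum sampling rate (samples/s) that preserves content up to max_freq_hz."""
    return 2.0 * max_freq_hz

def mu_law_encode(x, mu=255, bits=8):
    """Compress a sample x in [-1, 1] logarithmically, then quantize it uniformly.

    Assumption: continuous mu-law companding, used here only for illustration.
    """
    x = max(-1.0, min(1.0, x))                       # clip to the valid range
    y = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    levels = 2 ** bits
    return int(round((y + 1.0) / 2.0 * (levels - 1)))  # integer code 0 .. levels-1

def mu_law_decode(code, mu=255, bits=8):
    """Invert mu_law_encode, up to quantization error."""
    levels = 2 ** bits
    y = 2.0 * code / (levels - 1) - 1.0              # back to [-1, 1]
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)

# A 4 kHz speech band calls for at least 8000 samples/s.
print(nyquist_rate(4000.0))

# A quiet sample survives 8-bit logarithmic coding with a small error.
quiet = 0.01
print(mu_law_decode(mu_law_encode(quiet)))
```

The logarithmic curve spends more of the 256 codes on small amplitudes, which is one way to see why 8 bits can be adequate for speech even though uniform quantization typically uses 16.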
“Multiband compression is the core of modern digital hearing aid signal processing, and is the primary tool for restoring audibility and comfort to patients with hearing loss.”
Speech spectrum analysis
In phonetics and speech science a commonly pursued aim is to analyze the spectrum of speech as completely as possible, to obtain information about the speech articulation and the specific auditory attributes which characterize speech sounds or “phonemes” (consonants and vowels). Spectrum analysis can further our understanding of the variety of sounds in language (a linguistic pursuit), and can also further our understanding of the fundamental nature of normal and disordered speech.1
Fourier spectrum and spectrogram
The simplest form of spectrum analysis is the Fourier power spectrum with which readers are probably familiar. Although this is not a function of time, and therefore involves an assumption of stationarity across the signal frame analyzed, it is still of some utility for speech analysis. The Fourier spectrum is particularly useful for examining the spectral characteristics of speech sounds whose steady state is very important, such as fricative consonants (e.g., ‘s’ and ‘sh’). Because such sounds are chiefly noisy, it can be useful to apply statistical techniques such as ensemble averaging to examine their spectra.
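The effect of ensemble averaging on a noisy spectrum can be sketched as follows. This is a minimal illustration under stated assumptions: synthetic white Gaussian noise stands in for a fricative, the frame length and frame count are arbitrary, and all function names are hypothetical. A naive DFT is used so the example needs only the standard library.

```python
import cmath
import math
import random

def power_spectrum(frame):
    """Power spectrum |X[k]|^2 / N for bins k = 0 .. N/2, via a naive DFT."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        x_k = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(x_k) ** 2 / n)
    return spectrum

def ensemble_average(frames):
    """Average the power spectra of several frames, bin by bin."""
    spectra = [power_spectrum(f) for f in frames]
    return [sum(bin_values) / len(spectra) for bin_values in zip(*spectra)]

def relative_spread(spectrum):
    """Standard deviation across bins divided by the mean (a roughness measure)."""
    mean = sum(spectrum) / len(spectrum)
    var = sum((p - mean) ** 2 for p in spectrum) / len(spectrum)
    return math.sqrt(var) / mean

# Assumption: 50 frames of 64 white-noise samples as a stand-in for a noisy sound.
rng = random.Random(0)
frames = [[rng.gauss(0.0, 1.0) for _ in range(64)] for _ in range(50)]

single = power_spectrum(frames[0])
averaged = ensemble_average(frames)

# The averaged spectrum fluctuates less from bin to bin than any single frame's.
print(relative_spread(single), relative_spread(averaged))
```

A single power spectrum of noise is erratic from bin to bin; averaging over the ensemble reduces that fluctuation so the underlying spectral shape is easier to see.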
The natural extension of the Fourier transform to the time-frequency plane is provided by the short-time Fourier transform (STFT), which, in digital form, is essentially a time-frequency grid of complex numbers. Each frequency column of this matrix at a given time point is the discrete Fourier transform of the analysis window on the signal at that time. The log magnitude of the STFT matrix is traditionally called the spectrogram, and has long been a popular means of examining the spectrum of speech as it changes through time. Figure 1 shows a speech signal waveform, together with a very brief window cut from the signal during the vowel. It is this type of short window, suitably tapered (e.g., by multiplication with a Gaussian function), that can be used to create a spectrogram using power spectra of successive overlapped windows as shown in Fig. 2.
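The construction just described can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: the test signal (a tone that steps from 500 Hz to 1500 Hz), the 8000 samples/s rate, the 64-sample window, the 16-sample hop, and the Gaussian width are all assumptions, and the function names are hypothetical.

```python
import cmath
import math

def gaussian_window(n, sigma_frac=0.15):
    """Gaussian taper of length n; sigma is a fraction of the window length."""
    mid = (n - 1) / 2.0
    sigma = sigma_frac * n
    return [math.exp(-0.5 * ((i - mid) / sigma) ** 2) for i in range(n)]

def stft(signal, win_len, hop):
    """Return a list of columns; each column is the DFT (bins 0 .. win_len/2)
    of one tapered, overlapped window of the signal."""
    win = gaussian_window(win_len)
    columns = []
    for start in range(0, len(signal) - win_len + 1, hop):
        seg = [signal[start + i] * win[i] for i in range(win_len)]
        col = [
            sum(seg[t] * cmath.exp(-2j * math.pi * k * t / win_len)
                for t in range(win_len))
            for k in range(win_len // 2 + 1)
        ]
        columns.append(col)
    return columns

def spectrogram_db(stft_matrix, floor=1e-12):
    """Log magnitude (in dB) of every STFT cell; floor avoids log of zero."""
    return [[20.0 * math.log10(max(abs(c), floor)) for c in col]
            for col in stft_matrix]

def peak_bin(column):
    """Index of the strongest frequency bin in one spectrogram column."""
    return max(range(len(column)), key=lambda k: column[k])

# Assumed test signal: 500 Hz for the first half, 1500 Hz for the second half.
fs = 8000
signal = [math.sin(2 * math.pi * (500 if n < 512 else 1500) * n / fs)
          for n in range(1024)]

spec = spectrogram_db(stft(signal, win_len=64, hop=16))

# Bin spacing is fs / win_len = 125 Hz, so the peak moves from bin 4 to bin 12.
print(peak_bin(spec[0]) * fs / 64, peak_bin(spec[-1]) * fs / 64)
```

A shorter window would track the frequency step more sharply in time at the cost of coarser frequency bins, which is the usual trade-off when choosing the analysis window.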