Figure 1. Speech sounds that are acoustically similar (like the first consonants in the words “fin” and “thin”) are visually very distinct. Sounds that are visually similar (like “s” and “z”) are acoustically quite distinct, as seen in the stark differences in the low-frequency region of the spectra (bottom row).
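For readers who want to see this acoustic contrast themselves, the sketch below compares the low-frequency energy of two consonant recordings, where the voiced "z" should show energy that the voiceless "s" lacks. It is a minimal illustration, not an analysis from the article; the file names and the 500-Hz cutoff are hypothetical and assume short mono WAV clips.

```python
# A minimal sketch comparing power spectra of two consonant clips.
# File names and the 500-Hz band are illustrative assumptions.
import numpy as np
from scipy.io import wavfile

def power_spectrum(path):
    """Return frequencies and power spectrum of a short mono WAV clip."""
    rate, audio = wavfile.read(path)
    audio = audio.astype(float) * np.hanning(len(audio))  # taper the clip
    spectrum = np.abs(np.fft.rfft(audio)) ** 2
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / rate)
    return freqs, spectrum

freqs_s, spec_s = power_spectrum("s.wav")  # hypothetical /s/ recording
freqs_z, spec_z = power_spectrum("z.wav")  # hypothetical /z/ recording

# Sum the energy below ~500 Hz, where voicing energy for /z/ concentrates
low_s = freqs_s < 500
low_z = freqs_z < 500
print("low-frequency energy, /z/ vs /s/:",
      spec_z[low_z].sum(), "vs", spec_s[low_s].sum())
```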
sights of speech is acquired in infancy (Kuhl and Meltzoff, 1982), including the ability to detect whether a silent talking face is speaking the emergent native language of the infant observer (Weikum et al., 2007).
Visual cues are so powerful that it can be hard to suppress them even when they are known to be phony. The most well-known example of this was made famous by a series of experiments by McGurk and MacDonald (1976) in which the audio and video tracks of a single syllable were mismatched (e.g., the audio of /pa/ combined with the video of /ka/). In this situation, the visual stream will “fuse” with the auditory stream, yielding perception of /ta/, which is intermediate between the two source signals, even if the listener knows she is being tricked. This phenomenon has come to be known as the “McGurk effect,” and many different variations of it have emerged in the literature (with many examples available online on YouTube, e.g., acousticstoday.org/mcgurk and acousticstoday.org/mcgurk2). Even when the video is clearly from another talker (including mismatching stimuli from women and men; Green et al., 1991), the effect can be hard to suppress. There is evidence to suggest that auditory-visual fusion can be somewhat weaker in languages other than English (Sekiyama and Tohkura, 1993), but for reasons that are not totally understood. Auditory-visual fusion is, however, a much stronger effect for people who have a hearing impairment (Walden et al., 1990), consistent with their relatively greater reliance on lip reading. There are also reports of speech perception being affected by touching the face of a talker (Fowler and Dekle, 1991), which, although not a typical situation, demonstrates that we are sensitive to multiple kinds of information when perceiving speech.
Linguistic Knowledge and Closure
The same speech sound can be heard and recognized differently depending on its surrounding context. Words that are spoken just before or just after a speech sound help us interpret words in a sensible way, especially when the acoustic signal is unclear. Consider how the word “sheets” is entirely unpredictable in the sentence “Ashley thought about the sheets.” Ashley could have been thinking about anything, and there’s no reason to predict the word “sheets.” Conversely, if there is some extra context for that word, such as “She made the bed with clean...,” then the listener becomes more accurate at recognizing “sheets,” presumably because of the large amount of context available to help figure out what word fits in that spot. That context-related benefit can work either forward (prediction of the word based on previous cues) or backward (recovery of the word based on later information) in time. In either case, we are driven to conclude that the acoustics of the target word “sheets” can be studied in the finest detail, yet they cannot fully explain our pattern of perception; we must also recognize the role of prediction and inference. Taking fragments of perception and transforming them into meaningful perceptions can be called “perceptual closure,” consistent with early accounts of Gestalt psychology.
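One common way to formalize this kind of context effect, though not one the article itself proposes, is Bayes’ rule: the listener’s belief in a candidate word combines how well it matches the acoustics with how predictable it is from context. The sketch below uses made-up numbers purely to illustrate how context can break an acoustic tie.

```python
# A minimal sketch of context-driven word recognition via Bayes' rule.
# All probabilities are invented for illustration; this is not a model
# from the article.
candidates = ["sheets", "cheese"]

# How well each word matches an ambiguous acoustic signal (a tie)
acoustic_likelihood = {"sheets": 0.5, "cheese": 0.5}

# How predictable each word is after "She made the bed with clean..."
context_prior = {"sheets": 0.9, "cheese": 0.1}

# Posterior is proportional to likelihood times prior, then normalized
posterior = {w: acoustic_likelihood[w] * context_prior[w] for w in candidates}
total = sum(posterior.values())
posterior = {w: p / total for w, p in posterior.items()}

print(posterior)  # context resolves the tie: {'sheets': 0.9, 'cheese': 0.1}
```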
There was once a patient with hearing loss who came into the laboratory and repeated the “... clean sheets” sentence back as “She made the bagel with cream cheese.” This was an understandable mistake based on the acoustics, as shown in Figure 2; the amplitude envelopes of these sentences are a close match. If we were to keep track of individual errors, it would seem that “bed” became “bagel,” “clean” became “cream,” and “sheets” became “cheese.” These are errors that make sense phonetically and acoustically. However, it is rather unlikely that the listener actually misperceived each of these three words. It is possible instead that once “bed” became “bagel,” the rest of the ideas simply propagated forward to override whatever the following words were. Figure 2 illustrates the mental activity that might be involved in making these kinds of corrections.
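Curious readers can check this kind of envelope similarity for themselves. The sketch below extracts a smoothed amplitude envelope from each of two recordings and correlates them; the file names are hypothetical, and both files are assumed to be mono WAV recordings.

```python
# A minimal sketch comparing the amplitude envelopes of two sentences.
# File names are hypothetical; mono WAV input is assumed.
import numpy as np
from scipy.io import wavfile
from scipy.signal import hilbert, resample

def amplitude_envelope(path, n_points=200):
    """Extract an amplitude envelope via the Hilbert transform,
    downsampled to n_points so envelopes of different durations
    can be compared point by point (a crude time normalization)."""
    rate, audio = wavfile.read(path)
    audio = audio.astype(float)
    envelope = np.abs(hilbert(audio))   # instantaneous amplitude
    return resample(envelope, n_points)

env_a = amplitude_envelope("clean_sheets.wav")  # "She made the bed with clean sheets"
env_b = amplitude_envelope("cream_cheese.wav")  # "She made the bagel with cream cheese"

# Pearson correlation of the two envelopes: values near 1 indicate
# closely matched amplitude contours, as in Figure 2
r = np.corrcoef(env_a, env_b)[0, 1]
print(f"envelope correlation: {r:.2f}")
```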
A clever example of the listener’s knowledge overriding the acoustics was published by Herman and Pisoni (2003), who replaced several consonants in a sentence with different con-