background (Lachs and Pisoni, 2004). These sine-wave speech and point-light stimuli are, no doubt, very odd for perceivers. However, despite lacking what is typically thought of as useful audio and visual information for identifying talkers (e.g., fundamental pitch, voice timbre, lips and other facial features), these stimuli can convey both usable speech and talker information (e.g., Remez et al., 1997; Rosenblum et al., 2002).
It is thought that although severely degraded, these stimuli do retain talker-specific phonetic information, including articulatory style. If true, then perceivers may be able to match faces to voices by attending to the remaining articulatory-style information present in these odd stimuli. Moreover, it may be this talker-specific phonetic information, available in both modalities, that perceivers are learning as they become familiar with a talker. If so, this may help explain the crossmodal talker facilitation effects. Perceivers may become familiar with, and adept at using, talker-specific phonetic information based on experience with one modality. When they are then presented with the same talker-specific information in the other modality, they can use that experience to better recognize the talker and what they are saying.
From this perspective, speech learning involves becoming more adept at attending to speech properties that are amodal: articulatory properties that can be conveyed through multiple modalities. This is a striking claim that prompts multiple questions. For example, what form might these amodal informational parameters take if they can be conveyed in light and sound? The answer may be what has come to be known as supramodal information.
Supramodal Information
On the surface, auditory and visual speech information seem very different. Although auditory speech information is necessarily revealed over time, visual speech is often construed as more spatial in nature (visible lip shapes, jaw position). But research with sine-wave and point-light speech stimuli (along with other work) has revealed another way of considering the information (for a review, see Rosenblum et al., 2016). Recall that both types of stimuli retain only the more global, time-varying dimensions of their respective signals, yet are still effective at conveying speech and talker information. When considered in this way, the salient informational forms in each modality are more similar.
Consider the higher order information for a very common speech production occurrence: reversal of the articulators as in the production of “aba.” As the jaw and lower lip rise, close the mouth, and then reverse, there is an accompanying reversal in optical (visible) structure (Summerfield, 1987). Importantly, this visible reversal is also accompanied by a reversal in the amplitude and spectral structure of the resultant acoustic signal. Furthermore, the articulatory, optical, and acoustic reversals all share the same temporal parameters. Thus, at this level of abstraction, the audible and visible time-varying information takes the same form: a form known as supramodal information (e.g., Rosenblum et al., 2016). The brain’s sensitivity to supramodal information may account for many multisensory speech phenomena.
Supramodal information may also account for the surprisingly high correlations observed between the signals (e.g., Munhall and Vatikiotis-Bateson, 2004). Detailed measures of facial movements have been shown to be highly correlated with amplitude and spectral changes in the acoustic signal. Part of the reason for the high correlations is the degree to which visible movements can inform about deeper, presumably “hidden,” vocal tract actions. Parameters of vocal intonation (vocal pitch changes) are actually correlated with, and therefore visible through, head-nodding motions. Similarly, the deep vocal tract actions of intraoral pressure changes and lexical tone (vowel pitch changes to mark words, as in Mandarin) can be perceived by novice lipreaders (Burnham et al., 2000).
The strong correlations between visible and audible speech signals have allowed usable visible speech to be animated directly from the acoustic signal (e.g., Yamamoto et al., 1998) and audible speech to be synthesized from the parameters of visible speech movements (e.g., Yehia et al., 2002).
Returning to speech perception, the supramodal information thesis states that the brain can make use of this higher level information that takes the same form across modalities. In fact, research shows that when supramodal information for a segment is available in both modalities, the speech function seems to take advantage of it (e.g., Grant and Seitz, 2000). As intimated, the supramodal information thesis might help explain how speech and talker learning can be shared across modalities, without bimodal experience. If listening to a talker involves attuning to supramodal talker-specific properties available in the acoustic signal, then later lipreading the talker becomes easier because those same supramodal properties can be accessed by the visual system. A similar conception may help explain multisensory training benefits overall as well as our ability to match talking voices and faces.