The supramodal thesis also seems compatible with the modal flexibility of the brain (e.g., Rosenblum et al., 2016). As stated, auditory brain areas respond to visual, and even haptic, speech (for reviews, see Treille et al., 2014; Rosenblum et al., 2016).
The supramodal account may also help explain some general commonalities observed across the auditory and visual speech functions. First, sine-wave and point-light speech show that the brain can use dynamic, time-varying information in both modalities (Remez et al., 1997; Rosenblum et al., 2002). A second general commonality is that talker information can interact with, and even inform, speech perception in both modalities. As stated, we perceive speech better from familiar talkers whether listening or lipreading, despite having little formal experience with the latter. There is also neurophysiological evidence for a single brain area that brings together audiovisual talker and phonetic information (e.g., von Kriegstein et al., 2005).
We Always Imitate
There is a third general way in which auditory and visual speech perception are interestingly similar: they both act to shape the phonetic details of a talker’s response. During conversation, talkers will inadvertently imitate subtle aspects of each other’s speech intonation, speed, and vocal intensity (e.g., Giles et al., 1991). Talkers will also subtly imitate more microscopic aspects of one another’s speech: the phonetic details (e.g., Pardo, 2006). These details include vowel quality (e.g., Pardo, 2006) and the talker-specific delay in vocal cord vibration onset for segments such as “p” (Shockley et al., 2004).
This phonetic convergence not only occurs in the context of live conversation but also in the lab when participants are asked to listen to words and then simply say the words they hear out loud (e.g., Goldinger, 1998). Despite never being asked to explicitly mimic, or even “repeat,” participants will inadvertently articulate their words in a manner more similar to the words they hear. There are a number of possible reasons for phonetic convergence, including facilitation of social bonding (e.g., Pardo, 2006); easing speech understanding when faced with background noise (Dorsi et al., in preparation); and/or its being a by-product of the known link between speech perception and production (e.g., Shockley et al., 2004).
Importantly, there is now evidence that phonetic convergence can be induced by visible speech in perceivers with no formal lipreading experience (Miller et al., 2010). Visible speech can also enhance convergence: research shows that having visual as well as audible access to a talker’s articulations increases one’s degree of imitation (e.g., Dias and Rosenblum, 2016). Finally, evidence shows that audible and visible speech integrate before inducing convergence in a listener’s produced speech (Sanchez et al., 2010).
The fact that both auditory and visual speech behave similarly in inducing convergence is consistent with the neuroscience. As intimated, one explanation for convergence is the hypothesized connection between speech perception and production (e.g., Shockley et al., 2004). Convergence may partly be a by-product of the speech production system being enlisted for, and thus primed by, perception of the idiosyncrasies of a perceived word. The question of motor system involvement in speech perception has been ongoing since the 1960s (for a review, see Fowler et al., 2015). Although it is unclear whether motor involvement is necessary or just facilitatory (e.g., Hickok et al., 2009), it is known that speech motor brain areas are typically primed during speech perception (for a review, see Rosenblum et al., 2016).
Importantly, it is also known that motor areas of the brain are primed during visual speech perception regardless of one’s formal lipreading experience (e.g., Callan et al., 2003). Motor brain involvement also seems enhanced when perceiving audiovisual versus audio-alone or video-alone speech (e.g., Callan et al., 2014; but see Matchin et al., 2014). This finding is consistent with the enhanced phonetic convergence observed for audiovisual versus audio-alone speech (e.g., Dias and Rosenblum, 2016).
Thus, both the behavioral and the neurophysiological research reveal a commonality in the ability of auditory and visual speech information to induce a convergent production response. This characteristic joins time-varying and talker-relevant dimensions as general forms of information commonalities across the modalities. These facts, together with the close correlations between the detailed visible and acoustic dimensions, provide support for the speech brain being sensitive to a supramodal form of information.
Future Questions
The supramodal account proffers that much of multisensory speech perception is based on a speech function sensitive to higher order information that takes the same form across modalities. Although this may seem an unconventional theory of multisensory perception, we believe that it is consistent with much of the behavioral and neurophysiological data.