Page 14 - Summer 2018
Foreign Accent
Figure 2 plots how often listeners identified nonnative speakers as nonnative speakers (blue line), and native speakers (erroneously) as nonnative speakers (red line). As shown in Figure 2, accent detection rates were very high with stimuli including phrases, where there are many different opportunities for speakers to diverge from native productions. However, in the final condition (rightmost circles), just a portion of one consonant (the stop burst) was still enough to enable listeners to separate native from nonnative speech.
To appreciate just how impoverished the stimuli were, see Figure 3, which illustrates the sequence of stimuli with spectrographic images of a native production of one of the phrases used, “Two Little Boys.” The main frame in Figure 3 contains the complex spectral patterning of an entire phrase. The second and third experiments presented stimuli such as the word two: the high-frequency noise of the initial /t/ along with the lower frequency complex of the following vowel, between the leftmost and the third red cursor in the panel. In the last experiment, listeners heard just the 30-ms bit at the very beginning, marked off in Figure 3 by the two leftmost vertical dashed red lines. This bit contains a portion of what is called a burst release, the noise associated with the small puff of air that occurs when speakers open their mouths in the production of some consonants such as /t/. Acoustically, this is a short, noisy transient not more than 30 ms in duration. The foreign accent detection rates, even in this extraordinarily impoverished condition, were well above chance (see the difference between the rightmost red and blue circles in Figure 2).
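To make the burst-only condition concrete, here is a minimal sketch of how one might excise a 30-ms onset from a digital recording. The function name, sample rate, and the use of a plain list of samples are illustrative assumptions, not details from the study.

```python
def excise_onset(samples, sample_rate_hz, duration_ms=30):
    """Return the first `duration_ms` milliseconds of a mono waveform.

    A 30-ms window at 16 kHz corresponds to 480 samples; real stimulus
    preparation would also apply an offset ramp to avoid a click.
    """
    n = int(sample_rate_hz * duration_ms / 1000)
    return samples[:n]

# A 1-second "recording" at 16 kHz (zeros stand in for real audio):
clip = [0.0] * 16000
burst = excise_onset(clip, 16000)  # 30 ms -> 480 samples
```

The point of the sketch is simply how little signal survives: 480 samples out of 16,000, yet listeners could still detect an accent in it.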
The overall point of these research findings is obvious. If you want to produce speech in a new language that is indistinguishable from that of native speakers, you have to meet a very high standard. There are very tight criteria involved, not just in the production of each vowel but also in many aspects of the consonants. Listeners are thoroughly accustomed to the speech of their community, and they can detect even very small divergences from it. Also, while not all speech sounds necessarily exhibit appreciable divergences across languages, a good many do, and research on foreign accent has detected many such cases. Where the limits of detectable foreign accent lie, in terms of individual speech sounds, has still not been fully determined.
Problems of a Higher Order
As discouraging as this might be for a language learner, learning to produce second language speech is actually even more difficult. Differences between languages are found not only in the consonant and vowel pieces that make up speech but also in the higher order structure of the speech. Speech is not just a string of individual acoustical letters but involves the whole orchestration of these bits into the larger complex signal. Foreign-accented speech, then, also involves the dynamics of the acoustical patterns that arise in the sequencing of consonants and vowels.
To illustrate this point, I turn away from studies specifically of foreign accent detection to work that has been done on intelligibility. Intelligibility refers to measures of how accurately a listener can identify what a speaker is saying in a recording.
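A deliberately simple intelligibility score can be sketched as the proportion of target words a listener transcribes correctly. This is an illustrative scoring rule only; the studies discussed below used their own, more careful scoring procedures.

```python
def intelligibility(transcript, target):
    """Fraction of target words matched, position by position.

    A toy scoring rule for illustration: lowercase both strings,
    split on whitespace, and count exact word matches in order.
    """
    t_words = transcript.lower().split()
    g_words = target.lower().split()
    hits = sum(1 for a, b in zip(t_words, g_words) if a == b)
    return hits / len(g_words)

score = intelligibility("shelled eggs", "shelled egg")  # -> 0.5
```

Real scoring must also handle insertions, deletions, and near-misses, which is why word-level alignment is usually applied first.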
A particularly striking example of this problem of higher-order organization in second languages was demonstrated by a sophisticated study of the effects of dynamic modulations in nonnative speech on intelligibility by Tajima et al. (1997). Tajima et al. were puzzling over how to deal with the fact that speech patterns unfold in time in many ways. The speech signal does not have a fixed timing pattern; rather, timing is modulated by many well-known factors. In the process, they worked with a technique called dynamic time warping, whereby one can take two related acoustic signals with different timing patterns and determine a mapping between them. See, e.g., Rabiner et al. (1978) for an early speech foray into dynamic time warping. From this mapping, using various computational techniques, one can hybridize the two signals into one, with the spectral patterns from one signal and the timing patterns from the other.
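The core of dynamic time warping can be sketched in a few lines. This is a minimal textbook version operating on plain number sequences, not Tajima et al.'s actual implementation, which worked on spectral frames of speech; the distance function and sequences here are illustrative.

```python
import math

def dtw(a, b, dist=lambda x, y: abs(x - y)):
    """Dynamic time warping between two sequences.

    Fills the classic cumulative-cost table, then traces back to
    recover the warping path as (index_in_a, index_in_b) pairs.
    """
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = dist(a[i - 1], b[j - 1]) + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Trace the lowest-cost path back from the corner.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    return cost[n][m], list(reversed(path))

# The second sequence holds its middle value one step longer;
# DTW absorbs the timing difference, so the alignment cost is zero.
total, path = dtw([1, 2, 3], [1, 2, 2, 3])
```

The path tells you which frames of one signal correspond to which frames of the other, and that correspondence is exactly the "mapping" the hybridization step needs.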
This mapping pattern from one production to another is illustrated in Figure 4, which is an example from Tajima et al. (1997). The spectrographic images show the durational patterns of two productions of the phrase “shelled egg,” a foreign-accented production at the top and an unaccented one at the bottom. Tajima et al. took the timing pattern from nonnative Chinese productions of this and other English sentences and hybridized them with the spectral patterns of native productions of the same sentence. The outcome of this process, then, had the spectral properties of the native speech but the timing patterns of the nonnative speech. They then presented these hybridized recordings to native English listeners with various levels of masking noise and had them write down what they thought the person was saying. The logic of the study was to determine the effect of nonnative timing patterns on intelligibility. As a parallel test, they also took the timing patterns of native productions and hybridized them with the spectral properties of nonnative productions, thereby “correcting” the nonnative timing patterns.
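The hybridization idea itself can be sketched very simply: given an alignment between two frame sequences, keep one signal's frames but lay them out on the other signal's timeline. Everything below is a hypothetical simplification (hand-written alignment, letter-labeled "frames"); the actual resynthesis in the study involved proper signal processing, not frame copying.

```python
def transfer_timing(spectral_frames, path, timing_len):
    """Build a hybrid sequence: the timing donor's length, the
    spectral donor's content.

    `path` is a list of (spectral_index, timing_index) pairs, such as
    a dynamic-time-warping alignment, covering every timing slot.
    """
    pick = {}
    for i, j in path:
        pick.setdefault(j, i)  # first spectral frame aligned to each slot
    return [spectral_frames[pick[j]] for j in range(timing_len)]

# Three native frames "A B C" stretched onto a 5-slot nonnative timeline:
frames = ["A", "B", "C"]
path = [(0, 0), (0, 1), (1, 2), (2, 3), (2, 4)]
hybrid = transfer_timing(frames, path, 5)  # -> ["A", "A", "B", "C", "C"]
```

The hybrid "sounds like" the spectral donor but unfolds at the pace of the timing donor, which is the manipulation whose effect on intelligibility the study measured.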