Figure 4. "Pronunciation Practice" in Transparent Language's courseware for Spanish. The student's speech is analyzed and evaluated by acoustic analyzers trained with deep neural networks (DNNs). Problematic portions are highlighted in yellow. The student can replay these and compare them with a native speaker. An overall score is shown on the meter (right). Reprinted from Transparent Language, with permission.
For example, a phonetic segment can be decomposed into a cluster of acoustic "primitives" that fully distinguish it from other segments in the phonological inventory of the language. Among the most common AFs are "voicing," "manner of articulation," and "place of articulation." Voicing, a binary feature, refers to whether the vocal folds are vibrating (+) or not (−). For example, in the word pan, the initial consonant [p] is "unvoiced," whereas the vowel and following consonant [n] are "voiced." Manner of articulation indicates the mode of articulatory constriction impeding the flow of air through the vocal tract. Examples of manner of articulation categories are "vocalic" (e.g., the vowel in the word "pan"), "stop" (a.k.a. "plosive") consonant (e.g., [p] in "pan"), or nasal consonant (e.g., [n] in "pan"). Place of articulation refers to the locus of maximum vocal tract constriction. In our "pan" example, the initial consonant, [p], has a "bilabial" (both lips) anterior locus of articulation, while the final consonant, [n], is produced with the tongue contacting the alveolar ridge (a central place of articulation). Each speech sound (and, by extension, syllables and words) can be represented by an analogous set of articulatory features that varies over time.
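To make the AF representation concrete, a minimal sketch in Python is shown below. The class, its three fields, the feature values assigned to "pan," and the crude mismatch count are illustrative assumptions only, not the feature inventory of any particular system.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ArticulatoryFeatures:
    voiced: bool   # vocal folds vibrating (+) or not (-)
    manner: str    # e.g., "vocalic", "stop", "nasal"
    place: str     # e.g., "bilabial", "alveolar"

# The word "pan," decomposed segment by segment (values per the text above):
PAN = {
    "p": ArticulatoryFeatures(voiced=False, manner="stop",    place="bilabial"),
    "a": ArticulatoryFeatures(voiced=True,  manner="vocalic", place="central"),
    "n": ArticulatoryFeatures(voiced=True,  manner="nasal",   place="alveolar"),
}

def feature_mismatch(a: ArticulatoryFeatures, b: ArticulatoryFeatures) -> int:
    """Count how many AFs differ between two segments (a deliberately crude metric)."""
    return sum([a.voiced != b.voiced, a.manner != b.manner, a.place != b.place])
```

In a real system, each feature would be a time-varying value estimated from the acoustics rather than a fixed label, but the principle is the same: sounds are distinguished by small bundles of features.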
In EveryVoice™, a native-speaker model based in part on AFs is compared with the student's utterance using DTW in concert with several distance metrics. Speech sounds lying more than a threshold distance from the native-speaker model are highlighted (Figure 4). The application also provides an "intelligibility score," which reflects a weighted average of student-to-native distances across the utterance.
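The article does not specify EveryVoice's internal algorithms, but the general recipe it describes (DTW alignment of feature sequences, per-frame distance thresholding for highlighting, and an averaged intelligibility score) can be sketched in a few lines of Python. The function names, the Euclidean frame distance, the threshold value, and the distance-to-score mapping below are all assumptions for illustration.

```python
import numpy as np

def dtw_align(student: np.ndarray, native: np.ndarray):
    """Align two feature sequences (frames x dims) with classic DTW and
    return (student_frame, native_frame, distance) triples along the path."""
    n, m = len(student), len(native)
    dist = np.linalg.norm(student[:, None, :] - native[None, :, :], axis=-1)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):          # accumulate minimal warping cost
        for j in range(1, m + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    path, i, j = [], n, m              # backtrack the optimal path
    while i > 0 and j > 0:
        path.append((i - 1, j - 1, dist[i - 1, j - 1]))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return list(reversed(path))

def score_utterance(path, threshold=1.5):
    """Flag aligned frames whose distance exceeds `threshold` (for
    highlighting) and map the mean distance to a score in (0, 1]."""
    d = np.array([p[2] for p in path])
    flagged = [(i, j) for i, j, dd in path if dd > threshold]
    return flagged, float(1.0 / (1.0 + d.mean()))
```

A production system would presumably weight the per-frame distances (e.g., by their importance to intelligibility) before averaging, as the weighted-average score described above suggests.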
In the future, comparative analyses will likely include a range of time frames (and linguistic levels) such as those associated with the syllable (ca. 200 ms), word (ca. 200-600 ms), and phrase (1-3 s; Greenberg, 1999). Human listeners usually require a second or more of continuous speech to reliably identify all words spoken (Pickett and Pollack, 1963). This extended listening interval is also often required for automatic systems to achieve optimum performance and may account for the recent popularity of "end-to-end" (ETE) and "sequence-to-sequence" (STS) processing in ASR systems (e.g., Prabhavalkar et al., 2017). ETE and STS systems integrate acoustic, pronunciation, and language models into a single, coherent process and so would likely improve the accuracy of ASR-based CALL.
Higher Level CALL Applications
Learning a foreign language involves more than speaking intelligibly. Grammar and vocabulary must also be mastered. Constant practice is key for fluency. Online courseware encourages the student to speak and listen in a broad assortment of realistic situations. In some applications, the student is prompted to respond with a relevant sentence or two. Software evaluates the response, progressing to more difficult material only after the student has demonstrated mastery of the current lesson.
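The article does not describe how such mastery gating is implemented, but a rough sketch of a progression rule might look as follows; the threshold, window size, and 0-1 scoring scale are hypothetical.

```python
MASTERY_THRESHOLD = 0.8   # assumed pass mark on a 0-1 score scale

def next_lesson(current: int, recent_scores: list[float], window: int = 5) -> int:
    """Advance only after the average of the last `window` attempts
    clears the mastery bar; otherwise keep practicing the current lesson."""
    if len(recent_scores) >= window:
        if sum(recent_scores[-window:]) / window >= MASTERY_THRESHOLD:
            return current + 1    # demonstrated mastery: progress
    return current                # more practice needed
```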
A recent study using ASR goes beyond pronunciation to offer feedback on a variety of language skills, such as grammar and syntax, for students of Dutch (van Doremalen et al.,