Page 26 - Winter Issue 2018

Deep Language Learning
[Figure 4 screenshot: the Spanish prompt "Mi cumpleaños es mañana." with its translation "My birthday is tomorrow.", speaker labels, and a "Record your version" control.]

Figure 4. “Pronunciation Practice” in Transparent Language's courseware for Spanish. The student's speech is analyzed and evaluated by deep
learning neural network (DNN)-trained acoustic analyzers. Problematic portions are highlighted in yellow. The student can replay these and
compare them with a native speaker. An overall score is shown on the meter (right). Published from Transparent Language, with permission.
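
The text describes the comparison behind Figure 4 as dynamic time warping (DTW) of the student's utterance against a native-speaker model, with distant speech sounds highlighted and a weighted average of distances reported as a score. The following is a minimal sketch of that idea, not the courseware's actual implementation. The two-element feature vectors (voicing, nasality), the Euclidean frame distance, the threshold, and the equal weights are all assumptions made for illustration.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two equal-length feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_align(student, native):
    """Return (total_cost, path) for the cheapest monotonic alignment.

    path is a list of (student_index, native_index) pairs.
    """
    n, m = len(student), len(native)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(student[i - 1], native[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])

    # Walk back from the end, always stepping to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path


def flag_frames(student, native, threshold):
    """Per-student-frame worst aligned distance, plus indices above the threshold."""
    _, path = dtw_align(student, native)
    worst = [0.0] * len(student)
    for i, j in path:
        worst[i] = max(worst[i], frame_distance(student[i], native[j]))
    flagged = [i for i, d in enumerate(worst) if d > threshold]
    return worst, flagged


def weighted_average(distances, weights):
    """Weighted mean of the per-frame distances (weights are hypothetical)."""
    return sum(d * w for d, w in zip(distances, weights)) / sum(weights)


# Hypothetical frames as [voicing, nasality]: the native model approximates
# "pan" ([p], vowel, vowel, [n]); the student ends on an unvoiced, non-nasal sound.
native = [[0, 0], [1, 0], [1, 0], [1, 1]]
student = [[0, 0], [1, 0], [0, 0]]

total, path = dtw_align(student, native)
worst, flagged = flag_frames(student, native, threshold=1.0)
average = weighted_average(worst, [1.0, 1.0, 1.0])
print(path, flagged, round(average, 3))
```

In this toy case only the student's final frame is flagged, because it is aligned with the native [n] and differs from it in both voicing and nasality. How a real system maps the averaged distance onto the meter shown in the figure is not specified in the article.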
example, a phonetic segment can be decomposed into a cluster of acoustic “primitives” that fully distinguish it from other segments in the phonological inventory of the language. Among the most common AFs are “voicing,” “manner of articulation,” and “place of articulation.” Voicing, a binary feature, refers to whether the vocal folds are vibrating (+) or not (-). For example, in the word “pan,” the initial consonant [p] is “unvoiced,” whereas the vowel and following consonant [n] are “voiced.” Manner of articulation indicates the mode of articulatory constriction impeding the flow of air through the vocal tract. Examples of manner of articulation categories are “vocalic” (e.g., the vowel in the word “pan”), “stop” (a.k.a. “plosive”) consonant (e.g., [p] in “pan”), or nasal consonant (e.g., [n] in “pan”). Place of articulation refers to the locus of maximum vocal tract constriction. In our “pan” example, the initial consonant, [p], has a “bilabial” (both lips) anterior locus of articulation while the final consonant, [n], is produced with the tongue contacting the alveolar ridge (a central place of articulation). Each speech sound (and by extension, syllables and words) can be represented by an analogous set of articulatory features that varies over time.

In EveryVoice, a native-speaker model (based in part on AFs) is compared with the student’s utterance by using DTW in concert with several distance metrics. Those speech sounds more than a certain distance from a native-speaker model are highlighted (Figure 4). The application also provides an “intelligibility score,” which reflects a weighted average of student-native distances across the utterance.

In the future, comparative analyses will likely include a range of time frames (and linguistic levels) such as those associated with the syllable (ca. 200 ms), word (ca. 200-600 ms), and phrase (1-3 s; Greenberg, 1999). Human listeners usually require a second or more of continuous speech to reliably identify all words spoken (Pickett and Pollack, 1963). This extended listening interval is also often required for automatic systems to achieve optimum performance and may account for the recent popularity of “end-to-end” (ETE) and “sequence-to-sequence” (STS) processing in ASR systems (e.g., Prabhavalkar et al., 2017). ETE and STS systems integrate acoustic, pronunciation, and language models into a single, coherent process and so would likely improve the accuracy of ASR-based CALL.

Higher Level CALL Applications

Learning a foreign language involves more than speaking intelligibly. Grammar and vocabulary must also be mastered. Constant practice is key for fluency. Online courseware encourages the student to speak and listen in a broad assortment of realistic situations. In some applications, the student is prompted to respond with a relevant sentence or two. Software evaluates the response, progressing to more difficult material only after the student has demonstrated mastery of the current lesson.

A recent study using ASR goes beyond pronunciation to offer feedback on a variety of language skills, such as grammar and syntax, for students of Dutch (van Doremalen et al.,
24 | Acoustics Today | Winter 2018