Figure 3. An early version of Lee's pronunciation evaluation system (see Lee, 2016, for the complete system). After transforming the waveform into a speech representation, the system aligns the two utterances via DTW and then extracts alignment-based features from the aligned path and the distance matrix. A support vector regression analysis (a form of optimization) is used to predict an overall pronunciation score. Reprinted from Lee and Glass (2013), with permission.
DTW and DNNs to pinpoint pronunciation errors and offer remedial feedback. Lee's (2016) study is especially instructive. It uses DTW to align the student's speech with a native-speaker model, along with machine learning to flag mispronunciations (Figure 3 shows a simplified version of the system).
The more similar a student's speech is to the native-speaker model, the more successful the DTW-based alignment is likely to be. Instances where the alignment falters or shows anomalies are flagged as potential mispronunciations. The system also identifies the specific type of error (substitution, deletion, or insertion). The benefit of this approach is its simplicity and its adaptability to a broad variety of languages without extensive customization.
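To make the approach concrete, the following Python sketch aligns two feature sequences with DTW and flags path points with unusually large local distances. This is not Lee's implementation; the Euclidean distance, the z-score anomaly rule, and the threshold are illustrative placeholders.

```python
# A minimal sketch of DTW-based pronunciation comparison. Features,
# distance measure, and anomaly rule are simplified placeholders.
import numpy as np

def dtw_align(ref, test):
    """Align two feature sequences (frames x dims) with dynamic time
    warping; return the frame-distance matrix and the optimal path."""
    n, m = len(ref), len(test)
    dist = np.linalg.norm(ref[:, None, :] - test[None, :, :], axis=2)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j - 1],   # match / substitution
                cost[i - 1, j],       # deletion (reference frame skipped)
                cost[i, j - 1])       # insertion (extra student frame)
    # Backtrack from the end to recover the alignment path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return dist, path

def flag_anomalies(dist, path, threshold=2.0):
    """Flag path points whose local distance is unusually large, a
    crude stand-in for the alignment-based features in Figure 3."""
    local = np.array([dist[i, j] for i, j in path])
    z = (local - local.mean()) / (local.std() + 1e-8)
    return [pt for pt, score in zip(path, z) if score > threshold]

ref = np.random.randn(50, 13)    # e.g., 13-dim MFCC frames, native model
test = np.random.randn(60, 13)   # student's utterance
dist, path = dtw_align(ref, test)
print(flag_anomalies(dist, path))
```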
Deep Linguistic Analysis
In classical ASR, a neural network is trained to recognize each of the dozens of different speech sounds (phones) in the phonological inventory of a language. Using a dictionary lookup, it is theoretically possible to represent thousands of words with just a few dozen symbols (which partially overlap with the 26 characters of the Roman alphabet). However, several dozen is still a large number of categories with which to train a neural network, particularly when preceding and following phonetic contexts are considered. Such context-dependent models (Yu and Li, 2015) can number in the thousands, making neural network training especially challenging when training material is limited.
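A back-of-the-envelope calculation shows why. Assuming a hypothetical inventory of 40 phones:

```python
# Raw count of context-dependent (triphone) categories for a
# hypothetical 40-phone inventory. Real systems tie similar contexts
# together to reach a workable number, but the raw space is enormous.
phones = 40
context_independent = phones        # one model per phone
context_dependent = phones ** 3     # left phone x phone x right phone
print(context_independent, context_dependent)   # 40 64000
```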
The training of the neural network comes in three basic forms (Bishop, 2006): (1) "supervised," in which the training data are explicitly labeled (as in Figure 1), preferably with time boundaries for segments and words; (2) "unsupervised," in which none of the training material is labeled; and (3) "semisupervised," in which a labeled portion of the training data is used to guide training on the unlabeled remainder.
For many years, supervised training was the norm despite being burdensome and expensive. Unsupervised and semisupervised training are becoming more popular as DNNs increase in sophistication.
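One common semisupervised recipe is self-training: fit a model on the labeled portion, then absorb confidently self-labeled examples from the unlabeled portion and refit. The Python sketch below uses scikit-learn; the classifier, confidence threshold, and number of rounds are illustrative placeholders, not a recommendation.

```python
# A minimal self-training sketch, one common semisupervised recipe.
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_lab, y_lab, X_unlab, confidence=0.9, rounds=3):
    """Fit on labeled data, then repeatedly absorb confidently
    self-labeled examples from the unlabeled pool and refit."""
    model = LogisticRegression(max_iter=1000)
    X, y = np.asarray(X_lab), np.asarray(y_lab)
    pool = np.asarray(X_unlab)
    for _ in range(rounds):
        model.fit(X, y)
        if len(pool) == 0:
            break
        probs = model.predict_proba(pool)
        keep = probs.max(axis=1) >= confidence   # trust only confident labels
        if not keep.any():
            break
        pseudo = model.classes_[probs[keep].argmax(axis=1)]
        X = np.vstack([X, pool[keep]])
        y = np.concatenate([y, pseudo])
        pool = pool[~keep]
    model.fit(X, y)                              # final fit on augmented set
    return model
```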
Deep Learning Neural Networks
The architectures of deep neural networks are more complex (and powerful) than classical ANNs as the result of their enhanced connectivity across time and (acoustic) frequency. This power is often augmented with "long short-term memory" (LSTM; Schmidhuber, 2015) and "attention" (Chorowski et al., 2015) models, which further enhance performance. Goodfellow et al. (2016) is an excellent, comprehensive introduction to deep learning and related topics.
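As a rough illustration of such an architecture (a sketch, not any particular published system), the following PyTorch code runs a bidirectional LSTM over acoustic frames and pools them with a simple attention layer to produce an utterance-level score. All dimensions and the scoring head are arbitrary.

```python
# A minimal LSTM-plus-attention sketch for utterance-level scoring.
import torch
import torch.nn as nn

class LSTMAttentionScorer(nn.Module):
    def __init__(self, n_features=40, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True,
                            bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)   # per-frame attention score
        self.head = nn.Linear(2 * hidden, 1)   # overall utterance score

    def forward(self, frames):                 # frames: (batch, time, feat)
        h, _ = self.lstm(frames)               # (batch, time, 2 * hidden)
        w = torch.softmax(self.attn(h), dim=1) # attention weights over time
        pooled = (w * h).sum(dim=1)            # weighted utterance summary
        return self.head(pooled).squeeze(-1)

model = LSTMAttentionScorer()
score = model(torch.randn(2, 300, 40))         # two 3-second utterances
```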
Neural networks trained for language instruction have an inherent advantage over those designed for speech dictation and search (e.g., Alexa, Google Voice, Siri) in that the lesson material is often scripted, with most of the words (and their sequence) known in advance. This knowledge makes it somewhat easier to infer which speech sounds have been spoken and in what order (through a pronunciation dictionary). However, this advantage is counterbalanced by the limited amount of data available to train CALL DNNs as well as the diverse (and often unusual) ways students pronounce foreign material.
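A toy Python example makes the point: with the prompt known in advance, a pronunciation dictionary yields the expected phone sequence before the student speaks. The two-word lexicon here, with ARPAbet-style transcriptions, is of course a stand-in for a real one.

```python
# A toy pronunciation lexicon; real systems use full dictionaries
# and handle out-of-vocabulary words.
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

def expected_phones(prompt):
    """Map a scripted prompt to its expected phone sequence."""
    phones = []
    for word in prompt.lower().split():
        phones.extend(LEXICON[word])
    return phones

print(expected_phones("Hello world"))
# ['HH', 'AH', 'L', 'OW', 'W', 'ER', 'L', 'D']
```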
DNNs have been used in the past where the amount of training material is less than ideal (at least for classical ANNs). However, even DNNs require a minimum amount of training data to succeed. For this reason, alternative strategies are used to compensate for the paucity of data. A popular approach is to reduce the number of training categories from several dozen to a handful by using a more compact representation such as articulatory-acoustic features (AFs; Stevens, 2002).
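The sketch below previews the idea developed next, using a hypothetical five-phone table: a few small feature groups (voicing, place, manner) can stand in for dozens of phone categories, and phones are recovered by combining the groups.

```python
# An illustrative mapping from phones to articulatory feature values.
# Instead of one network output per phone, a network can predict each
# small feature group separately. Values are simplified for clarity.
AF_TABLE = {
    "P": {"voiced": 0, "place": "bilabial", "manner": "stop"},
    "B": {"voiced": 1, "place": "bilabial", "manner": "stop"},
    "S": {"voiced": 0, "place": "alveolar", "manner": "fricative"},
    "Z": {"voiced": 1, "place": "alveolar", "manner": "fricative"},
    "M": {"voiced": 1, "place": "bilabial", "manner": "nasal"},
}

places = {entry["place"] for entry in AF_TABLE.values()}
manners = {entry["manner"] for entry in AF_TABLE.values()}
print(len(AF_TABLE), "phones ->", len(places), "places,",
      len(manners), "manners, 2 voicing values")
```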
What are AFs? They are acoustic models based on how speech is produced by the articulatory apparatus. For