Page 25 - Winter Issue 2018

DTW and DNNs to pinpoint pronunciation errors and offer remedial feedback. Lee’s (2016) study is especially instructive. It uses DTW alignment of the student’s speech with a native-speaker model, as well as machine learning, to flag mispronunciations (Figure 3 shows a simple version of their system).

[Figure 3 diagram: two input utterances (student and teacher), each passed through speech representation extraction, followed by dynamic time warping, alignment-based feature extraction, and support vector regression, which yields a pronunciation score.]

Figure 3. An early version of Lee’s pronunciation evaluation system (see Lee, 2016 for the complete system). After transforming the waveform into a speech representation, the system aligns the two utterances via DTW and then extracts alignment-based features from the aligned path and the distance matrix. A support vector regression analysis (a form of optimization) is used for predicting an overall pronunciation score. Reprinted from Lee and Glass (2013), with permission.

The more similar a student’s speech is to the native-speaker model, the more successful the DTW-based alignment is likely to be. Instances where the alignment falters or shows anomalies are flagged as potential mispronunciations. The system also ascertains the specific form of error (substitution, deletion, or insertion). The benefit of this approach is its simplicity and adaptability to a broad variety of languages without the need for extensive customization.

Deep Linguistic Analysis

In classical ASR, a neural network is trained to recognize each of the dozens of different speech sounds (phones) in the phonological inventory of a language. Using a dictionary lookup, it’s theoretically possible to represent thousands of words using just a few dozen symbols (which partially overlap with the 26 characters of the Roman alphabet). However, several dozen is still a large number with which to train a neural network, particularly when preceding and following phonetic contexts are considered. Such context-dependent models (Yu and Li, 2015) can number in the thousands, making neural network training especially challenging when the amount of training material is limited.

The training of the neural network comes in three basic forms (Bishop, 2006): (1) “supervised,” in which the training data are explicitly labeled (as in Figure 1), preferably with time boundaries for segments and words; (2) “unsupervised,” in which none of the training material is labeled; and (3) “semisupervised,” in which a portion of the training is supervised and used for training on unlabeled data. For many years, supervised training was the norm despite it being burdensome and expensive. Unsupervised and semisupervised training are becoming more popular as DNNs increase in sophistication.

Deep Learning Neural Networks

The architectures of deep neural networks are more complex (and powerful) than those of classical ANNs as the result of their enhanced connectivity across time and (acoustic) frequency. This power is often augmented with “long short-term memory” (LSTM; Schmidhuber, 2015) and “attention” (Chorowski et al., 2015) models, which further enhance performance. Goodfellow et al. (2016) is an excellent, comprehensive introduction to deep learning and related topics.

Neural networks trained for language instruction have an inherent advantage over those designed for speech dictation and search (e.g., Alexa, Google Voice, Siri) in that the lesson material is often scripted, with most of the words (and their sequence) known in advance. This knowledge makes it somewhat easier to infer which speech sounds have been spoken and in what order (through a pronunciation dictionary). However, this advantage is counterbalanced by the limited amount of data available to train CALL DNNs as well as the diverse (and often unusual) ways students pronounce foreign material.

DNNs have been used in the past where the amount of training material is less than ideal (for classical ANNs). However, even DNNs require a minimum amount of training data to succeed. For this reason, alternative strategies are used to compensate for the paucity of data. A popular approach is to reduce the number of training categories from several dozen to a handful by using a more compact representation known as articulatory-acoustic features (AFs; Stevens, …).

What are AFs? They are acoustic models that are based on how speech is produced by the articulatory apparatus. For
Winter 2018 | Acoustics Today | 25
