
Deep Language Learning

(2) "Alignment" of a student's speech. This produces a representation of the signal as a sequence of speech sound labels (e.g., [p], [ae], [n]) along with the start and end points of each sound in the speech waveform (these are approximate markers of where a speech sound is likely to begin and end). (A schematic sketch of such a representation appears after this list.)

(3) A word transcript is often required so that the aligner knows in advance the likely speech sounds and their sequence. The alignment (Figure 1 shows an example) is based (in part) on acoustic models for each speech sound, often in the context of the sounds that precede and follow. In some systems, such as EduSpeak®, the student's speaking rate (in segments or syllables per second) is estimated as a way of improving the accuracy of the phonetic boundaries.

(4) Dynamic time warping (DTW; Sakoe and Chiba, 1978) is a method for aligning the student's speech, speech sound by speech sound, with a native-speaker composite model. DTW adjusts the segment boundaries to optimize the correspondence between the student and native-speaker model so that a "fair comparison" can be made at the phonetic-segment, syllable, and word levels (Figure 2). (A minimal worked example of DTW also appears after this list.)

Figure 2. A highly simplified illustration of dynamic time warping (DTW) to achieve a fair comparison of two aligned speech signals. Signals A and B are two instances of the word "speech" spoken by two individuals. Dynamic time alignment iteratively warps the time axis of Signal B until it finds the closest match (in terms of time and spectrum) of the two signals. The DTW grid shows a hypothetical time warping of Signal B (relative to Signal A) to achieve an optimal alignment (i.e., the closest spectral match of the two). Signals A and B alignment adapted from Zeng (2000) and DTW grid adapted from Salvador and Chan (2007), with permission.

(5) A distance metric that quantifies how similar the student's speech is to a native speaker (or rather a composite model comprising many native speakers). The features used for comparison are primarily spectral but may also incorporate dynamic and other temporal properties.

(6) The intrinsic variability of speech, particularly pronunciation, presents a major challenge for CALL technology. To simplify the comparison between the student's utterance and that of a native-speaker model, the analysis recasts the fine-grained spectral and temporal analyses into a form more amenable to quantification.
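As a concrete illustration of the alignment output described in steps (2) and (3), the sketch below shows one way a phone-level alignment might be represented in code. The pronunciation dictionary entry, phone labels, and time stamps are invented for illustration and are not the output format of any particular aligner.

```python
from dataclasses import dataclass

@dataclass
class PhoneSegment:
    label: str    # speech sound label, e.g., "P" or "IY"
    start: float  # approximate start time (seconds)
    end: float    # approximate end time (seconds)

# Hypothetical pronunciation dictionary entry: from a word transcript,
# the aligner knows the likely speech sounds and their order (step 3).
PRON_DICT = {"speech": ["S", "P", "IY", "CH"]}

# Hypothetical alignment of one utterance of "speech" (step 2): each
# expected phone is paired with approximate start and end points.
alignment = [
    PhoneSegment("S", 0.00, 0.12),
    PhoneSegment("P", 0.12, 0.20),
    PhoneSegment("IY", 0.20, 0.38),
    PhoneSegment("CH", 0.38, 0.55),
]

# The aligner's labels follow the expected dictionary sequence.
assert [seg.label for seg in alignment] == PRON_DICT["speech"]

for seg in alignment:
    print(f"{seg.label:>3}: {seg.start:.2f}-{seg.end:.2f} s "
          f"(duration {seg.end - seg.start:.2f} s)")
```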
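Step (4) can be sketched in a similar spirit. The fragment below is a minimal, textbook-style DTW between two feature sequences: it fills the kind of grid shown in Figure 2 with frame-to-frame distances, finds the lowest-cost warping path, and returns an average distance of the kind described in step (5). The Euclidean frame distance and the random feature vectors are placeholders; a real system would use richer spectral features and path constraints.

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Average frame distance along the best DTW warping path.

    a, b: (num_frames, num_features) arrays of spectral feature
    vectors (e.g., one vector per short analysis frame).
    """
    n, m = len(a), len(b)
    # Local cost: Euclidean distance between every pair of frames.
    cost = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

    # Accumulated cost grid (the "DTW grid" of Figure 2).
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j],      # stretch one signal
                acc[i, j - 1],      # stretch the other
                acc[i - 1, j - 1],  # advance both together
            )
    # Normalize by an upper bound on path length so utterances of
    # different durations remain comparable.
    return acc[n, m] / (n + m)

# Toy example: two random "spectral" sequences of different lengths.
rng = np.random.default_rng(0)
student = rng.normal(size=(50, 13))    # e.g., 13 cepstral-like features
reference = rng.normal(size=(60, 13))  # composite native-speaker model
print(f"DTW distance: {dtw_distance(student, reference):.3f}")
```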
Because ASR systems have traditionally treated speech as a sequence of short-duration speech sounds (i.e., phonetic segments), it is this analytical framework that is most often used. However, word-level models are becoming increasingly popular and may replace segment models soon.
Three types of pronunciation errors account for most of the pronunciation problems students experience. These mostly occur at the level of individual speech segments (although some problems pertain to syllable prominence and duration). In the discussion that follows, a segment error is underlined to distinguish it from correctly articulated sounds.

A substitution error would be one where the student pronounces the English word "land" as "lend". An insertion would occur if the student adds an extra sound to "land" (e.g., a vowel appended after the final [d]). A deletion would occur if "land" were pronounced as "lan" (where the word-final sound [d] is not articulated).

Such departures from the canonical, dictionary pronunciation are one reason why DTW is frequently used to compute the distance between element X and element Y, where X and Y may be a word, a phrase, or an even longer span of speech (e.g., a sentence).
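To make the three error types concrete, the following sketch aligns the expected (dictionary) phone sequence with the phones a student actually produced and labels each mismatch as a substitution, insertion, or deletion. The phone sequences are invented, and the standard-library sequence matcher is only a rough stand-in for the alignment a real CALL system would perform.

```python
from difflib import SequenceMatcher

# Expected (dictionary) phones for "land" vs. phones actually produced.
expected = ["L", "AE", "N", "D"]   # canonical "land"
observed = ["L", "EH", "N", "D"]   # student said something like "lend"

matcher = SequenceMatcher(None, expected, observed)
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag == "replace":
        print(f"substitution: {expected[i1:i2]} -> {observed[j1:j2]}")
    elif tag == "delete":
        print(f"deletion: {expected[i1:i2]} omitted")
    elif tag == "insert":
        print(f"insertion: {observed[j1:j2]} added")
```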
Because pronunciation is inherently variable (without impacting intelligibility), the distance calculation is usually based on a large number (often hundreds or thousands) of signal parameters. Such complexity is then distilled into a computationally more tractable form using data-reduction methods such as feature selection (e.g., James et al., 2013, p. 203), principal component analysis (e.g., Jolliffe, 2002), or special-purpose neural networks such as autoencoders (Liou et al., 2014). The distance metric may include high-level features such as pronunciation error type, intonation, and other pitch properties (e.g., tone level and contour).
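As one example of the data-reduction step just described, the sketch below applies principal component analysis to shrink a large set of per-utterance signal parameters to a handful of components before any distances are computed. The data are synthetic and the dimensions arbitrary; this illustrates the general technique rather than any particular product's pipeline.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 200 utterances, each described by 500 signal
# parameters (spectral, temporal, pitch-related, and so on).
X = rng.normal(size=(200, 500))

# Principal component analysis via singular value decomposition.
X_centered = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

k = 10                              # keep the 10 strongest components
X_reduced = X_centered @ Vt[:k].T   # 200 x 10 instead of 200 x 500

# Distances between utterances are now computed in the reduced space.
d = np.linalg.norm(X_reduced[0] - X_reduced[1])
print(f"Reduced shape: {X_reduced.shape}, example distance: {d:.3f}")
```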
Using such comparative methods, Lee and Glass (2015) and Transparent Language's EveryVoice™ technology deploy