(2) “Alignment” of a student’s speech. This produces a representation of the signal as a sequence of speech sound labels (e.g., [p], [ae], [n]) along with the start and end points of each sound in the speech waveform (these are approximate markers of where a speech sound is likely to begin and end); a sketch of one way such output can be represented appears after this list.
(3) A word transcript is often required so that the aligner knows in advance the likely speech sounds and their sequence. The alignment (Figure 1 shows an example) is based (in part) on acoustic models for each speech sound, often in the context of the sounds that precede and follow. In some systems, such as EduSpeak®, the student’s speaking rate (in segments or syllables per second) is estimated as a way of improving the accuracy of the phonetic boundaries.
(4) Dynamic time warping (DTW; Sakoe and Chiba, 1978) is a method for aligning the student’s speech, speech sound by speech sound, with a native-speaker composite model. DTW adjusts the segment boundaries to optimize the correspondence between the student and native-speaker model so that a “fair comparison” can be made at the phonetic-segment, syllable, and word levels (Figure 2); a minimal sketch of the DTW computation appears after this list.
(5) A distance metric that quantifies how similar the student’s speech is to that of a native speaker (or, rather, to a composite model comprising many native speakers). The features used for comparison are primarily spectral but may also incorporate dynamic and other temporal properties.
(6) The intrinsic variability of speech, particularly pronunciation, presents a major challenge for CALL technology. To simplify the comparison between the student’s utterance and that of a native-speaker model, the analysis recasts the fine-grained spectral and temporal analyses into a form more amenable to quantification.
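To make the output of the alignment step concrete, here is a minimal Python sketch of one plausible way to represent an aligner’s result; the phone labels, times, and data layout are illustrative assumptions rather than the output format of any particular aligner.

from typing import NamedTuple, List

class AlignedPhone(NamedTuple):
    """One phonetic segment produced by a forced aligner (hypothetical layout)."""
    label: str    # speech-sound label, e.g., "p", "ae", "n"
    start: float  # approximate start time in seconds
    end: float    # approximate end time in seconds

# A hypothetical alignment of the word "pan" against its waveform.
alignment: List[AlignedPhone] = [
    AlignedPhone("p", 0.050, 0.120),
    AlignedPhone("ae", 0.120, 0.310),
    AlignedPhone("n", 0.310, 0.420),
]

for seg in alignment:
    duration_ms = (seg.end - seg.start) * 1000
    print(f"[{seg.label}]  {seg.start:.3f}-{seg.end:.3f} s  ({duration_ms:.0f} ms)")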
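The following sketch illustrates the DTW idea itself: it warps the time axis of one feature sequence against another and returns the accumulated spectral distance along the best warping path. It assumes both utterances have already been reduced to frame-by-frame feature vectors (e.g., MFCCs); feature extraction and the native-speaker composite model are outside its scope.

import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Classic dynamic time warping (Sakoe and Chiba, 1978) between two
    feature sequences a (n x d) and b (m x d), e.g., frames of a student
    utterance and of a native-speaker model.

    Returns the accumulated frame-to-frame distance along the optimal
    warping path; smaller values mean the utterances are more similar.
    """
    n, m = len(a), len(b)
    # cost[i, j] = best accumulated distance aligning a[:i] with b[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])  # local spectral distance
            # allow a diagonal step or a stretch of either time axis
            cost[i, j] = d + min(cost[i - 1, j - 1],  # match
                                 cost[i - 1, j],      # stretch a
                                 cost[i, j - 1])      # stretch b
    return float(cost[n, m])

# Toy example: two "utterances" of different lengths as random 13-dimensional frames.
rng = np.random.default_rng(0)
student = rng.normal(size=(60, 13))  # e.g., 60 frames from the student
model = rng.normal(size=(75, 13))    # native-speaker model, different duration
print(f"DTW distance: {dtw_distance(student, model):.2f}")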
Because ASR systems have traditionally treated speech as a sequence of short-duration speech sounds (i.e., phonetic segments), it is this analytical framework that is most often used. However, word-level models are becoming increasingly popular and may replace segment models soon.
Three types of pronunciation errors account for most of the pronunciation problems students experience. These mostly occur at the level of individual speech segments (although some problems pertain to syllable prominence and duration). In the discussion that follows, a segment error is underlined to distinguish it from correctly articulated sounds.
Figure 2. A highly simplified illustration of dynamic time warping (DTW) to achieve a “fair” comparison of two aligned speech signals. Signals A and B are two instances of the word “speech” spoken by two individuals. Dynamic time alignment iteratively warps the time axis of Signal B until it finds the closest match (in terms of time and spectrum) of the two signals. The DTW grid shows a hypothetical time warping of Signal B (relative to Signal A) to achieve a quantitatively optimal alignment (i.e., the closest spectrotemporal match) of the two. Signal A and B alignment adapted from Zeng (2000) and DTW grid adapted from Salvador and Chan (2007), with permission.
A “substitution” error would be one where the student pronounces the English word “land” as “lend.” An “insertion” would occur if the student pronounces “land” as “lands.” A “deletion” would occur if “land” were pronounced as “lan” (where the word-final sound [d] is not articulated).
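One way to operationalize these three categories is to align the phone sequence the student produced against the expected dictionary sequence with an edit-distance computation and read the substitutions, insertions, and deletions off that alignment. The sketch below is an illustrative Python version using only the standard library, not the procedure of any specific CALL system.

from difflib import SequenceMatcher

def phone_errors(expected, produced):
    """Label substitution, insertion, and deletion errors between the
    expected (dictionary) phone sequence and the phones the student produced."""
    errors = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, expected, produced).get_opcodes():
        if tag == "replace":
            errors.append(("substitution", expected[i1:i2], produced[j1:j2]))
        elif tag == "delete":
            errors.append(("deletion", expected[i1:i2], []))
        elif tag == "insert":
            errors.append(("insertion", [], produced[j1:j2]))
    return errors

# "land" pronounced as "lend": the vowel [ae] is replaced by [eh].
print(phone_errors(["l", "ae", "n", "d"], ["l", "eh", "n", "d"]))
# "land" pronounced as "lan": the word-final [d] is deleted.
print(phone_errors(["l", "ae", "n", "d"], ["l", "ae", "n"]))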
Such departures from the canonical, dictionary pronunciation are one reason why DTW is frequently used to compute the distance between element X and element Y, where X and Y may be a word, a phrase, or an even longer span of speech (e.g., a sentence).
Because pronunciation is inherently variable (without impacting intelligibility), the distance calculation is usually based on a large number (often hundreds or thousands) of signal parameters. Such complexity is then distilled into a computationally more tractable form using data-reduction methods such as feature selection (e.g., James et al., 2013, p. 203), principal component analysis (e.g., Jolliffe, 2002), or special-purpose neural networks such as autoencoders (Liou et al., 2014). The distance metric may include “high-level” features such as pronunciation error type, intonation, and other pitch properties (e.g., tone level and contour).
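As an illustration of this data-reduction step, the sketch below projects a bank of per-utterance signal parameters onto a small number of principal components with scikit-learn; the feature counts and the random stand-in data are assumptions made purely for demonstration.

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical set of utterances, each described by hundreds of raw
# spectral/temporal parameters (random numbers stand in for them here;
# real features would be correlated and compress far better).
rng = np.random.default_rng(1)
raw_features = rng.normal(size=(500, 800))  # 500 utterances x 800 parameters

# Distill the parameters into a handful of components so that the later
# distance computation operates on a tractable number of dimensions.
pca = PCA(n_components=20)
reduced = pca.fit_transform(raw_features)

print(reduced.shape)  # (500, 20)
print(f"variance retained: {pca.explained_variance_ratio_.sum():.1%}")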
Using such comparative methods, Lee and Glass (2015) and Transparent Language’s EveryVoice™ technology deploy