Page 23 - Winter Issue 2018


[Figure 1 image: “Automatic” and “Manual” phonetic label tiers (top), spectrogram (middle), and speech pressure waveform (bottom); horizontal axis: Time, 0.0 to 1.6.]
Figure 1. Alignment of spoken material (“nine,” “seven,” “two,” “three,” “two”) from the Oregon Graduate Institute “Numbers” corpus (Cole
et al., 1994). Top: phonetic labels are similar to those used in the TIMIT corpus (Zue and Seneff, 1988); middle: spectrographic (time versus
frequency) representation of the speech signal; bottom: speech pressure waveform. The “automatic” labeling and segment boundaries are
analogous to an alignment. The “manual” labels and segment boundaries were provided by a trained phonetician. Reprinted from Chang et
al. (2000), with permission.
“language” model (Chelba and Jelinek, 2000) that factors in semantic context and lexical co-occurrence statistics to identify the word as “pan.” ASR-based CALL requires other strategies to compensate for such phonetic imprecision, especially human listener judgments. However, such compensatory methods may themselves compromise the evaluation’s accuracy.

An early example of ASR-based CALL was the Voice Interactive Language Training System (VILTS). A student’s pronunciation was evaluated by computing how well the ASR system performed relative to native speakers at a fine-grained level of analysis. The speech signal was partitioned into a sequence of basic sounds (“phones” or “phonetic segments”), and the results of a segment-based ASR system compared. Human raters graded a portion of the student material, and these data served as a baseline for normalizing the ASR-based scores. The system did not offer feedback on how to improve pronunciation (Neumeyer et al., 2000).

VILTS formed the foundation for another SRI system, EduSpeak (Franco et al., 1999), which evaluates a student’s pronunciation. The system comprises several stages: (1) segmentation and labeling (i.e., an “alignment”) of individual phonetic segments (see Figure 1); (2) a measure of the distance between a student’s speech and a native-speaker model (based largely on the [...] frequency spectrum); (3) a comparison of automatically aligned phonetic-segment durations that takes the student’s speaking rate into account; and (4) human listener evaluations for calibration. EduSpeak does not require a word transcript but is restricted to languages for which it has been explicitly trained.

Other groups (e.g., Witt and Young, 2000) have also deployed ASR for CALL. Many systems use human listener-based calibration to compensate for the imperfections of ASR. But, as Witt (2012) points out, even human listeners don’t necessarily agree on the fine-grained quality of pronunciation (at the segment level), so why should machines be held to a higher standard?

Despite such caveats, several language programs, including Rosetta Stone and Carnegie Speech’s NativeAccent®, do offer feedback at the word level (using ASR-based models) that students have found helpful. NativeAccent also provides rudimentary diagrams of the vocal apparatus as part of its feedback.

Alternatives to ASR-based CALL compare a student’s pronunciation to a native speaker’s (or rather, a composite [...] speech is likely to be.

Such a comparison involves both signal processing and acoustic analysis, and includes the following steps.

(1) Phonetic (and other forms of) feature extraction based on a range of spectral and temporal properties for classifying phonetic segments and/or linguistically relevant elements. The most frequent features are (a) a coarse snapshot (25 ms wide) of the acoustic frequency spectrum computed approximately every 10 ms (e.g., Mel Frequency Cepstral Coefficients [MFCCs]; Davis and Mermelstein, 1983); (b) a broadband frequency analysis with relatively fine temporal resolution (a spectrogram as in Figure 1); (c) temporal dynamics (velocity [delta] and acceleration [double-delta] features) of the spectrally filtered speech waveform (Furui, 1986); phonetic-segment and syllable duration, as well as the trajectory of the fundamental frequency (pitch contour). A system may also “discover” the most relevant parameters through a process of “feature selection” (e.g., Li et al., 2013).
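
The frame-based analysis described in step (1) can be illustrated with a short sketch. It is not the procedure of any system cited here, only a minimal example under stated assumptions: a 16 kHz sampling rate, a Hamming window, a synthetic 440 Hz tone in place of speech, and hypothetical function names (frame_signal, log_spectra, deltas). It computes plain log magnitude spectra rather than full MFCCs (which would add a mel filter bank and a cosine transform), and it approximates delta features with a simple frame-to-frame difference.

```python
import numpy as np

def frame_signal(signal, sample_rate, win_ms=25.0, hop_ms=10.0):
    """Split a 1-D signal into overlapping frames, one row per frame."""
    win = int(round(sample_rate * win_ms / 1000.0))   # samples per 25 ms window
    hop = int(round(sample_rate * hop_ms / 1000.0))   # samples per 10 ms step
    if len(signal) < win:
        raise ValueError("signal is shorter than one analysis window")
    n_frames = 1 + (len(signal) - win) // hop
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]

def log_spectra(frames):
    """Hamming-window each frame and return its log magnitude spectrum."""
    windowed = frames * np.hamming(frames.shape[1])
    magnitude = np.abs(np.fft.rfft(windowed, axis=1))
    return np.log(magnitude + 1e-10)  # small floor avoids log(0)

def deltas(features):
    """Rate of change across frames (finite difference along the time axis)."""
    return np.gradient(features, axis=0)

# Assumed test input: one second of a 440 Hz tone sampled at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 440.0 * t)

frames = frame_signal(signal, sample_rate)      # 25 ms windows every 10 ms
spec = log_spectra(frames)                      # spectral "snapshots"
delta = deltas(spec)                            # velocity features
delta2 = deltas(delta)                          # acceleration features
features = np.hstack([spec, delta, delta2])     # one feature vector per frame
```

Each row of features corresponds to one 10 ms step, so a segment's duration can be read off as a count of rows once segment boundaries are known.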
Winter 2018 | Acoustics Today
