Winter2018

Deep Language Learning
(5) cloud-based virtual and augmented reality applications that extend or replace the user’s physical environment through simulation of a variety of situations and environments.
These, along with advances yet to come, will transform the learning experience, not only for language instruction but also for pedagogy in general.
What is the current state of language learning technology, and where is it heading? Before answering, let’s first review the history of CALL (Bax, 2003).
A Brief History of Computer-Assisted Language Learning
Computers were introduced into language instruction around 1960 to supplement programmed classroom instruction. Although the technology was primitive by today’s standards, early CALL projects demonstrated a potential for enhancing the pedagogical experience. One example is the Programmed Logic for Automatic Teaching Operations (PLATO) Project (University of Illinois at Urbana-Champaign), which included online testing, tutoring, and chat rooms.
Over the years, the quality of CALL improved, driven by advances in interactive media and technology (Warschauer and Healy, 1998). In the 1960s and 1970s, CALL focused on drill and practice lessons in which a computer presented a stimulus and the student responded with (hopefully) the correct response. This was the “structural” (or “restricted”) phase of CALL. Beginning in the late 1970s and extending through the early 1990s, CALL entered its “communicative” phase, which emphasized more natural ways of speaking and listening.
With the advent of the World Wide Web and multimedia technology in the 1990s, CALL entered its “integrative” phase, in which the pedagogy was incorporated into a broad range of communication scenarios representative of daily life. During this time, CALL applications offered graphics, animation, audio, and text, all in lessons that combined speaking, listening, reading, and writing (Chapelle and Sauro, 2017).
The key to effective language learning is for the student to use the foreign language as much as possible. Constant practice and feedback are essential. A shortage of language instructors and classroom time makes a compelling case for CALL because it offers instruction anytime, anywhere. Although CALL was originally designed for desktop and laptop computers, its future likely lies with smartphones, tablets, and other mobile devices (e.g., virtual reality [VR] goggles and artificial intelligence [AI]-enabled eyewear).
Computer-Assisted Pronunciation Evaluation and Training
Pronunciation training is where CALL has long deployed cutting-edge technology (Eskenazi, 2009). Several early projects used speech technology to evaluate a student’s fluency, pronunciation proficiency, and comprehension. An example is SRI’s Autograder project, in which Japanese students were evaluated on their ability to speak English intelligibly. An algorithm was developed to emulate the intelligibility judgments of native speakers, but it offered no remedial feedback. Some of this technology was incorporated into PhonePass™, an automatic system for evaluating a student’s fluency and proficiency in English (Bernstein and Cheng, 2007).
Both academic (e.g., Carnegie-Mellon, Hong Kong, MIT, Nijmegen, KTH Stockholm) and commercial (e.g., Carnegie Speech, Duolingo, Rosetta Stone®, SRI, Transparent Language®) teams have developed technology that evaluates pronunciation using methods adopted from ASR. At first glance, ASR appears a perfect match for CALL. In place of a language teacher, why not leave the tedium of tutoring to an algorithm embedded in the cloud? It’s available 24/7, never tires or sickens, and doesn’t go on vacation.
However, ASR-based CALL has its drawbacks. For one, ASR doesn’t classify individual speech sounds with great precision (e.g., Greenberg and Chang, 2000). Like humans, automatic systems don’t decode speech sound by sound but rather rely on clever engineering to infer what the speaker said (or should have said). They do so by culling information from a variety of nonacoustic sources (e.g., location, email, online searches) to supplement the acoustic signal. Although fortuitous for conventional ASR (e.g., Amazon’s Alexa, Apple’s Siri, and Google Voice), such supplementation can be a serious drawback for CALL applications. This is due to the uncertainty surrounding the identity of specific speech sounds (a.k.a. “phonetic segments” or “phones”).
To better understand the problem, let’s consider a hypothetical example. The word “pan” consists of three phonetic segments represented by the symbols [p], [æ], and [n] (brackets denote individual segments). An ASR system might correctly identify the initial and final consonants ([p] and [n]) but misidentify the vowel [æ], in which case the word initially “recognized” is “pin” rather than (the spoken word) “pan.” However, the vowel’s misclassification would probably be overridden by a
20 | Acoustics Today | Winter 2018
