2016). A European project, the “Spoken CALL Shared Task” (Baur et al., 2017), offers an illustration of how online evaluation and feedback may operate in the future. The competitive task was based on data collected from a speech-enabled online tool used to help young Swiss German teens practice skills in English conversation. Items were prompt-response pairs, where the prompt is a piece of German text and the response is a recorded English audio file. The task was to “accept” or “reject” responses that may or may not be grammatically and linguistically correct. The task involved more than conventional ASR because it also required discerning semantically and grammatically appropriate responses using natural language processing. The winning entry (from the University of Manchester, UK) used an ASR system trained with DNNs.
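In schematic terms, such a system pipes the learner’s audio through ASR and then applies an accept/reject decision to the transcript. The sketch below is a deliberately minimal, hypothetical stand-in for that second stage: the reference answers, the threshold, and the token-overlap measure are invented for illustration and are far cruder than the actual shared-task entries.

```python
# Toy illustration of the accept/reject decision in a Spoken CALL-style task.
# A real system would first run DNN-based ASR on the recorded audio; here we
# assume a transcript is already available. References and threshold are
# hypothetical.

def token_overlap(transcript: str, reference: str) -> float:
    """Fraction of reference tokens present in the response (a crude meaning check)."""
    ref_tokens = reference.lower().split()
    resp_tokens = set(transcript.lower().split())
    if not ref_tokens:
        return 0.0
    return sum(1 for t in ref_tokens if t in resp_tokens) / len(ref_tokens)

def judge(transcript: str, references: list[str], threshold: float = 0.8) -> str:
    """Accept the response if it is close enough to any acceptable answer."""
    best = max(token_overlap(transcript, r) for r in references)
    return "accept" if best >= threshold else "reject"

# Prompt (German): ask for the price. Acceptable English responses:
refs = ["how much does it cost", "how much is it"]
print(judge("how much is it", refs))        # prints "accept"
print(judge("where is the station", refs))  # prints "reject"
```

A competitive entry would replace the overlap measure with ASR confidence scores and classifiers of grammatical and semantic appropriateness trained on the task data.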
A somewhat different approach is used by the “Virtual Language Tutor” (Wik, 2011), which is an embodied conversational agent that can be spoken to and that, in turn, can talk back (via speech synthesis) to the student. The agent guides, encourages, and provides feedback for mastering a foreign language (initially, Swedish).
The Future of CALL
Several trends in language-learning software are worth noting. Most will likely be enabled through some form of deep learning, among which are the following:
Games
The app FluentU takes real-world videos, such as music videos, movie trailers, news, and inspiring talks, and turns them into personalized language-learning lessons. LingoArcade, Mindsnacks, and DigitalDialects are just a few of the online sites for learning a foreign language using similar material, all within a game-based structure. Su et al. (2013) illustrate several ways to “gamify” dialogue-based language learning.
Virtual Language Learning
Applications such as ImmerseMe and Mondly place the student in simulated, real-life scenarios, such as a bakery or restaurant, where language skills can be practiced in an engaging way. In these apps, ASR evaluates the student’s responses and offers feedback.
Intelligent Language Tutors
Applications such as Duolingo are starting to use “chatbots” to interact with students on a variety of topics to enhance vocabulary and grammar skills. These bots are driven by a combination of ASR, natural language processing, and other forms of artificial intelligence to guide the student through language lessons in naturalistic settings.
Automatic Language Translation
A Defense Advanced Research Projects Agency (DARPA)-funded project, TransTac (Bach et al., 2007), was an early, albeit limited, attempt to provide automatic language translation in a handheld box (for deployment in the Middle East). Among the languages offered were Iraqi Arabic and Dari. Waverly Labs sells the Pilot™, an earbud-enabled app that performs simultaneous translation in near real time for over a dozen languages. Google Translate offers the ability to translate from one language to another; among the languages offered for paired translation are English, French, German, Italian, Portuguese, Russian, and Spanish. Google also provides an optical version (using a smartphone camera) that translates signs and other text into one’s native language. Microsoft has demonstrated simultaneous translation between English and Mandarin Chinese, powered by a DNN that can meld the speaker’s voice characteristics with the translated speech. These applications are not especially useful (yet) because they lack the semantic precision and emotional nuance emblematic of human communication, so they are best reserved for simple scenarios such as grocery shopping and sightseeing.
Speech Synthesis
The quality and naturalness of speech synthesis have greatly improved, largely due to the ability of DNNs to simulate voices with realism. Baidu’s Deep Voice (Arik et al., 2017), Amazon’s Polly, Microsoft’s Cortana, and Google’s Cloud Text-to-Speech (TTS) applications all use DNNs. Google offers TTS in a dozen languages. DeepMind’s WaveNet (van den Oord et al., 2017) offers highly realistic synthesis for English and Japanese in multiple voices.
Voice Conversion
Speech synthesis has improved to the point where it is now possible to transform or meld the voice characteristics of one talker into another while preserving intelligibility. Current state-of-the-art systems (Toda et al., 2016) use a special-purpose vocoder (e.g., STRAIGHT, Kawahara et al., 1999; WORLD, Morise et al., 2016) as the synthesis engine. Two of the more advanced voice conversion approaches use DNNs, either long short-term memory (LSTM)-based recurrent neural networks (Sun et al., 2015) or sequence-to-sequence learning (Miyoshi et al., 2017).
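The core idea behind these systems can be illustrated in a drastically simplified form: given time-aligned spectral feature frames from a source and a target speaker, learn a function that converts one into the other, then resynthesize audio from the converted features with a vocoder. The sketch below substitutes a frame-wise linear mapping fit by least squares for the LSTM, and synthetic random data for real vocoder features; it is a conceptual illustration, not any published system.

```python
import numpy as np

# Hypothetical illustration of voice conversion as a learned frame-wise mapping.
# Real systems map vocoder features (e.g., mel-cepstra from WORLD or STRAIGHT)
# with LSTM networks; here a linear map fit by least squares stands in for the
# DNN, and random data stands in for time-aligned speech frames.

rng = np.random.default_rng(0)
dim, frames = 24, 500            # feature dimension, number of aligned frames

# Synthetic "ground truth": target-speaker frames are a fixed linear transform
# of source-speaker frames plus a small amount of noise.
true_map = rng.normal(size=(dim, dim))
src = rng.normal(size=(frames, dim))                       # source features
tgt = src @ true_map.T + 0.01 * rng.normal(size=(frames, dim))

# Training: fit the conversion function from parallel (source, target) frames.
learned_map, *_ = np.linalg.lstsq(src, tgt, rcond=None)

# Conversion: apply the learned mapping to unseen source frames. In a real
# system, the converted features would then drive a vocoder to produce audio.
new_src = rng.normal(size=(10, dim))
converted = new_src @ learned_map
reference = new_src @ true_map.T
print(np.max(np.abs(converted - reference)))   # small residual error
```

The appeal of LSTMs over such a frame-wise map is that they condition each converted frame on the surrounding context, capturing coarticulation and prosodic trajectories that no per-frame transform can model.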
Winter 2018 | Acoustics Today | 25