Miyoshi, H., Saito, Y., Takamichi, S., and Saruwatari, H. (2017). Voice conversion using sequence-to-sequence learning of context posterior probabilities. Proceedings of the International Speech Communication Association (Interspeech 2017), Stockholm, Sweden, August 20-24, 2017, pp. 1268-1272.
Morise, M., Yokomori, F., and Ozawa, K. (2016). WORLD: A vocoder-based high-quality speech synthesis system for real-time applications. IEICE Transactions on Information and Systems E99-D, 1877-1884.
Neumeyer, L., Franco, H., Digalakis, V., and Weintraub, M. (2000). Automatic scoring of pronunciation quality. Speech Communication 30, 83-93.
Pickett, J. M., and Pollack, I. (1963). Intelligibility of excerpts from fluent speech: Effects of rate of utterance and duration of excerpt. Language and Speech 6, 151-164.
Prabhavalkar, R., Rao, K., Sainath, T. N., Li, B., Johnson, L., and Jaitly, N. (2017). A comparison of sequence-to-sequence models for speech recognition. Proceedings of the International Speech Communication Association (Interspeech 2017), Stockholm, Sweden, August 20-24, 2017, pp. 939-943.
Sakoe, H., and Chiba, S. (1978). Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech and Signal Processing 26, 43-49.
Salvador, S., and Chan, P. (2007). Toward accurate dynamic time warping in linear time and space. Intelligent Data Analysis 11, 561-580.
Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks 61, 85-117.
Stevens, K. N. (2002). Toward a model for lexical access based on acoustic landmarks and distinctive features. The Journal of the Acoustical Society of America 111, 1872-1891.
Su, P.-H., Wang, Y.-B., Yu, T.-H., and Lee, L.-S. (2013). A dialogue game framework with personalized training using reinforcement learning for computer-assisted language learning. Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, BC, Canada, May 26-31, 2013, pp. 8213-8217.
Sun, L., Kang, S., Li, K., and Meng, H. (2015). Voice conversion using deep bidirectional long short-term memory based recurrent neural networks. Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, QLD, Australia, April 19-24, 2015, pp. 4869-4873.
Toda, T., Chen, L.-H., Saito, D., Villavicencio, F., Wester, M., Wu, Z., and Yamagishi, J. (2016). The voice conversion challenge 2016. Proceedings of the International Speech Communication Association (Interspeech 2016), San Francisco, CA, September 8-12, 2016, pp. 1633-1636.
van den Oord, A., Li, Y., Babuschkin, I., Simonyan, K., Vinyals, O., Kavukcuoglu, K., van den Driessche, G., Lockhart, E., Cobo, L. C., Stimberg, F., Casagrande, N., Grewe, D., Noury, S., Dieleman, S., Elsen, E., Kalchbrenner, N., Zen, H., Graves, A., King, H., Walters, T., Belov, D., and Hassabis, D. (2017). Parallel WaveNet: Fast high-fidelity speech synthesis. Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, December 4-9, 2017.
van Doremalen, J., Boves, L., Colpaert, J., Cucchiarini, C., and Strik, H. (2016). Evaluating automatic speech recognition-based language learning systems: A case study. Computer Assisted Language Learning 29, 833-851.
Warschauer, M., and Healey, D. (1998). Computers and language learning: An overview. Language Teaching 31, 57-71.
Wik, P. (2011). The Virtual Language Teacher: Models and Applications for Language Learning Using Embodied Conversational Agents. Doctoral Dissertation, KTH Royal Institute of Technology, Stockholm, Sweden.
Witt, S. M. (2012). Automatic error detection in pronunciation training: Where we are and where we need to go. Proceedings of the International Symposium on the Automatic Detection of Errors in Pronunciation Training, Stockholm, Sweden, June 6-8, 2012, pp. 1-8.
Witt, S. M., and Young, S. J. (2000). Phone-level pronunciation scoring and assessment for interactive language learning. Speech Communication 30, 95-108.
Yu, D., and Deng, L. (2015). Automatic Speech Recognition: A Deep Learning Approach. Springer-Verlag, London.
Zeng, Y. (2000). Dynamic Time Warping Digit Recognizer. MS Thesis, University of Mississippi, Oxford.
Zue, V. W., and Seneff, S. (1988). Transcription and alignment of the TIMIT database. Recent Research Towards Advanced Man-Machine Interface Through Spoken Language, pp. 515-525.
BioSketch
Steven Greenberg worked on SRI’s Autograder project in the early 1990s. More recently, he has collaborated on the development of Transparent Language’s EveryVoice™ technology. He has been a visiting professor in the Center for Applied Hearing Research at the Technical University of Denmark, Kongens Lyngby, as well as a senior scientist and research faculty at the International Computer Science Institute in Berkeley, CA. He was a research professor in the Department of Neurophysiology, University of Wisconsin, Madison, and headed a speech laboratory in the Department of Linguistics, University of California-Berkeley. He is president of Silicon Speech, a consulting company based in northern California.