speech is at risk of being lost due to disease or injury. For those users, the ability to bank their existing speech for later use as a personal TTS voice of the quality now emerging from the laboratory is a highly promising prospect.
We initially identified four features that seem to be of greatest importance to users for assistive voice technology: intelligibility, naturalness, identity, and expressivity. Of these four, the first three are essentially solved problems, at least for laboratory-grade neural TTS systems. Given the rate of progress with the technology, it seems likely that for these three features, medical and consumer applications will not be long in coming. Expressivity, however, remains the largest unsolved issue for TTS systems. Parametric synthesis affords the ability to control features known to relate to expressive modes of speaking, and it will be fascinating to see how natural language processing (NLP) may end up helping users quickly find the right emotion to convey along with their text when it is spoken aloud.
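As a concrete illustration of the kind of parametric control described above, the sketch below re-synthesizes an existing recording with modified pitch and speaking rate, two of the acoustic features most consistently linked to vocal expression. It assumes the WORLD vocoder accessed through the third-party Python packages pyworld and soundfile; the function name render_with_affect and the scaling factors are illustrative assumptions, not part of any particular TTS system discussed in the article.

import numpy as np
import pyworld as pw
import soundfile as sf

def render_with_affect(in_wav, out_wav, f0_scale=1.2, rate_scale=1.1):
    """Re-synthesize a recording with scaled pitch and speaking rate."""
    x, fs = sf.read(in_wav)
    if x.ndim > 1:                      # fold multichannel audio to mono
        x = x.mean(axis=1)
    x = np.ascontiguousarray(x, dtype=np.float64)

    # WORLD analysis: fundamental frequency, spectral envelope, aperiodicity
    f0, sp, ap = pw.wav2world(x, fs)

    # Expressive control 1: scale F0 (higher pitch tends to signal arousal)
    f0_mod = f0 * f0_scale

    # Expressive control 2: change speaking rate by resampling the frame
    # sequence (fewer frames = faster speech, more frames = slower)
    n_frames = max(1, int(round(len(f0_mod) / rate_scale)))
    idx = np.round(np.linspace(0, len(f0_mod) - 1, n_frames)).astype(int)
    f0_mod, sp_mod, ap_mod = f0_mod[idx], sp[idx], ap[idx]

    # WORLD synthesis from the modified parameters
    y = pw.synthesize(f0_mod, sp_mod, ap_mod, fs)
    sf.write(out_wav, y, fs)

# Hypothetical usage: a brighter, slightly faster rendering of a banked voice
# render_with_affect("neutral.wav", "excited.wav", f0_scale=1.25, rate_scale=1.1)

In a complete assistive system, the same parameter stream would come from the TTS front end rather than from an analyzed recording, and an NLP component could, in principle, map the sentiment of the user's text to such scaling factors automatically.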
About the Author
H. Timothy Bunnell
tim.bunnell@nemours.org
Nemours Children’s Hospital, Delaware Center for Pediatric Auditory and Speech Sciences
1701 Rockland Road
Wilmington, Delaware 19803, USA
H. Timothy Bunnell is the director of the Center for Pediatric Auditory and Speech Sciences (CPASS) at the Nemours Children’s Hospital, Delaware, Wilmington; head of the Speech Research Lab in the CPASS; and an adjunct professor of Computer and Information Sciences at the University of Delaware, Newark. He received his PhD in experimental psychology in 1983 from The Pennsylvania State University, University Park; served as research scientist at Gallaudet University, Washington, DC, from 1983 to 1989; and joined Nemours Children’s Health to found the Speech Research Laboratory in 1989. His research has focused on the applications of speech technology for children with hearing and speech disorders.