Special Issue

Early Talking Automata
Edison’s phonograph in 1877, and attention shifted to acoustic reproduction of the speech signal rather than simulation of the physical system that produced it, using the next new technology.
Modern Recapitulations? Links to the Prehistory of Speech Synthesis
When reviewing the history of mechanical speaking machines, it is remarkable how many of the problems and solutions that preoccupied whole generations of early speech scientists continue to reappear today, with striking parallels in modern theories of speech production and perception.
How is sound produced in the vocal tract? The source-filter model of speech production (Fant, 1960) was never explicitly articulated before the twentieth century, yet in all of the speaking machines there was clear recognition early on of the need for an aerodynamic flow from the lungs, a vibratory or turbulent source from the larynx, and the shaping of sound by a tube. Debates about the generation of sound by the motion of the vocal folds and glottal airflow that began with Dodart and Ferrein continue to be central to current research on aeroacoustics and fluid-structure interaction in speech (McGowan, 1992). Kempelen, Mical, and Faber all experimented extensively with different glottal geometries, making meticulous empirical observations about the influence of the glottal shape and vocal fold tension on airflow, vibration, turbulent noise, and the quality of the resulting sound. They realized the importance of damping the reed with leather and leaving a gap to bias the airflow to avoid irregular vibration and harshness. Experiments on excised larynges and mechanical analogs of the vocal folds continue this vein of inquiry to the present day (e.g., Birk et al., 2017), albeit with the added novelty of computer simulations. A further constant thread has been the realization that the mechanisms of human speech have parallels in vocal production across other species. Casserius (1600) includes comparative anatomies of the larynx in a variety of creatures (cf. Negus, 1929), whereas Vicq d’Azyr (1779) and Kempelen (1791) both consider in detail analogies between sounds and sound production in humans and other animals (cf. Fletcher, 1992).
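The source-filter idea can be made concrete with a short numerical sketch. The Python below is an illustration rather than a reconstruction of any historical machine: it drives a cascade of two-pole digital resonators (the "tube") with an impulse train (a crude stand-in for the vibrating source). The function names, the fundamental frequency, the sample rate, and the formant frequencies and bandwidths are all assumed values chosen only for illustration.

```python
import math


def resonator_coeffs(freq_hz, bandwidth_hz, sample_rate):
    """Coefficients of a two-pole digital resonator with unity gain at 0 Hz."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = (2.0 * math.exp(-math.pi * bandwidth_hz * t)
         * math.cos(2.0 * math.pi * freq_hz * t))
    a = 1.0 - b - c
    return a, b, c


def apply_resonator(signal, coeffs):
    """Filter a signal with y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    a, b, c = coeffs
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def impulse_train(f0_hz, duration_s, sample_rate):
    """Source: one unit impulse per pitch period (assumed idealization)."""
    n_samples = int(duration_s * sample_rate)
    period = int(round(sample_rate / f0_hz))
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]


def synthesize(f0_hz, formants, duration_s=0.5, sample_rate=16000):
    """Filter: pass the source through one resonator per (frequency, bandwidth)."""
    signal = impulse_train(f0_hz, duration_s, sample_rate)
    for freq_hz, bandwidth_hz in formants:
        coeffs = resonator_coeffs(freq_hz, bandwidth_hz, sample_rate)
        signal = apply_resonator(signal, coeffs)
    return signal


if __name__ == "__main__":
    # Hypothetical vowel-like formant values, not measurements.
    vowel = [(700.0, 80.0), (1200.0, 90.0), (2600.0, 120.0)]
    samples = synthesize(100.0, vowel)
    print(len(samples), "samples synthesized")
```

Changing only the impulse spacing alters the pitch, and changing only the resonator list alters the vowel quality. That independence of source and filter is the separation the early machines embodied with bellows, reed, and tube.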
Acoustics Today | Special Issue | Reprinted from volume 15, issue 2
Which vocal tract shape produces what sound? Understanding the complex many-to-one relationship between vocal tract geometry and acoustics is a perennial theme. Puzzling to all of these investigators was the difficulty in deriving appropriate tube shapes corresponding to particular speech sounds, perhaps because they lacked the ability to see inside the vocal tract but also because of the laws of physics. The low-frequency eigenmodes of the vocal tract are only sensitive to long-wavelength perturbations of vocal tract geometry, so any tube that replicates the macroscopic shape of the quasi-1-D area function regardless of microscopic 3-D details will approximate the same resonances (e.g., Ungeheuer, 1962), which is the origin of the inverse mapping problem. Anticipating the modern tendency toward increasingly overdetailed 3-D vocal tract models, Faber chose to bring his machine closer and closer to an actual vocal tract to be able to exploit physical constraints, whereas Kratzenstein, Kempelen, and Mical perhaps intuitively understood that extreme spatial accuracy or exact reproduction is not always needed, as long as a functionally equivalent tube shape is somehow achieved by hand or box. Kratzenstein’s tubes, which sound like vowels but look nothing like vocal tracts, are the classic example.
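The insensitivity of the low resonances to fine geometric detail can be checked numerically. The sketch below is a minimal illustration under stated assumptions, not a model taken from the works cited. It treats the tract as a chain of lossless cylindrical sections, closed at the glottis and ideally open at the lips, and locates the resonances as the zeros of the chain-matrix element relating glottal to lip volume velocity. The tube length (17.5 cm), sound speed (350 m/s), air density, area (4 cm²), perturbation size, and all function names are assumed for illustration.

```python
import math

SPEED_OF_SOUND = 350.0  # m/s, assumed round value
AIR_DENSITY = 1.2       # kg/m^3, assumed; only scales the impedances


def lip_flow_ratio(freq_hz, areas, section_len):
    """Return D = U_glottis / U_lips when lip pressure is zero.

    Each lossless section has chain matrix [[a, j*b], [j*c, d]] with real
    a, b, c, d, so the product is tracked in real arithmetic.
    """
    k = 2.0 * math.pi * freq_hz / SPEED_OF_SOUND
    cos_kl, sin_kl = math.cos(k * section_len), math.sin(k * section_len)
    a, b, c, d = 1.0, 0.0, 0.0, 1.0  # identity matrix
    for area in areas:  # ordered from glottis to lips
        z = AIR_DENSITY * SPEED_OF_SOUND / area
        a2, b2, c2, d2 = cos_kl, z * sin_kl, sin_kl / z, cos_kl
        a, b, c, d = (a * a2 - b * c2, a * b2 + b * d2,
                      c * a2 + d * c2, d * d2 - c * b2)
    return d


def formants(areas, section_len, n_formants=3, f_max=5000.0, step=5.0):
    """Find the lowest resonances as sign changes of D, refined by bisection."""
    found = []
    f_lo = 0.5 * step  # half-step offset keeps the grid off round frequencies
    d_lo = lip_flow_ratio(f_lo, areas, section_len)
    while f_lo < f_max and len(found) < n_formants:
        f_hi = f_lo + step
        d_hi = lip_flow_ratio(f_hi, areas, section_len)
        if d_lo * d_hi < 0.0:
            lo, hi, d_at_lo = f_lo, f_hi, d_lo
            for _ in range(60):
                mid = 0.5 * (lo + hi)
                d_mid = lip_flow_ratio(mid, areas, section_len)
                if d_at_lo * d_mid <= 0.0:
                    hi = mid
                else:
                    lo, d_at_lo = mid, d_mid
            found.append(0.5 * (lo + hi))
        f_lo, d_lo = f_hi, d_hi
    return found


def with_ripple(areas, eps):
    """Short-wavelength detail: alternate sections widened and narrowed."""
    return [a * (1.0 + eps if i % 2 == 0 else 1.0 - eps)
            for i, a in enumerate(areas)]


def with_smooth_taper(areas, eps):
    """Long-wavelength change: narrower at the glottis, wider at the lips."""
    n = len(areas)
    return [a * (1.0 - eps * math.cos(math.pi * (i + 0.5) / n))
            for i, a in enumerate(areas)]


if __name__ == "__main__":
    n_sections, tract_len = 40, 0.175
    dx = tract_len / n_sections
    uniform = [4.0e-4] * n_sections
    for label, tube in [("uniform", uniform),
                        ("fine ripple", with_ripple(uniform, 0.1)),
                        ("smooth taper", with_smooth_taper(uniform, 0.1))]:
        print(label, [round(f, 1) for f in formants(tube, dx)])
```

For the uniform tube the sketch recovers the familiar quarter-wavelength resonances near 500, 1500, and 2500 Hz. Under these assumptions, the section-by-section ripple moves the formants far less than a smooth taper of the same relative size, which is one way to see why tubes that differ in fine detail can sound alike.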
How is the vocal tract controlled, and what are the underlying goals and units of speech production? For all of the speaking machines, progress toward intelligible synthesis was only made when the temporal dynamics of speech began to be accurately captured, either mechanically, as in Mical’s programmable cylinder, or by harnessing human action systems to bootstrap the sequencing of vocal tract movements, as in Kempelen and Faber’s wind and keyboard instruments. Mical’s attempts at concatenative synthesis using fixed demisyllabic units were less successful than the flexible manual coarticulation that Kempelen’s speaking machine allowed. The solution afforded by Faber’s keyboard, which succeeded in yoking multiple articulators together in sequence as composite actions to realize the discrete sounds played by each key, is directly analogous to the modern concept of “coordinative structures” in task dynamics by which multiple end effectors are flexibly co-opted to realize a sequence of goals (Turvey, 1990). In the same light, Techmer’s “articulatory score,” which relates those goals to an alphabet of vocal tract constrictions that function as embodied phonological symbols, has striking parallels with the influential theory of gestural phonology proposed by Browman and Goldstein (1992). Articulatory control and timing have always been central theoretical and practical issues in speech, then as now.
All of these examples demonstrate that mechanical speaking machines were not simply idle amusements but rather can be considered as early attempts at fully embodied theories of speech production, successfully tackling problems that
