Page 20 - Summer 2019
Early Talking Automata
Edison’s phonograph in 1877, and attention shifted to acoustic reproduction of the speech signal rather than simulation of the physical system that produced it, using the next new technology.

Modern Recapitulations? Links to the Prehistory of Speech Synthesis

When reviewing the history of mechanical speaking machines, it is remarkable how many of the problems and solutions that preoccupied whole generations of early speech scientists continue to reappear today, with striking parallels in modern theories of speech production and perception.

How is sound produced in the vocal tract? The source-filter model of speech production (Fant, 1960) was never explicitly articulated before the twentieth century, yet in all of the speaking machines there was clear recognition early on of the need for an aerodynamic flow from the lungs, a vibratory or turbulent source from the larynx, and the shaping of sound by a tube. Debates about the generation of sound by the motion of the vocal folds and glottal airflow that began with Dodart and Ferrein continue to be central to current research on aeroacoustics and fluid-structure interaction in speech (McGowan, 1992). Kempelen, Mical, and Faber all experimented extensively with different glottal geometries, making meticulous empirical observations about the influence of the glottal shape and vocal fold tension on airflow, vibration, turbulent noise, and the quality of the resulting sound. They realized the importance of damping the reed with leather and leaving a gap to bias the airflow to avoid irregular vibration and harshness. Experiments on excised larynges and mechanical analogs of the vocal folds continue this vein of inquiry to the present day (e.g., Birk et al., 2017), albeit with the added novelty of computer simulations. A further constant thread has been the realization that the mechanisms of human speech have parallels in vocal production across other species. Casserius (1600) includes comparative anatomies of the larynx in a variety of creatures (cf. Negus, 1929), whereas Vicq d’Azyr (1779) and Kempelen (1791) both consider in detail analogies between sounds and sound production in humans and other animals (cf. Fletcher, 1992).

Which vocal tract shape produces what sound? Understanding the complex many-to-one relationship between vocal tract geometry and acoustics is a perennial theme. Puzzling to all of these investigators was the difficulty in deriving appropriate tube shapes corresponding to particular speech sounds, perhaps because they lacked the ability to see inside the vocal tract but also because of the laws of physics. The low-frequency eigenmodes of the vocal tract are only sensitive to long-wavelength perturbations of vocal tract geometry, so any tube that replicates the macroscopic shape of the quasi-1-D area function regardless of microscopic 3-D details will approximate the same resonances (e.g., Ungeheuer, 1962), which is the origin of the inverse mapping problem. Following the modern tendency toward increasingly overdetailed 3-D vocal tract models, Faber chose to bring his machine closer and closer to an actual vocal tract to be able to exploit physical constraints, whereas Kratzenstein, Kempelen, and Mical perhaps intuitively understood that extreme spatial accuracy or exact reproduction is not always needed, as long as a functionally equivalent tube shape is somehow achieved by hand or box. Kratzenstein’s tubes, which sound like vowels but look nothing like vocal tracts, are the classic example.

How is the vocal tract controlled, and what are the underlying goals and units of speech production? For all of the speaking machines, progress toward intelligible synthesis was only made when the temporal dynamics of speech began to be accurately captured, either mechanically, as in Mical’s programmable cylinder, or by harnessing human action systems to bootstrap the sequencing of vocal tract movements, as in Kempelen and Faber’s wind and keyboard instruments. Mical’s attempts at concatenative synthesis using fixed demisyllabic units were less successful than the flexible manual coarticulation that Kempelen’s speaking machine allowed. The solution afforded by Faber’s keyboard, which succeeded in yoking multiple articulators together in sequence as composite actions to realize the discrete sounds played by each key, is directly analogous to the modern concept of “coordinative structures” in task dynamics by which multiple end effectors are flexibly co-opted to realize a sequence of goals (Turvey, 1990). In the same light, Techmer’s “articulatory score,” which relates those goals to an alphabet of vocal tract constrictions that function as embodied phonological symbols, has striking parallels with the influential theory of gestural phonology proposed by Browman and Goldstein (1992). Articulatory control and timing have always been central theoretical and practical issues in speech, then as now.

All of these examples demonstrate that mechanical speaking machines were not simply idle amusements but rather can be considered as early attempts at fully embodied theories of speech production, successfully tackling problems that
18 | Acoustics Today | Summer 2019