Page 29 - Summer 2021

Figure 6. Feature extraction of [ǃoa]. An annotated spectrograph such as this allows researchers to extract a number of essential parameters of a sound as the first step in the digitization process. Red dots, formant trackers; blue line, pitch tracker; yellow line, intensity tracker; VOT, voice onset time. See text for further explanation.
Exemplification with ǃXóo᷈
ǃXóo᷈, identified by the ISO 639-3 code nmn, is a critically endangered language spoken by 2,000 speakers in Botswana and 500 speakers in Namibia. According to Eberhard et al. (2019), its EGIDS rating is 6b. It has achieved celebrity status of a sort among endangered languages in Africa because of the richness of its clicks [ʘ, ǀ, ǃ, ǁ, ǂ]. Ladefoged and Maddieson (1996, p. 246) report that “over 70% of the words in a ǃXóo᷈ dictionary begin with a click.” The word [ǃoa] used in this demonstration is found in a sound file located at the UCLA Phonetic Archive (available at bit.ly/3ouIC5J). The canonical syllable structure of ǃXóo᷈ is consonant-vowel (CV) or consonant-vowel1-vowel2 (CV1V2). The spectrograph in Figure 6 displays the main acoustic correlates that are extracted from that word.
Ten tiers are created in Figure 6 to extract and measure the relevant features of the word [ǃoa]. The first tier represents the IPA transcription. The second tier is an Arpabet transcription, a transcription system much like the IPA. It was designed in the 1960s to allow a standard keyboard to be used to transcribe speech accurately without the need to resort to IPA symbols.
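One way to see what "transcribable on a standard keyboard" means in practice is to check whether a transcription contains only 7-bit ASCII characters. The short Python sketch below does that for the IPA form [ǃoa] and, for comparison, for the English word "cat" in IPA and in Arpabet. The helper name `non_ascii_symbols` is made up for this illustration, and the English example is used because the text does not give the Arpabet rendering of the click word.

```python
def non_ascii_symbols(transcription: str) -> list:
    """Return the characters that fall outside the 7-bit ASCII range."""
    return [ch for ch in transcription if ord(ch) > 127]


# IPA [ǃoa]: U+01C3 is the click letter, which is not on a standard keyboard.
ipa_click_word = "\u01c3oa"
# English "cat" in IPA ([kæt]) and in Arpabet ("K AE T").
ipa_cat = "k\u00e6t"
arpabet_cat = "K AE T"

for name, text in [("IPA click word", ipa_click_word),
                   ("IPA cat", ipa_cat),
                   ("Arpabet cat", arpabet_cat)]:
    offenders = non_ascii_symbols(text)
    status = "ASCII only" if not offenders else "needs non-ASCII symbols"
    print(f"{name}: {status}")
```

A transcription that passes this check can be typed, stored, and processed with plain ASCII tools, which is the practical property the Arpabet was designed to have.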
Because the Arpabet is fully compatible with the American Standard Code for Information Interchange (ASCII), it is the premier transcription system used in coding text-to-speech and speech-to-text systems for many voice-enabled applications (Jurafsky and Martin, 2000). The third tier represents the individual segments in [ǃoa], and the fourth measures the voice onset time (VOT; see Blumstein, 2020). The VOT is the time interval between the release of the articulators that come together to produce [ǃ] and the onset of vocal fold vibration. F0, the fundamental frequency, measures the rate of vocal fold vibration. Speech has three key formants, F1, F2, and F3, which appear as areas of concentration of acoustic energy in spectrographs. They are extremely important in speech synthesis. F1 correlates with the degree of opening of the mouth when a segment is produced. F2 indicates the horizontal movement of the tongue. F3 correlates with the state of the lips, whether they are rounded or unrounded. Intensity deals with the loudness of a sound, and duration indicates the amount of time (in milliseconds) it takes to produce a sound. Taken together, these extracted values help to turn analog speech sounds into digits that become the necessary ingredients for speech synthesis.
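The measurements described above can be illustrated with a minimal Python sketch. It is not the procedure used to produce Figure 6. All numbers are invented for illustration: the segment boundaries and release time are hypothetical, and a synthetic 200-Hz tone stands in for the recording. F0 is estimated with a simple autocorrelation peak search, and intensity is reported in decibels relative to a full-scale amplitude of 1.0 rather than as a calibrated sound pressure level. Formant tracking is left out because it needs more machinery than fits in a short example.

```python
import math
from dataclasses import dataclass


@dataclass
class Interval:
    """One labelled stretch on an annotation tier (times in seconds)."""
    label: str
    start: float
    end: float

    def duration_ms(self) -> float:
        return (self.end - self.start) * 1000.0


def vot_ms(release_s: float, voicing_onset_s: float) -> float:
    """Voice onset time: from release of the closure to the start of voicing."""
    return (voicing_onset_s - release_s) * 1000.0


def intensity_db(samples, reference=1.0):
    """Root-mean-square level in dB relative to `reference` (an assumption)."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(rms / reference)


def estimate_f0(samples, sample_rate, fmin=75.0, fmax=500.0):
    """Pick the strongest autocorrelation lag within a plausible pitch range."""
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(samples[i] * samples[i + lag]
                    for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


# Invented example values, not measurements taken from the recording.
SAMPLE_RATE = 8000
tone = [0.5 * math.sin(2 * math.pi * 200 * n / SAMPLE_RATE)
        for n in range(400)]
segments = [
    Interval("click", 0.100, 0.145),
    Interval("o", 0.145, 0.260),
    Interval("a", 0.260, 0.400),
]

features = {
    "vot_ms": round(vot_ms(release_s=0.120, voicing_onset_s=0.145), 1),
    "durations_ms": {s.label: round(s.duration_ms(), 1) for s in segments},
    "f0_hz": estimate_f0(tone, SAMPLE_RATE),
    "intensity_db": round(intensity_db(tone), 2),
}
print(features)
```

The resulting dictionary is one possible shape for the "digits" the text refers to: each acoustic property of the word is reduced to a number that later processing can consume.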
Technologizing Endangered Languages
Voice-enabled speech technologies are ubiquitous in contemporary life for speakers of English and other elite world languages. However, no such technology exists in any of the more than 2,000 indigenous languages of Africa. For the languages on the brink of extinction, those with EGIDS 8a/8b and EGIDS 9, developing voice-enabled software packages for use on mobile devices can be a game changer. Mobile phone usage has exploded in Africa. A 2015 Pew Research Center study (see pewrsr.ch/3afYYdp) found that, between 2002 and 2014, mobile phone usage increased tenfold, from 8% to 83%, in Ghana and other sub-Saharan
  Summer 2021 • Acoustics Today 29
























































































