
groups: the vowels closest to the outlier versus those more similar to the remaining examples. This process is repeated on each group until a stopping criterion is met.
Conversely, in bottom-up processing, all elements start in their own group and the two groups that are the most similar are merged together. Returning to the formant data (Figure 1), we would merge points that are closest to one another in the formant space. This would be repeated until all the vowels were in a single group.
Either method produces a hierarchical tree. Branches of the tree can be assigned to partitions if desired (Hastie et al., 2009). Using these types of methods produces clusters that do not require the number of partitions to be known a priori.
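The bottom-up merging described above can be sketched in a few lines of Python. This is a minimal single-linkage illustration, assuming NumPy is available; the (F1, F2)-like points are invented for illustration and are not real vowel measurements.

```python
import numpy as np

def agglomerate(points, n_clusters):
    # Bottom-up: each point starts in its own group; repeatedly merge
    # the two groups whose closest members are nearest (single linkage).
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)  # merge the two closest groups
    return clusters

# Toy "formant-like" data: two tight groups in a 2-D (F1, F2) plane.
pts = np.array([[300, 2200], [320, 2250], [310, 2180],
                [700, 1100], [720, 1150], [690, 1080]], float)
print(agglomerate(pts, 2))
```

Stopping when a target number of groups is reached is only one choice; recording every merge instead yields the full hierarchical tree.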
Low-Dimensional Representations of Data
Manifold learning is a dimension reduction technique that may be used either as a feature extraction step or as a precursor to a clustering algorithm. For example, the spectra of vowels consist of many frequencies. Yet, as seen in Figure 1, the vowels can be reasonably well represented by a manifold consisting of the first two formants.
Principal components analysis (PCA; Bianco et al., 2019) is a classical method that can be used to reduce the dimensionality of feature spaces. PCA reorients the axes of the example space so that each subsequent axis accounts for less of the variance of the dataset. Because each new axis accounts for progressively less of the data’s variability, some axes can be dropped and the new reduced-dimension PCA space can provide a good approximation of the dataset.
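The axis reorientation behind PCA can be computed directly from a singular value decomposition of the centered data. Below is a minimal sketch, assuming NumPy is available; the ten-dimensional features are synthetic stand-ins generated from two hidden factors.

```python
import numpy as np

# Synthetic high-dimensional features driven by two underlying factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(100, 10))

# PCA: center the data, then rotate onto the directions of greatest
# variance (the right singular vectors of the centered matrix).
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
var_explained = s**2 / np.sum(s**2)

# Drop all but the first two axes: a reduced-dimension approximation.
X2 = Xc @ Vt[:2].T
print(var_explained[:2].sum())  # close to 1: two axes suffice here
```

Because the data were built from two factors, nearly all of the variance survives the projection; real feature spaces trade off more dimensions against lost variance.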
Two popular alternative approaches are t-distributed stochastic neighbor embedding (t-SNE; van der Maaten and Hinton, 2008) and uniform manifold approximation and projection (UMAP; McInnes et al., 2018). These nonlinear methods work by matching points in a high-dimensional space with an equal number of points in a low-dimensional space. Both attend to the local neighborhoods about points and attempt to align the distributions of points in the high- and low-dimensional spaces using information-theoretic measures. UMAP tends to better preserve gaps between clusters.
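A short sketch of the point-matching idea, assuming scikit-learn is installed; the two well-separated ten-dimensional blobs are invented stand-ins for distinct vowel classes.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Two well-separated high-dimensional blobs standing in for two classes.
X = np.vstack([rng.normal(0, 1, size=(25, 10)),
               rng.normal(8, 1, size=(25, 10))])

# t-SNE matches the 50 high-dimensional points with 50 low-dimensional
# points while trying to preserve each point's local neighborhood.
emb = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(X)
print(emb.shape)  # (50, 2)
```

The `perplexity` parameter sets the effective neighborhood size; it must be smaller than the number of examples and strongly shapes the resulting layout.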
Supervised Learning
The task of a supervised learner is to estimate a relationship based on labeled examples. In the case of regression problems, the mapping is a function, whereas classification problems partition the feature space into regions associated with categories. There are many different types of supervised classifiers, but they all do one of two things: they either learn the distribution of the data or learn boundaries between different types of data.
Distributional Learners
In classification problems, distributional learners attempt to learn the class distributions from training examples. The quantity of interest is the posterior distribution: the probability of a specific class given evidence in the form of features. In our formant data, it is the probability of a specific vowel given the formant measurements. Category decisions are made by examining the posterior probability for each class and selecting the class associated with the highest one. This is known as a Bayes classifier (Hastie et al., 2009) and is optimal when the posterior distributions are correct. Because learned distributions are approximations, this assumption is rarely met.
The posterior distribution can be difficult to estimate directly, so it is common to solve an equivalent maximization: the posterior can be replaced by the product of the probability of the evidence given a specific class (the class-conditional probability) and the probability of that class occurring (the prior probability). For example, the prior probability captures how often someone says “Hello” at all, whereas the class-conditional probability captures how likely “Hello” is when greeting someone.
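The posterior-maximization rule above can be sketched with simple Gaussian class-conditional distributions, assuming NumPy is available; the two vowel classes and their formant means are hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical formant training data for two vowel classes.
rng = np.random.default_rng(1)
vowel_a = rng.normal([700, 1100], 40, size=(50, 2))  # an /a/-like class
vowel_i = rng.normal([300, 2200], 40, size=(50, 2))  # an /i/-like class

def gaussian_logpdf(x, mean, var):
    # Log density of an axis-aligned (diagonal-covariance) Gaussian.
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

# Learn each class-conditional distribution from its labeled examples.
params = [(c.mean(axis=0), c.var(axis=0)) for c in (vowel_a, vowel_i)]

def classify(x, priors=(0.5, 0.5)):
    # Bayes classifier: pick the class with the largest posterior, i.e.,
    # class-conditional likelihood times prior (in log form, a sum).
    scores = [gaussian_logpdf(x, m, v) + np.log(p)
              for (m, v), p in zip(params, priors)]
    return int(np.argmax(scores))

print(classify(np.array([680, 1150])))  # → 0 (the /a/-like class)
print(classify(np.array([320, 2150])))  # → 1 (the /i/-like class)
```

Working in log probabilities avoids numerical underflow and turns the likelihood-times-prior product into a sum.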
Mixture models (Hastie et al., 2009) are an example of a distributional learner that uses a linear combination of simple parametric distributions (e.g., Gaussians) to model complex distributions. Each distribution in the model has a weight that controls its contribution to the complex distribution. Training the model requires estimating the mixture weights and the parameters of each parametric distribution. This can be done with an iterative procedure that alternates between determining the expected value of the mixture weights and improving the mixture parameters through maximum likelihood estimation (Bianco et al., 2019). In this type of supervised learning, we learn the distribution of each class separately. In the formant data, we have trained one model for each vowel. The training of each model is a form of unsupervised learning because we do not label the variations within specific vowels. Figure 2 shows lines of equal probability (isocontours) for each vowel, and these distributions could not easily be modeled with a single
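The alternating procedure described above is expectation-maximization, and a minimal sketch is shown below, assuming scikit-learn is installed; the two unlabeled modes within a single hypothetical vowel are invented for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
# Hypothetical formant measurements for one vowel whose class-internal
# variation (e.g., across talkers) has two unlabeled modes.
formants = np.vstack([rng.normal([300, 2200], 30, size=(60, 2)),
                      rng.normal([360, 2050], 30, size=(60, 2))])

# Fit a two-component Gaussian mixture with EM: alternate between
# computing expected component responsibilities and re-estimating the
# weights, means, and covariances by maximum likelihood.
gmm = GaussianMixture(n_components=2, random_state=0).fit(formants)
print(np.round(gmm.weights_, 2))  # roughly equal mixture weights
```

Fitting one such model per vowel, as in Figure 2, gives the per-class distributions that the Bayes classifier compares; no labels for the within-vowel modes are needed.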
50 Acoustics Today • Winter 2021
