MACHINE LEARNING AND ACOUSTICS
The dot product is then transformed (Figure 3b) by a differentiable nonlinear function called an activation (e.g., a sigmoid function or one that sets negative values to zero). Nodes are arranged into layers, and in a classic feed-forward network, the outputs of one layer serve as inputs to the next layer (Figure 3c). Thus, one can think about each node as making a local decision about which side of a hyperplane its inputs fall on and propagating this knowledge to subsequent nodes.
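The node computation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular trained network: the weights, biases, and layer sizes below are made up, and the activation shown is the rectifier (the function that sets negative values to zero).

```python
import numpy as np

def relu(z):
    # Activation that sets negative values to zero.
    return np.maximum(z, 0.0)

def layer(x, W, b):
    # Each row of W holds one node's weights: a dot product with the
    # inputs, plus a bias, passed through the nonlinear activation.
    return relu(W @ x + b)

x = np.array([1.0, -2.0, 0.5])          # inputs (illustrative values)
W1 = np.array([[0.2, -0.1, 0.4],        # first layer: three nodes
               [0.0,  0.3, -0.2],
               [0.5,  0.1,  0.1]])
b1 = np.zeros(3)
W2 = np.array([[1.0, -1.0, 0.5]])       # second layer: one node
b2 = np.zeros(1)

h = layer(x, W1, b1)   # hidden-layer outputs feed the next layer
y = layer(h, W2, b2)   # feed-forward: outputs of one layer are inputs to the next
```

Note that a node whose weighted sum is negative outputs zero here; its "local decision" about which side of the hyperplane the input falls on is what propagates forward.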
The final layer is responsible for the prediction: it outputs either a predicted value for regression problems or a category (Figure 3d), frequently represented as a vector giving the probability of belonging to each class. The recent interest in deep networks, networks that have many layers, is due to these networks' repeated ability to provide significant advances in the state of the art across a wide range of problem domains (LeCun et al., 2015), as well as to transfer learning, which applies knowledge from an already trained network to a new dataset that is similar to, but far from identical to, the original one.
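A common way (not named in the article, but standard practice) to turn a final layer's raw scores into such a probability vector is the softmax function:

```python
import numpy as np

def softmax(scores):
    # Subtracting the maximum score before exponentiating avoids
    # numerical overflow without changing the result.
    e = np.exp(scores - scores.max())
    return e / e.sum()

# Three made-up class scores from a final layer:
p = softmax(np.array([2.0, 1.0, 0.1]))
# p is nonnegative, sums to 1, and the largest score
# receives the largest probability.
```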
Neural network training is usually accomplished by an iterative procedure called backpropagation (Goodfellow et al., 2016; Bianco et al., 2019). At each step, training examples are presented to the network. For each example, the loss, a measure of deviation from the intended result, is computed. The derivative (gradient) of the loss function with respect to a node's weights indicates the direction in which changing the weight vector would create the largest increase in loss. To decrease the loss, the weights can be modified by a small amount in the opposite direction (gradient descent). This technique can be "backpropagated" through the network, computing the loss gradient at each node and permitting adjustments to weights in layers other than the last one. The training process is repeated until a convergence criterion is met. Backpropagation depends on many factors, and a thorough discussion can be found in Goodfellow et al. (2016).
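The gradient-descent step at the heart of this procedure can be shown on the simplest possible case: a single linear node trained on one example with a squared-error loss. The data, target, and learning rate below are invented for illustration; backpropagation extends the same idea through multiple layers via the chain rule.

```python
import numpy as np

x = np.array([1.0, 2.0])   # one training example (illustrative)
t = 3.0                    # intended (target) output
w = np.zeros(2)            # initial weights
lr = 0.05                  # step size (learning rate)

for step in range(100):
    y = w @ x                    # node output
    loss = (y - t) ** 2          # deviation from the intended result
    grad = 2.0 * (y - t) * x     # d(loss)/d(w): direction of steepest loss increase
    w -= lr * grad               # small step in the opposite direction

# After repeated steps, the output w @ x converges toward the target t.
```

The key line is the last one inside the loop: moving the weights *against* the gradient is what makes the loss shrink from one iteration to the next.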
In acoustics, neural networks have provided advances in speech recognition (Hinton et al., 2012), room localization (Chakrabarty and Habets, 2019), direction of arrival estimation (Ozanich et al., 2020), bioacoustics (Stowell et al., 2019), sea bed classification (Frederick et al., 2020), and increasing speech intelligibility (Healy et al., 2019), among many other areas.
Two popular forms of neural networks are convolutional neural networks (CNNs) and recurrent neural networks (RNNs) (Goodfellow et al., 2016). Convolutional networks are used to recognize local structure both in the input space and in hidden layers that contain abstract representations of information needed to make a decision. Convolutional layers learn matched filters that are moved across the input or intermediary data. These outputs are filtered again and combined in subsequent layers and may have operations to reduce the dimension (pooling). A final set of feed-forward layers performs classification or regression. Figure 4a shows an application of a convolutional network to the problem of detecting a type of contact call produced by endangered North Atlantic right whales (Eubalaena glacialis). The first set of learned filters are shown. Some of these filters produce strong outputs when calls are present and others when calls are absent, and they serve as features for subsequent layers of the network.
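The matched-filter behavior of a convolutional layer, followed by pooling, can be sketched on a one-dimensional signal. The kernel and signal below are hand-picked for illustration rather than learned; in a real CNN, the kernel weights would be found by backpropagation.

```python
import numpy as np

def conv1d(signal, kernel):
    # Slide the kernel across the signal; each output is a dot product,
    # so the response is strongest where the signal resembles the kernel.
    n = len(signal) - len(kernel) + 1
    return np.array([signal[i:i + len(kernel)] @ kernel for i in range(n)])

def relu(z):
    return np.maximum(z, 0.0)

def max_pool(x, size=2):
    # Reduce the dimension by keeping only the maximum of each window.
    return np.array([x[i:i + size].max()
                     for i in range(0, len(x) - size + 1, size)])

signal = np.array([0.0, 0.0, 1.0, -1.0, 1.0, 0.0, 0.0])
kernel = np.array([1.0, -1.0, 1.0])   # the "template" this filter matches

feature_map = relu(conv1d(signal, kernel))
pooled = max_pool(feature_map)
```

The feature map peaks exactly where the signal contains the template, which is the sense in which a learned filter acts as a detector for one local pattern.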
Recurrent neural networks introduce dependencies among subsequent inputs, allowing the network to learn temporal structure (Figure 4b). They are commonly used in time-varying acoustic problems where signal evolution is important, such as speech recognition (e.g., Amodei et al., 2016). A drawback to these types of units is that information decays at each time step, making it difficult to learn long-term dependencies. There are several variants of this architecture, such as gated recurrent units (GRUs) and long short-term memory units (LSTMs), that allow network nodes to learn concepts such as when the input is relevant or when the input history state should be cleared (Goodfellow et al., 2016). It is common to combine convolutional and recurrent networks, with the convolutional network acting as a feature extractor and the recurrent network capturing temporal relationships between the features.
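A scalar version of the basic recurrent step makes both properties visible: the hidden state carries history forward, yet repeated multiplication shrinks old information, which is the decay that GRU and LSTM gating is designed to counteract. The weights here are fixed, illustrative values rather than trained ones.

```python
import numpy as np

def rnn(inputs, w_in=1.0, w_rec=0.5):
    # Vanilla recurrent unit: the new hidden state depends on the
    # current input AND on the previous hidden state.
    h = 0.0
    for x in inputs:
        h = np.tanh(w_in * x + w_rec * h)
    return h

# The same final input yields different outputs depending on history,
# but the influence of the early input has already decayed:
a = rnn([1.0, 0.0, 0.0, 0.5])
b = rnn([0.0, 0.0, 0.0, 0.5])
```

Here `a` and `b` differ (the network remembers the first input), yet only slightly (that memory fades with each step of recurrent multiplication through `w_rec < 1`).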
Bias and Variance
Most nontrivial problems have an inherent confusability that cannot be resolved regardless of the learner. This is called the Bayes error. Learning algorithms do not usually achieve the Bayes error, and the additional error can be attributed to two sources, bias and variance (Hastie et al., 2009). Bias is the additional error that can be attributed to a learner not being flexible enough to learn the distributions or separating boundary. For example, in the formant data, the vowels ʊ (book) and u (boot) contain examples that cannot be separated by a linear boundary. Any classifier incapable of producing
52 Acoustics Today • Winter 2021