
different clusterings of the same or similar data for consistency or to determine how well the cluster analysis agrees with an established partitioning method such as human analysts. The adjusted Rand index is one popular extrinsic measure that compares whether or not pairs of examples are consistently assigned to the same cluster or different ones (Hubert and Arabie, 1985).
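To make the pair-counting idea concrete, the adjusted Rand index can be computed by tallying how pairs of examples are grouped in each clustering and then correcting for chance agreement. The following is a minimal pure-Python sketch; the function name and pair-based formulation are our own illustration, not taken from any particular library:

```python
from itertools import combinations

def adjusted_rand_index(labels_a, labels_b):
    """Compare two clusterings of the same items by pair agreement."""
    n = len(labels_a)
    both_same = 0  # pairs placed together in both clusterings
    a_same = 0     # pairs placed together in clustering A
    b_same = 0     # pairs placed together in clustering B
    for i, j in combinations(range(n), 2):
        sa = labels_a[i] == labels_a[j]
        sb = labels_b[i] == labels_b[j]
        a_same += sa
        b_same += sb
        both_same += sa and sb
    total_pairs = n * (n - 1) // 2
    expected = a_same * b_same / total_pairs  # chance-level agreement
    max_index = (a_same + b_same) / 2
    if max_index == expected:  # degenerate case; no chance correction possible
        return 1.0
    return (both_same - expected) / (max_index - expected)
```

Identical partitionings score 1.0 regardless of how the clusters are numbered; chance-level assignments score near zero.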
Supervised Learning Metrics
In supervised learning scenarios, the main question is whether or not the learner will perform well on novel data. Learners should never be evaluated on the data that were used to train them. N-fold cross-validation (Bianco et al., 2019) is a commonly used strategy that splits data into N partitions. N different trials are conducted, with N−1 partitions contributing training data and the remaining partition being used for testing. When developing a classification algorithm, it is rare to produce something that works satisfactorily on the first try. Typically, there are a series of experiments where the model parameters are adjusted such that the model performs better on the test data. Such adjustments can be seen as a weak form of training, and as a consequence, it is recommended to have a held-out dataset that is not evaluated until one is satisfied with the learning algorithm.
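The N-fold partitioning scheme can be sketched in a few lines of Python. This is an illustrative implementation (the function name and shuffling step are our own choices), generating the N train/test index splits described above:

```python
import random

def n_fold_splits(n_examples, n_folds, seed=0):
    """Yield (train_indices, test_indices) for each of the n_folds trials."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)  # shuffle so folds are not ordered runs
    folds = [indices[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]  # one partition held out for testing
        # the remaining N-1 partitions contribute the training data
        train = [i for f, fold in enumerate(folds) if f != k for i in fold]
        yield train, test
```

Each example appears in exactly one test partition across the N trials, so every data point is eventually used for evaluation.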
Various metrics have been used for evaluating supervised learning. A common detection task is to determine if a type of signal is present within a fixed time bin. There are two types of error for this task. False positives or false alarms occur when a bin is mistakenly reported as an occurrence of the signal of interest. False negatives or misses occur when a signal of interest is not reported within the bin. Whether or not a signal is reported depends on the threshold. For example, if a neural network produced a probability score, we might set a lower threshold if our goal was to find all instances of a signal and a higher threshold if our goal was to minimize the nuisance caused by false alarms. Various plots have been proposed to visualize this variability. The receiver operating characteristic (ROC) curve (Fawcett, 2006) plots the false positive rate versus the true positive rate at different thresholds.
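The threshold sweep behind an ROC curve can be illustrated with a short sketch: for each candidate threshold, count the false alarms among signal-absent bins and the correct detections among signal-present bins. The function name and representation (parallel lists of scores and binary labels) are our own assumptions:

```python
def roc_points(scores, labels):
    """Sweep the decision threshold and return (fpr, tpr) pairs."""
    pos = sum(1 for y in labels if y)  # signal-present bins
    neg = len(labels) - pos            # signal-absent bins
    points = [(0.0, 0.0)]              # strictest setting: nothing reported
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        points.append((fp / neg, tp / pos))
    return points
```

Lowering the threshold moves along the curve toward (1, 1), trading more false alarms for fewer misses, exactly the trade-off described above.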
Useful variants of the ROC curve are the detection error tradeoff (DET) curve (Martin et al., 1997) and the precision-recall (PR) curve (Davis and Goadrich, 2006). The DET curve assumes that scores are normally distributed and scales the plot axes using a standard normal deviate. This has the desirable property of separating curves that are close together in ROC space, making it easier to compare systems. DET curves can also add penalties for different types of error, making it easier to see how the performance varies with respect to specific operational goals.
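The standard normal deviate scaling simply maps each error rate through the inverse of the standard normal cumulative distribution function (the probit). A minimal sketch using the Python standard library (the function name is our own; rates must lie strictly between 0 and 1 for the transform to be defined):

```python
from statistics import NormalDist

def det_point(false_alarm_rate, miss_rate):
    """Map a pair of error rates onto DET normal-deviate axes."""
    probit = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return probit(false_alarm_rate), probit(miss_rate)
```

Under this transform, a system whose scores really are normally distributed traces a straight line, which is what makes nearby systems easier to separate visually.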
PR curves plot the percentage of detections that are correct (precision) against the percentage of target signals that were correctly detected (recall). PR curves are independent of the number of signal-absent bins. For rare signals in a long time series, PR curves offer a significant advantage. The number of correctly classified signal-absent cases plays a role in ROC/DET curves and can result in low false-positive rates even when the false-positive count greatly exceeds the number of correctly detected signals. PR curves also offer the advantage of not requiring detections to be reported in fixed time bins. The F1 metric is the harmonic mean of the precision and recall at a specific operating point and can be used to summarize a point on the PR curve as a single number.
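These three quantities follow directly from the detection counts at a given operating point; note that signal-absent bins that were correctly ignored (true negatives) never enter the computation, which is why PR curves are insensitive to them. A small sketch with our own function name:

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Summarize one operating point on a PR curve."""
    precision = true_pos / (true_pos + false_pos)  # fraction of detections that are correct
    recall = true_pos / (true_pos + false_neg)     # fraction of targets that were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1
```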
For classification tasks with multiple categories, the error rate is commonly reported, and confusion matrices are frequently used to visualize the results. The rows of a confusion matrix represent actual categories, whereas the columns represent predicted categories. Counts or percentages summarize how well the system functions, with correct classifications being shown along the diagonal of the matrix.
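Building such a matrix amounts to tallying (actual, predicted) pairs. A minimal sketch, assuming labels arrive as parallel sequences (the function name is our own):

```python
def confusion_matrix(actual, predicted, categories):
    """Tally counts; rows are actual categories, columns are predicted ones."""
    index = {c: i for i, c in enumerate(categories)}
    matrix = [[0] * len(categories) for _ in categories]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1  # correct calls land on the diagonal
    return matrix
```

The error rate can then be read off as one minus the sum of the diagonal divided by the total count.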
Finally, regression tasks use some measurement of how far the prediction is from the desired target. The squared error distance is a common measurement.
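For example, the mean of the squared error distances over a set of predictions gives a single summary number (the function name here is our own):

```python
def mean_squared_error(targets, predictions):
    """Average squared distance between predictions and desired targets."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)
```

Squaring penalizes large deviations more heavily than small ones, which is often what is wanted when a few badly wrong predictions matter more than many slightly wrong ones.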
One of the drawbacks of many machine learning techniques is the so-called “black box” syndrome, where the predictions of a learner are not interpretable by the user. Some methods, such as the aforementioned decision trees, have the quality of being explainable, which can be very helpful when trying to understand why a classification failed. Most techniques, such as deep neural networks that have millions of parameters, are very difficult to interpret, and correcting errors usually requires expert insight into the root cause of a problem. Improving the ability to explain such models is an open area of research (see Linardatos et al., 2021, for a review). Strategies for understanding why models make the predictions they do
54 Acoustics Today • Winter 2021
