ing of these attributes provided an opportunity to merge multiple instrument sounds, such as “bass + flute” or “flute + organ.” The ability of NSynth to merge sounds allowed for the creation of a physical touchpad controller (Engel, 2017) from which musicians could interpolate between sounds and generate new combinations in real time. Beyond the utility of the NSynth model itself, the NSynth dataset has provided significant benefit for the field because other research groups have used it both to train models that try to exceed the performance of the NSynth model (e.g., Défossez et al., 2018) and as a means of baseline comparison between different models (e.g., Roche et al., 2018).
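To make the interpolation idea concrete, here is a minimal sketch (in Python) of how two instrument sounds might be “merged” by blending their latent codes; the encoder, decoder, and audio variables are hypothetical placeholders, not the actual NSynth implementation.

```python
import numpy as np

def interpolate_latents(z_a: np.ndarray, z_b: np.ndarray, alpha: float) -> np.ndarray:
    """Blend two latent codes: alpha = 0 gives sound A, alpha = 1 gives sound B."""
    return (1.0 - alpha) * z_a + alpha * z_b

# Hypothetical usage with a trained autoencoder (encoder/decoder not shown):
# z_bass  = encoder(bass_audio)     # latent code for a bass note
# z_flute = encoder(flute_audio)    # latent code for a flute note
# hybrid_audio = decoder(interpolate_latents(z_bass, z_flute, alpha=0.5))
```

Sweeping alpha continuously between 0 and 1 is, in essence, what a touchpad-style controller exposes to the performer.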
As noted above, the latent space representation in an autoencoder can be altered and decoded to synthesize new sounds; however, different types of sounds may become associated with disjoint regions of the latent space, making interpolation between instruments behave strangely and produce unexpected results. Furthermore, the decoding of a given set of latent features is identical each time. To provide variety in the instrument sounds and recast the system as a truly “generative model” (i.e., one that produces novel output on each use), the autoencoder paradigm can be altered to model the probability distribution of output audio features as a function of a learned probability distribution of features in latent space. Such systems are known as variational autoencoders (VAEs). Although on a simplified level VAEs often amount to replacing single values in the latent space with the mean and standard deviation of a Gaussian distribution, VAEs can be considerably more difficult to train than ordinary “vanilla” autoencoders, and significant VAE results for musical instrument synthesis have only appeared relatively recently (e.g., Çakir and Virtanen, 2018). VAEs form a bridge to another class of generative models that also model probability distributions of sounds but dispense with the autoencoder form, as follows.
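As a rough sketch of the “mean and standard deviation” idea, the snippet below shows a toy VAE-style encoder that samples a fresh latent code on every call via the so-called reparameterization trick; the layer sizes and variable names are illustrative assumptions, not those of any published instrument-synthesis model.

```python
import torch
import torch.nn as nn

class TinyVAEEncoder(nn.Module):
    """Toy encoder: predicts a Gaussian over latent space and samples from it."""
    def __init__(self, n_features: int = 128, n_latent: int = 16):
        super().__init__()
        self.hidden = nn.Linear(n_features, 64)
        self.mu = nn.Linear(64, n_latent)       # mean of the latent Gaussian
        self.logvar = nn.Linear(64, n_latent)   # log-variance of the latent Gaussian

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.hidden(x))
        mu, logvar = self.mu(h), self.logvar(h)
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)             # fresh noise each call -> novel output
        return mu + eps * std                   # "reparameterization trick"
```

Because the sampled noise differs on every call, decoding the same input twice yields slightly different sounds, which is what makes the system generative rather than a fixed mapping.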
Generative Adversarial Network Approaches
A powerful paradigm emerging in recent years for the generation of synthetic data is the GAN. A GAN can be regarded as two competing deep neural networks whose efforts are combined in a kind of “arms race” (illustrated in Figure 4). One part of the network, called the “generator,” synthesizes new data; the other part, the “discriminator,” functions as a classifier that determines whether the data at its input is coming from the generator or is prerecorded data. This process has been likened to counterfeiting, where the generator is the “criminal artist” and the discriminator is the “forensic detective.” The output from the discriminator is used to train the generator, so that over time, its outputs more closely resemble the prerecorded audio.

Figure 4. Overview of a generative adversarial network (GAN), a sort of “imitation game” played between two neural networks: a binary classifier called the discriminator seeks to improve at correctly “guessing” whether its input came from the dataset of instrument recordings or is a “forgery” synthesized by the generator. The generator uses information from the optimization procedure of the discriminator (e.g., the negative of the gradients) to synthesize increasingly “convincing” instrument sounds.
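To make the “arms race” concrete, here is a minimal, generic sketch (in Python with PyTorch) of one adversarial training step; the tiny fully connected networks, sizes, and learning rates are placeholder assumptions and do not reproduce WaveGAN or any other published audio GAN.

```python
import torch
import torch.nn as nn

n_latent, n_samples = 100, 1024   # latent vector size; length of one audio frame

generator = nn.Sequential(
    nn.Linear(n_latent, 256), nn.ReLU(), nn.Linear(256, n_samples), nn.Tanh())
discriminator = nn.Sequential(
    nn.Linear(n_samples, 256), nn.ReLU(), nn.Linear(256, 1))

opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real_audio: torch.Tensor) -> None:
    batch = real_audio.shape[0]
    fake_audio = generator(torch.randn(batch, n_latent))

    # Discriminator ("forensic detective"): label recordings 1, forgeries 0.
    d_loss = (bce(discriminator(real_audio), torch.ones(batch, 1)) +
              bce(discriminator(fake_audio.detach()), torch.zeros(batch, 1)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator ("criminal artist"): gradients flowing back through the
    # discriminator push its forgeries toward being classified as "real."
    g_loss = bce(discriminator(fake_audio), torch.ones(batch, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

Repeating this step over many batches is what gradually makes the generator's output more closely resemble the prerecorded audio.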
Initially, GANs were applied to image synthesis, followed by some synthesis of speech audio, but musical instrument synthesis remained untouched until a noteworthy preprint appeared in early 2018, stating “In this paper we introduce WaveGAN, a first attempt at applying GANs to unsupervised synthesis of raw-waveform audio” (Donahue et al., 2019). WaveGAN applied a one-dimensional version of the two-dimensional convolutions used for image-synthesizing GANs (i.e., it did not use a WaveNet architecture) directly to raw audio samples. The paper featured another model, “SpecGAN,” operating on spectrogram images in the manner of preexisting GANs. These two models were trained on datasets that included piano and drums. In the case of drums, the WaveGAN model was featured in a Web demo of an interactive drum machine (see Multimedia3 at acousticstoday.org/hawleymedia) for which novel drum samples could be synthesized at the press of a button. Although SpecGAN tended to produce slightly more easily