Synthesis of Musical Instrument Sounds
Figure 5. Synthesis results displayed as “rainbowgrams” (after Engel et al., 2017), showing the intensity given by the logarithm of the magnitude, with frequency scaled logarithmically and colored by the instantaneous frequency. Shown are similar tones for a sustained C3 note of a string bass. The original recording and GANSynth example show greater color consistency, corresponding to greater phase coherence and perceived tone quality, than the WaveNet and WaveGAN examples.
identifiable sounds for human speech, WaveGAN was preferred by human listeners for its sound quality and diversity.
Combining the successes of NSynth and WaveGAN with new refinements and insights, two similar models known as GANSynth (Engel et al., 2019) and TiFGAN (Marafioti et al., 2019) now stand as the state of the art for the NAS of musical instrument sounds. One improvement resulted from increasing the number of frequency bins in the STFT outputs by overlapping the frames more. Another refinement in these methods is the use of "instantaneous frequency," the time derivative of the phase produced by the STFT, within the neural network. Figure 5 shows a comparison between outputs from the WaveNet autoencoder used in the NSynth release, WaveGAN, and GANSynth. Given the (random) generative nature of these models, an exact tonal comparison is not feasible; however, we show similar-sounding tones for the C3 note (130.81 Hz) corresponding to the sound of a bass string. (For additional audio examples, click the Audio Examples link on the GANSynth Web page available at tinyurl.com/gansynth). The GANSynth output is superior to previous synthesized outputs in terms of its phase coherence, indicated in these "rainbowgrams" for which instantaneous frequency determines the color. Beyond quantitative testing, in a series of human trials consisting of 3,600 listening evaluations, listeners preferred GANSynth results, with approval ratings approaching (i.e., within 10% of) their preferences for acoustic recordings.
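To make the "instantaneous frequency" idea concrete, the sketch below computes an STFT with NumPy, unwraps the phase along the time axis, and takes the frame-to-frame finite difference. This is a minimal illustration of the general technique, not the GANSynth or TiFGAN implementation; the function name and parameters are our own choices.

```python
import numpy as np

def instantaneous_frequency(x, sr, n_fft=1024, hop=256):
    """Magnitude and instantaneous frequency (per-bin unwrapped phase
    differences between frames) of a short-time Fourier transform.

    A minimal NumPy sketch; real systems refine this considerably.
    """
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    # Slice the signal into overlapping windowed frames.
    frames = np.stack([x[i * hop : i * hop + n_fft] * win
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)        # shape: (frames, bins)
    mag = np.abs(spec)
    phase = np.angle(spec)
    # Unwrap along time so phase is continuous, then differentiate:
    # the result is radians of phase advance per hop, per bin.
    inst_freq = np.diff(np.unwrap(phase, axis=0), axis=0)
    return mag, inst_freq
```

For a steady sinusoid, the phase advance per hop at the dominant bin is nearly constant, which is exactly the color consistency visible in the rainbowgrams of Figure 5.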
Unlike its WaveNet-based predecessors, which were developed for variable-length audio sources such as speech and thus tended to be autoregressive (i.e., predicted one sample at a time using previous outputs) and therefore slow, GANSynth generates entire audio clips all at once. Thus, both training the model and generating new samples occur much faster than with previous methods, making it attractive for real-time sample generation. However, it remains to be seen how well it can be adapted for variable-length outputs such as entire music compositions (instead of single notes).
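The structural difference between the two generation styles can be sketched in a few lines. The `step_fn` and `decode_fn` callables below stand in for trained networks (a WaveNet-style sample predictor and a GAN generator, respectively) and are purely hypothetical; the point is that the autoregressive loop must run once per output sample, while the one-shot generator produces the whole clip in a single call.

```python
import numpy as np

def autoregressive_generate(step_fn, n_samples, context=1024):
    """Generate one sample at a time, feeding each new sample back
    into the model's context window (WaveNet-style; slow)."""
    buf = np.zeros(context + n_samples)
    for i in range(n_samples):
        buf[context + i] = step_fn(buf[i : context + i])
    return buf[context:]

def one_shot_generate(decode_fn, z):
    """Generate an entire clip in a single forward pass from a latent
    vector z (GAN-style; fast, fixed-length output)."""
    return decode_fn(z)
```

The loop in the first function is why autoregressive models are hard to run in real time, and the absence of any loop in the second is why GANSynth can generate and train much faster.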
Although tone quality has seen significant improvement and nears acoustic recordings in listening tests, the audio is typically generated at lower sample rates than compact disc (CD)-quality sound. For example, the NSynth dataset used a sample rate of 16 kHz, which allows for a maximum representable frequency of only 8 kHz. The use of such reduced sample rates is a common theme in the NAS community, with CD-quality sample rates appearing mainly in specific domains such as audio engineering (e.g., Hawley et al., 2019). For many musical instruments, however, the magnitude of spectral content above 8 kHz is typically several orders of magnitude below that of lower frequency sounds, so the use of reduced sample rates remains a reasonable restriction for these use cases. Physics-based modeling, on the other hand, can generate tones at arbitrary sample rates and often without the issues of noise and phase coherence associated with the iterative approximation scheme of NAS methods.
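The 8-kHz ceiling follows directly from the Nyquist limit: at a sample rate of 16 kHz, any component above half the sample rate is indistinguishable from a mirrored lower frequency. The short sketch below (illustrative values only) shows a 9-kHz sinusoid aliasing onto 7 kHz when sampled at 16 kHz.

```python
import numpy as np

sr = 16_000                  # NSynth's sample rate
nyquist = sr / 2             # 8 kHz: highest representable frequency
t = np.arange(64) / sr

# A 9 kHz tone lies above Nyquist...
above = np.sin(2 * np.pi * 9_000 * t)
# ...and its samples equal those of a (negated) 7 kHz tone:
# sin(2*pi*9000*n/sr) = sin(2*pi*n - 2*pi*7000*n/sr) = -sin(2*pi*7000*n/sr)
mirror = np.sin(2 * np.pi * (sr - 9_000) * t)
```

Any instrument partials above 8 kHz would therefore fold back into the audible band as inharmonic artifacts, which is why content above Nyquist must be absent (or negligible, as the article notes it usually is for these instruments).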
Conclusions
This article provides an update on the methods for musical instrument sound synthesis. Depending on the intended goal of the synthesis and the resources at one's disposal, a variety of methods are available, among which are detailed physics-based modeling of a musical instrument as well as NAS (i.e.,
26 | Acoustics Today | Spring 2020