Our reference architecture is a reduced version of
GANSynth, the state of the art in audio generation with GANs; we use fewer layers and smaller convolutional blocks than the original work. The architecture is built upon a
Progressive Growing GAN (P-GAN), borrowed from the computer vision literature, where it has become a benchmark. The generator is depicted in the figure. The generator G samples a random vector z from a spherical Gaussian and feeds it, together with a conditioning pitch label, through a stack of convolutional and box up-sampling blocks to generate the output signal x = G(z). The discriminator D is composed of convolutional and down-sampling blocks, mirroring the configuration of the generator. D estimates the
Wasserstein distance between the real and the generated distributions. By explicitly feeding the pitch label to the model as conditioning, we enable independent musical control of pitch and timbre in the synthesized audio.
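To make the block structure concrete, the following NumPy snippet is a minimal sketch (not the actual GANSynth or P-GAN implementation) of the two operations that make up one generator block: box (nearest-neighbor) up-sampling, which doubles the spatial resolution by repeating each element, followed by a toy 1x1 convolution that mixes channels. All function names and tensor shapes here are illustrative assumptions.

```python
import numpy as np

def box_upsample(x, factor=2):
    """Box (nearest-neighbor) up-sampling on a (C, H, W) tensor:
    each spatial element is repeated `factor` times along H and W."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def conv1x1(x, w):
    """Toy 1x1 convolution: a per-pixel linear map over channels.
    w has shape (C_out, C_in); x has shape (C_in, H, W)."""
    return np.einsum('oc,chw->ohw', w, x)

# Illustrative generator block: up-sample spatially, then mix channels.
rng = np.random.default_rng(0)
z = rng.normal(size=(4, 2, 2))    # small latent feature map: 4 channels, 2x2
w = rng.normal(size=(8, 4))       # 1x1 conv weights mapping 4 -> 8 channels
h = conv1x1(box_upsample(z), w)   # resulting feature map: shape (8, 4, 4)
print(h.shape)
```

Stacking several such blocks progressively grows the output resolution, which is the core idea the P-GAN training scheme exploits; the discriminator applies the mirror-image sequence of convolutions and down-sampling steps.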