DrumGAN: Synthesis of Drum Sounds With Timbral Feature Conditioning Using GANs


This blog post serves as supplementary material to our paper "DrumGAN: synthesis of drum sounds with timbral feature conditioning using Generative Adversarial Networks", accepted for ISMIR 2020. DrumGAN is a Generative Adversarial Network (GAN) that synthesizes drum sounds and offers high-level control over their timbral characteristics. By conditioning a Progressive-growing GAN (P-GAN) on descriptors extracted with the Audio Commons timbre models, it parametrizes the synthesis process on perceptually and musically meaningful features (e.g., boominess, sharpness, brightness). We encourage the reader to check out the paper for details regarding the architecture, the conditional features, and the quantitative results. Below, we show some examples of drum loops created by a music producer using DrumGAN. These examples give a good taste of the quality and variety of the sounds produced by DrumGAN.

HIP-HOP LOOP

TECHNO LOOP

TRAP LOOP

In the following, we present one-shot sound examples generated by DrumGAN under different conditional settings, as well as the baselines against which we compared it. Then, we show that increasing or decreasing a specific feature value of the conditioning input yields a reasonably coherent change of that feature in the synthesized audio. We also show examples of radial and spherical interpolations performed in the latent space. Finally, we give a few examples of sounds with low and high Wasserstein distances.

1) Data

The dataset used for training and evaluating DrumGAN is composed of approximately 300,000 one-shot audio samples, aligned and distributed across a balanced set of kick, snare, and cymbal sounds. The samples originally have a sample rate of 44.1 kHz and variable lengths. To simplify the task, each sample is cut to a duration of one second and down-sampled to 16 kHz. For each audio sample, we extract perceptual features with the Audio Commons timbre models. We perform a 90/10 split of the dataset, holding out 10% for validation. The model is trained on the real and imaginary components of the Short-Time Fourier Transform (STFT), computed using a window size of 2048 samples and 75% overlap.
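As a rough illustration of the representation described above, the following sketch cuts a signal to one second and computes the real and imaginary STFT components with a 2048-sample window and 75% overlap (a hop of 512 samples). It is a minimal NumPy sketch, not the code used for the paper: the function names, the Hann window, and the zero-padding of the last frame are assumptions, and the input is assumed to be mono and already resampled to 16 kHz (for instance with a polyphase resampler such as `scipy.signal.resample_poly`).

```python
import numpy as np

TARGET_SR = 16_000   # sample rate after down-sampling (from the post)
N_FFT = 2048         # STFT window size (from the post)
HOP = N_FFT // 4     # 75% overlap corresponds to a hop of 512 samples


def fix_length(audio, n_samples):
    """Trim or zero-pad a mono signal to exactly n_samples (hypothetical helper)."""
    audio = np.asarray(audio, dtype=float)
    if len(audio) >= n_samples:
        return audio[:n_samples]
    return np.pad(audio, (0, n_samples - len(audio)))


def complex_stft(audio, n_fft=N_FFT, hop=HOP):
    """Return an array of shape (2, n_bins, n_frames): real part, imaginary part."""
    window = np.hanning(n_fft)  # assumption: the post does not name the window
    # Zero-pad the tail so that the last frame is complete.
    n_frames = 1 + int(np.ceil(max(len(audio) - n_fft, 0) / hop))
    padded = fix_length(audio, (n_frames - 1) * hop + n_fft)
    frames = np.stack([padded[i * hop:i * hop + n_fft] for i in range(n_frames)])
    spectrum = np.fft.rfft(frames * window, axis=1).T  # (n_bins, n_frames)
    return np.stack([spectrum.real, spectrum.imag])


# Usage: a 440 Hz test tone stands in for a drum sample already at 16 kHz.
tone = np.sin(2 * np.pi * 440 * np.arange(3 * TARGET_SR) / TARGET_SR)
one_second = fix_length(tone, TARGET_SR)
features = complex_stft(one_second)
```

With these settings a one-second clip yields 1025 frequency bins, and the real and imaginary parts are stacked as two channels.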

Sound examples of the training data

2) DrumGAN

Here we show randomly picked examples generated by DrumGAN in the following conditional settings, using feature vectors 1) obtained from the training set ('train feats' in the paper), 2) obtained from the validation set ('val feats'), or 3) randomly sampled from a uniform distribution ('rand feats').

train feats

val feats

rand feats

3) Baselines

In the paper, DrumGAN is compared against real data (see Sec. 1), an unconditional version of DrumGAN (unconditional), and a U-Net baseline from previous work tackling the same task (i.e., synthesis of drum sounds conditioned on the same high-level attributes that are described in the paper). Below, we provide examples for each of these baselines. In the case of U-Net, the input features are either the original labels from the dataset ('real labels') or random labels sampled from a uniform distribution ('random labels').

Unconditional DrumGAN


U-Net real labels

U-Net random labels

4) Feature Coherence

We follow the methodology proposed in previous work for evaluating feature controllability. The following examples demonstrate that increasing or decreasing a specific feature value of the conditioning input yields, in most cases, a coherent change of that feature in the synthesized audio.
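The sweep behind the examples below can be sketched as follows: start from a sample's original feature vector and overwrite a single feature with 0.2, 0.5, and 0.8 while leaving the others untouched. This is only an illustration under stated assumptions. The feature list and its order are hypothetical (only these three names appear in this post), features are assumed to be normalized to [0, 1], and the monotonicity check is a simple stand-in rather than the metric used in the paper.

```python
import numpy as np

# Hypothetical feature order; the actual conditioning vector is defined in the paper.
FEATURES = ["boominess", "sharpness", "brightness"]


def sweep_feature(original, name, values=(0.2, 0.5, 0.8)):
    """Return one conditioning vector per value, changing only the feature `name`."""
    original = np.asarray(original, dtype=np.float32)
    idx = FEATURES.index(name)
    variants = np.tile(original, (len(values), 1))  # one row per requested value
    variants[:, idx] = values
    return variants


def is_coherent(measured):
    """True if the feature re-extracted from the outputs rises with the requested value."""
    return bool(np.all(np.diff(measured) > 0))


# Usage: vary 'sharpness' of a kick while keeping the other features fixed.
kick_features = [0.7, 0.3, 0.4]
conditions = sweep_feature(kick_features, "sharpness")
```

Each row of `conditions` would be fed to the generator together with the same latent vector, and the feature would then be re-extracted from the three outputs and passed to a check such as `is_coherent`.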

a) Kick

Original features

0.2

0.5

0.8

b) Snare

Original features

0.2

0.5

0.8

c) Cymbal

Original features

0.2

0.5

0.8

5) Interpolations

We perform radial and spherical interpolation experiments (with respect to the Gaussian prior) between random points selected in the latent space of DrumGAN. Both interpolations yield smooth and perceptually linear transitions in the audio domain. We notice that radial interpolation tends to change the percussion type (i.e., kick, snare, cymbal) of the output. In contrast, spherical interpolation affects other properties of the synthesized audio, such as within-class timbral characteristics and envelope. This hints at how the latent manifold is structured and suggests that relevant musical and perceptual factors of variation are disentangled.
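For readers unfamiliar with these operations, here is a minimal NumPy sketch. `slerp` is the standard spherical interpolation formula, which moves along the great arc between two latent vectors. The post does not define radial interpolation precisely, so `radial` is only one possible reading (keep the direction of a latent vector and change its norm) and should be treated as an assumption, not as the procedure used in the paper.

```python
import numpy as np


def slerp(z0, z1, t):
    """Spherical interpolation between z0 (t=0) and z1 (t=1) along the great arc."""
    z0, z1 = np.asarray(z0, dtype=float), np.asarray(z1, dtype=float)
    cos_omega = np.dot(z0, z1) / (np.linalg.norm(z0) * np.linalg.norm(z1))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))  # angle between the two vectors
    if np.isclose(np.sin(omega), 0.0):
        # (Anti)parallel vectors: the arc is undefined, fall back to a straight line.
        return (1.0 - t) * z0 + t * z1
    return (np.sin((1.0 - t) * omega) * z0 + np.sin(t * omega) * z1) / np.sin(omega)


def radial(z, r0, r1, t):
    """Assumed reading of 'radial': keep the direction of z, move its norm from r0 to r1."""
    z = np.asarray(z, dtype=float)
    return z / np.linalg.norm(z) * ((1.0 - t) * r0 + t * r1)


# Usage: ten steps between two random latent points (the dimension 128 is arbitrary here).
rng = np.random.default_rng(0)
z_a, z_b = rng.standard_normal(128), rng.standard_normal(128)
path = np.stack([slerp(z_a, z_b, t) for t in np.linspace(0.0, 1.0, 10)])
```

Unlike a straight line between two Gaussian samples, the spherical path does not pass through the low-norm region near the origin, which is why it is commonly preferred for Gaussian priors.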

radial interpolation

spherical interpolation

6) In/outliers

For this experiment, we use DrumGAN's discriminator to estimate the Wasserstein distance over a set of 10k generations. Here we show the ten generated sounds with the lowest and the highest absolute distance from the set. As we can hear, samples with a low Wasserstein distance have better sound quality and a timbre similar to that of the original data (see Sec. 1). In contrast, the examples with the highest distance have odd-sounding characteristics. These sounds have less in common with the real data and are less likely to be generated.
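The selection step can be sketched as a simple ranking. The snippet assumes that the discriminator has already produced one scalar distance estimate per generated sound; how that estimate is obtained is described in the paper, and the function name and the random scores below are placeholders for illustration only.

```python
import numpy as np


def rank_by_distance(scores, k=10):
    """Return the indices of the k lowest and the k highest absolute scores."""
    order = np.argsort(np.abs(np.asarray(scores, dtype=float)))
    lowest = order[:k]          # closest to the real data according to the critic
    highest = order[-k:][::-1]  # most distant first
    return lowest, highest


# Usage: random numbers stand in for the discriminator outputs of 10k generations.
rng = np.random.default_rng(0)
fake_scores = rng.standard_normal(10_000)
inliers, outliers = rank_by_distance(fake_scores, k=10)
```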

low Wasserstein distance

high Wasserstein distance