Our reference architecture is a reduced version of
GANSynth, the state of the art in audio generation with GANs; we use fewer layers and smaller convolutional blocks than the original work. The architecture is built upon a
Progressive Growing GAN (P-GAN), borrowed from the computer vision literature, where it has become a benchmark. The generator is depicted in the figure. The generator G samples a random vector z from a spherical Gaussian and feeds it, together with a conditioning pitch label, through a stack of convolutional and box up-sampling blocks to generate the output signal x = G(z). The discriminator D is composed of convolutional and down-sampling blocks, mirroring the configuration of the generator. D estimates the
Wasserstein distance between the real and the generated distributions. By explicitly feeding the pitch label to the model as conditioning, we enable independent musical control of pitch and timbre in the synthesized audio.
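To make the block structure concrete, the following NumPy snippet is a minimal sketch (not the actual GANSynth or P-GAN implementation) of the two operations that make up one generator block: box (nearest-neighbor) up-sampling, which doubles the spatial resolution by repeating each element, followed by a toy 1x1 convolution that mixes channels. All function names and tensor shapes here are illustrative assumptions.

```python
import numpy as np

def box_upsample(x, factor=2):
    """Box (nearest-neighbor) up-sampling on a (C, H, W) tensor:
    each spatial element is repeated `factor` times along H and W."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def conv1x1(x, w):
    """Toy 1x1 convolution: a per-pixel linear map over channels.
    w has shape (C_out, C_in); x has shape (C_in, H, W)."""
    return np.einsum('oc,chw->ohw', w, x)

# Illustrative generator block: up-sample spatially, then mix channels.
rng = np.random.default_rng(0)
z = rng.normal(size=(4, 2, 2))    # small latent feature map: 4 channels, 2x2
w = rng.normal(size=(8, 4))       # 1x1 conv weights mapping 4 -> 8 channels
h = conv1x1(box_upsample(z), w)   # resulting feature map: shape (8, 4, 4)
print(h.shape)
```

Stacking several such blocks progressively grows the output resolution, which is the core idea the P-GAN training scheme exploits; the discriminator applies the mirror-image sequence of convolutions and down-sampling steps.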