The Tools

MusicGen is a single-stage auto-regressive transformer model trained on 20 thousand hours of licensed music: an internal dataset of 10 thousand high-quality tracks, plus music data from ShutterStock and Pond5. It uses a 32 kHz EnCodec tokenizer with 4 codebooks sampled at 50 Hz. Unlike earlier approaches such as MusicLM, MusicGen does not need a self-supervised semantic representation, and it generates all 4 codebooks in one pass. By introducing a small delay between the codebooks, its authors show that the codebooks can be predicted in parallel, so generation takes only 50 auto-regressive steps per second of audio, matching the 50 Hz frame rate of the tokenizer.
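To make the delay idea concrete, here is a minimal sketch in plain Python. It is not MusicGen's actual code: the function names `delay_pattern` and `steps_per_clip` are hypothetical, and the one-step offset per codebook is an assumption about what "a small delay" means. It shows which frame each codebook would emit at each decoding step, and compares the step count with a naive scheme that predicts every codebook token one at a time.

```python
FRAME_RATE_HZ = 50   # EnCodec frames per second, as described above
NUM_CODEBOOKS = 4    # codebooks per frame, as described above


def delay_pattern(num_frames, num_codebooks=NUM_CODEBOOKS, pad=None):
    """Return one row per decoding step.

    Entry k of a row is the frame index that codebook k emits at that step,
    or `pad` when that codebook has nothing to emit yet (or is already done).
    Assumption: codebook k lags codebook 0 by exactly k steps.
    """
    total_steps = num_frames + num_codebooks - 1
    rows = []
    for step in range(total_steps):
        row = []
        for k in range(num_codebooks):
            frame = step - k  # codebook k is k steps behind codebook 0
            row.append(frame if 0 <= frame < num_frames else pad)
        rows.append(row)
    return rows


def steps_per_clip(seconds, frame_rate=FRAME_RATE_HZ, num_codebooks=NUM_CODEBOOKS):
    """Return (steps with the delay pattern, steps if every token were sequential)."""
    frames = int(seconds * frame_rate)
    delayed = frames + num_codebooks - 1   # all codebooks advance together
    flattened = frames * num_codebooks     # one token per step
    return delayed, flattened


if __name__ == "__main__":
    # Show the staggered layout for a tiny 5-frame clip.
    for step, row in enumerate(delay_pattern(num_frames=5)):
        print(step, row)
    # Compare step counts for a 10-second clip.
    print(steps_per_clip(10))
```

Under these assumptions, a 10-second clip needs 503 steps with the delay pattern (500 frames plus 3 extra steps for the lagging codebooks), against 2,000 steps if all 4 codebooks were predicted one token at a time. That is roughly the 50 steps per second quoted above.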

- When using MusicGen, we left model_version mostly unchanged and only modified the prompts. Here is what that looked like for us!
