Music Genius
GitHub link: https://github.com/RunzeZhu28/Melody-Generator
Project Report: https://drive.google.com/file/d/1FlBhvCxOu8_JOcBI40fxJM4QnOb35Q4e/view?usp=sharing
Music Genius aims to provide an immersive musical experience, catering both to professional musicians seeking creative exploration and to amateurs looking to compose music effortlessly. To achieve this, a deep-learning model has been developed with dual functions: music genre classification and melody generation. The architecture consists of two interconnected components. The first, the classifier, takes a pure piano melody as input and predicts the music's genre. The second, the music generator, uses the user's piano melody together with the genre predicted by the classifier to generate new music.
Figure 1: Overall Block Diagram for Music Genius
For the classifier, the pure piano dataset "adl-piano-midi" was chosen from GitHub; it contains MIDI files classified into the 12 genres that our model learns.
For genre classification, Mel-Frequency Cepstral Coefficient (MFCC) audio features were extracted; this representation uses 39 features and requires the input to be WAV audio. The conversion from MIDI to WAV was performed with FluidSynth, an API that reads SoundFont and MIDI events and files and sends the digital audio output to a sound card. During conversion, we randomly sliced the pieces into segments of 5 to 20 seconds, since users' inputs are mostly short prompts and therefore incomplete.
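One possible way to script this conversion and slicing step is sketched below, assuming the midi2audio wrapper around FluidSynth; the SoundFont path, sample rate, and file paths are illustrative, not the project's actual values.

```python
# Sketch of the MIDI -> WAV conversion and random 5-20 s slicing.
# The SoundFont path and sample rate are illustrative assumptions.
import random

import librosa
import soundfile as sf
from midi2audio import FluidSynth

SR = 22050
fs = FluidSynth("soundfont.sf2", sample_rate=SR)  # hypothetical SoundFont file

def midi_to_random_slice(midi_path, wav_path, slice_path):
    # Render the MIDI file to WAV with FluidSynth.
    fs.midi_to_audio(midi_path, wav_path)

    # Load the rendered audio and cut out a random 5-20 second segment.
    y, _ = librosa.load(wav_path, sr=SR)
    seg_len = int(random.uniform(5, 20) * SR)
    start = random.randint(0, max(0, len(y) - seg_len))
    sf.write(slice_path, y[start:start + seg_len], SR)
```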
The MFCC extraction was performed with LibRosa, resulting in ndarrays of shape [t, n_mfcc], where t is the number of frames (determined by the clip duration) and n_mfcc is the number of MFCC features. By default, we set n_mfcc to 13 and zero-padded all arrays to a fixed length of 20 seconds (corresponding to t = 1722) for consistency. We then measured the proportion of silence in each piece and removed files that were more than 30% silent, to ensure a meaningful representation of the musical pattern.
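A minimal sketch of this feature-extraction and silence-filtering step is shown below. The sample rate, hop length, and top_db threshold are assumptions and would need to match the settings that yield the 1722-frame figure above; the silence check via librosa.effects.split is one possible approach.

```python
# Sketch of MFCC extraction, zero-padding, and silence filtering with librosa.
import numpy as np
import librosa

N_MFCC = 13
T_MAX = 1722  # frame count the report uses for 20 s of audio

def extract_features(wav_path):
    y, sr = librosa.load(wav_path, sr=22050)

    # Discard clips whose silent portion exceeds 30% of the total duration.
    voiced = librosa.effects.split(y, top_db=30)       # non-silent intervals
    voiced_samples = sum(end - start for start, end in voiced)
    if 1 - voiced_samples / len(y) > 0.30:
        return None

    # librosa returns [n_mfcc, t]; transpose to the report's [t, n_mfcc] shape.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=N_MFCC).T

    # Zero-pad (or truncate) to the fixed 20-second frame count.
    padded = np.zeros((T_MAX, N_MFCC), dtype=np.float32)
    padded[:min(T_MAX, len(mfcc))] = mfcc[:T_MAX]
    return padded
```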
Finally, the pre-labeled genres were converted to one-hot embeddings, and the processed dataset was split into 70% training and 30% validation.
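The label encoding and split could look like the sketch below; the dummy `features` and `labels` stand in for the real preprocessed data, and the use of sklearn's train_test_split is an assumption.

```python
# Sketch of one-hot label encoding and the 70/30 train/validation split.
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy stand-ins for the real MFCC arrays and genre labels (illustrative only).
features = [np.zeros((1722, 13)), np.ones((1722, 13)),
            np.zeros((1722, 13)), np.ones((1722, 13))]
labels = ["Jazz", "Rock", "Jazz", "Rock"]

genres = sorted(set(labels))                            # the 12 genre names
genre_to_idx = {g: i for i, g in enumerate(genres)}
one_hot = np.eye(len(genres))[[genre_to_idx[g] for g in labels]]

X_train, X_val, y_train, y_val = train_test_split(
    np.stack(features), one_hot, test_size=0.30, random_state=42)
```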
The generator uses the same adl-piano-midi dataset and the same 12 genres as the classifier. In addition, the Music21 library, which provides many music-processing functions, was used to handle data processing.
A model cannot learn music directly by "hearing" it, so the content of each piece needs to be converted into tokens that it can understand and learn from. Hence, we first loaded the songs of each genre separately and tokenized them with the Music21 library, representing each pitch by its MIDI value and each rest by -1. For example, an excerpt of a melody can be expressed as (43, 43, -1, -1, 56, 61, 57, -1, 43). This gives a sequence of tokens that accurately represents the progression of the music.
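A possible Music21 tokenization routine is sketched below. The report only specifies pitches as MIDI values and rests as -1; the chord handling (keeping the chord's top pitch) is an assumption added for completeness.

```python
# Sketch of note/rest tokenization with Music21.
from music21 import converter, note, chord

REST = -1

def tokenize(midi_path):
    tokens = []
    song = converter.parse(midi_path)
    for element in song.flatten().notesAndRests:
        if isinstance(element, note.Note):
            tokens.append(element.pitch.midi)          # pitch as its MIDI number
        elif isinstance(element, note.Rest):
            tokens.append(REST)                        # rests encoded as -1
        elif isinstance(element, chord.Chord):
            # Assumption: keep only the highest pitch of a chord.
            tokens.append(max(p.midi for p in element.pitches))
    return tokens
```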
However, the dataset has a limited number of samples, which may hurt the model's performance. To increase the amount of data, we applied sliding-window sampling, which splits each song into smaller, partially overlapping sequences of varying length within a set range. Using the example from earlier, the excerpt would be split into (43, -1, -1, 56), (-1, -1, 56, 61, 57), (61, 57, -1, 43). To improve data quality, we discarded sequences with too many rest tokens (more than 30%). Next, we one-hot encoded all the sequences, transforming their contents into indices using a token-to-index mapping, as sketched below. Finally, we split the processed dataset into training and validation sets in a 7:3 ratio and passed the data to our model.
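The sketch below illustrates the sliding-window sampling, rest filtering, and index mapping. The window lengths (4 to 5 tokens, matching the worked example), step size, and the 30% rest threshold follow the description above; the exact values used in the project may differ.

```python
# Sketch of sliding-window sampling with rest filtering and index mapping.
import random

def slide_sample(tokens, min_len=4, max_len=5, step=1, max_rest=0.30):
    """Split a token sequence into overlapping windows of varying length."""
    sequences = []
    for start in range(0, len(tokens) - min_len + 1, step):
        length = random.randint(min_len, max_len)
        window = tokens[start:start + length]
        # Discard windows with more than 30% rest tokens.
        if window.count(-1) / len(window) <= max_rest:
            sequences.append(window)
    return sequences

# Example with the excerpt from the text.
sequences = slide_sample([43, 43, -1, -1, 56, 61, 57, -1, 43])

# Map each distinct token to an index, ready for one-hot encoding.
vocab = sorted({t for seq in sequences for t in seq})
token_to_idx = {t: i for i, t in enumerate(vocab)}
encoded = [[token_to_idx[t] for t in seq] for seq in sequences]
```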
To combine the classifier and the generator, the classifier's output is fed to the generator. A pre-built dictionary maps each genre to the appropriate input and output sizes, which are used to create a new RNN model for melody generation. Pre-trained generation parameters were saved in .pth files for each genre and loaded into the model according to the classifier's prediction.
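A minimal PyTorch sketch of this wiring is shown below. The MelodyRNN class (here an LSTM-based RNN), the GENRE_CONFIG dictionary, its sizes, and the checkpoint file names are illustrative stand-ins for the project's actual code, not its real API.

```python
# Sketch of loading a per-genre generator based on the classifier's prediction.
import torch
import torch.nn as nn

class MelodyRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.fc(out[:, -1])                  # predict the next token

# Hypothetical per-genre input/output sizes (vocabularies differ by genre).
GENRE_CONFIG = {
    "Jazz": {"input_size": 64, "hidden_size": 256, "output_size": 64},
    "Rock": {"input_size": 58, "hidden_size": 256, "output_size": 58},
    # ... one entry per genre
}

def load_generator(genre):
    model = MelodyRNN(**GENRE_CONFIG[genre])
    # Load the pre-trained weights saved for this genre (file naming is assumed).
    model.load_state_dict(torch.load(f"{genre}.pth", map_location="cpu"))
    model.eval()
    return model
```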