Read http://www.cse.cuhk.edu.hk/~khwong/www2/cmsc5707/Tutorial_LSTM_music_genre_classification.docx
Handle sound recording
1) https://www.goldwave.com/, download and install the free version.
2) Record/edit your sound and save it using the .au (Sun) file format; this is the format read by https://librosa.org/ , the software used to convert .au sound into MFCC codes (see the sketch after this list).
3) Install tensorflow 2 if you have not installed it before. See https://sites.google.com/site/hongslinks/tensor_windows
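To see what the conversion in step 2 produces, here is a minimal sketch of loading an .au recording with librosa and computing its MFCC features (the file name my_recording.au is hypothetical; librosa.load and librosa.feature.mfcc are the standard librosa calls the demo program uses later):

import librosa
y, sr = librosa.load("my_recording.au")             # y = samples; librosa resamples to sr = 22050 by default
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # 13 MFCC coefficients per frame
print(mfcc.shape)                                   # (13, number_of_frames)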
/////////////////////////////////////////////////////////////////////////////////////// , the main demo program is "lstm_genre_classifier_keras.py" //////////////////////////////
readme_genre_classification_lstm_201012.txt
Prerequisite: TensorFlow installation guide, see below
https://www.tensorflow.org/tutorials or https://sites.google.com/site/hongslinks/tensor_windows
error fix (if you use tensorflow 2) for code from https://github.com/keras-team/keras/tree/master/
fix: keras --> tensorflow.keras (for https://github.com/keras-team/keras)
fix: from keras.utils.data_utils import get_file --> from tensorflow.python.keras.utils.data_utils import get_file
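A minimal sketch of the fixes above as they appear at the top of a script (the model/layer imports are illustrative; the point is that under TensorFlow 2, Keras is imported through the tensorflow package):

# old (standalone keras), may fail under tensorflow 2:
# from keras.models import Sequential
# from keras.layers import LSTM, Dense
# fixed, import keras through tensorflow:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.python.keras.utils.data_utils import get_file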
You need these parts:
(1) https://github.com/ruohoruotsi/LSTM-Music-Genre-Classification ,
(2) gtzan: https://www.kaggle.com/carlthome/gtzan-genre-collection
(3) In the tensorflow environment under anaconda (admin): conda>> pip install librosa # (http://librosa.org, required for turning music files into MFCC codes)
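To confirm librosa installed correctly, a quick check in the same conda environment:
conda>> python -c "import librosa; print(librosa.__version__)"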
====== download and install LSTM-Music-Genre-Classification ==2020 Oct 12; 2020 Sept 2, khwong
1) download the zip file from
https://github.com/ruohoruotsi/LSTM-Music-Genre-Classification
2) unzip to a directory; assume "LSTM-Music-Genre-Classification-master_ok_test2"
3) Assume you are working on
C:\_projects\_5d3_tensorflow_ok\tensorflow_tested_ok_200804\LSTM-Music-Genre-Classification-master_ok_test2
You may delete all .npy files in
...\LSTM-Music-Genre-Classification-master_ok_test2\gtzan .
Reason: the .npy files are regenerated from your sound files. But if you don't use new sound files, you may use the original .npy data
(precomputed from the sound files preinstalled by the author of this GitHub software) for testing your LSTM software.
4) Download gtzan data set from https://www.kaggle.com/carlthome/gtzan-genre-collection
and unzip to ...\LSTM-Music-Genre-Classification-master_ok_test2\gtzan
5) Select and arrange your sound files and save them in the following directories:
test (20%), train (60%), validation (20%) data, using this percentage arrangement or some other choice (a splitting sketch follows the list):
...\LSTM-Music-Genre-Classification-master_ok_test2\gtzan\_test
...\LSTM-Music-Genre-Classification-master_ok_test2\gtzan\_train
...\LSTM-Music-Genre-Classification-master_ok_test2\gtzan\_validation
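A minimal sketch of one way to do the 60/20/20 split, assuming all your source .au files sit in one folder (src_dir and the seed are illustrative choices, not part of the original software; replace ... with your install path):

import os, random, shutil

src_dir = r"C:\my_au_files"  # hypothetical folder holding all your .au files
dst_root = r"...\LSTM-Music-Genre-Classification-master_ok_test2\gtzan"

files = [f for f in os.listdir(src_dir) if f.endswith(".au")]
random.seed(0)        # fixed seed so the split is reproducible
random.shuffle(files)

n = len(files)
n_train = int(0.6 * n)   # 60% train
n_val = int(0.2 * n)     # 20% validation; the remainder is test
splits = {
    "_train": files[:n_train],
    "_validation": files[n_train:n_train + n_val],
    "_test": files[n_train + n_val:],
}
for subdir, names in splits.items():
    os.makedirs(os.path.join(dst_root, subdir), exist_ok=True)
    for name in names:
        shutil.copy(os.path.join(src_dir, name), os.path.join(dst_root, subdir, name))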
============How this software works===================
tf-cpu (or tf-gpu) >> python lstm_genre_classifier_keras.py
If you run
>python lstm_genre_classifier_keras.py
it will read all .au audio files from
C:\_projects\_5d3_tensorflow_ok\tensorflow_tested_ok_200804\LSTM-Music-Genre-Classification-master_ok_test2\gtzan\_test
C:\_projects\_5d3_tensorflow_ok\tensorflow_tested_ok_200804\LSTM-Music-Genre-Classification-master_ok_test2\gtzan\_train
C:\_projects\_5d3_tensorflow_ok\tensorflow_tested_ok_200804\LSTM-Music-Genre-Classification-master_ok_test2\gtzan\_validation
The sound files will be turned into .npy (python data) in
C:\_projects\_5d3_tensorflow_ok\tensorflow_tested_ok_200804\LSTM-Music-Genre-Classification-master_ok_test2\gtzan
That's why you should delete the original .npy files before you use this software. But if you don't use new sound files,
you may use the original .npy files for testing your LSTM software.
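A quick way to inspect the generated .npy files (a sketch; data_train_input.npy is one of the file names mentioned below, and the expected shape follows from the 33 features per frame described in the analysis of GenreFeatureData.py further down; replace ... with your install path):

import numpy as np
X = np.load(r"...\gtzan\data_train_input.npy")
print(X.shape)  # expected: (number of training files, timeseries_length, 33)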
==== if you want to add more lstm layers, line 92 of lstm_genre_classifier_keras.py========
model.add(LSTM(units=128, dropout=0.05, recurrent_dropout=0.35, return_sequences=True, input_shape=input_shape))
#model.add(LSTM(units=32, dropout=0.05, recurrent_dropout=0.35, return_sequences=True)) # return_sequences=True if more layers follow
#model.add(LSTM(units=32, dropout=0.05, recurrent_dropout=0.35, return_sequences=True)) # added
model.add(LSTM(units=32, dropout=0.05, recurrent_dropout=0.35, return_sequences=False)) # added; last LSTM layer, so return_sequences=False
model.add(Dense(units=genre_features.train_Y.shape[1], activation="softmax"))
========== data structure ==============================
in GenreFeatureData.py
it defines the genre types; also see the files in
\_train
\_test
\_validation
the files are prefixed with the genre type, such as 'classical', 'disco', etc.
class GenreFeatureData:
    "Music audio features for genre classification"

    hop_length = None
    genre_list = [
        "classical",
        "country",
        "disco",
        "hiphop",
        "jazz",
        "metal",
        "pop",
        "reggae",
    ]
(See above; you may edit this file (GenreFeatureData.py) to change the directory structure and names used.
Recommendation: do not change it. A sketch of the file-name-to-label idea follows.)
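A minimal sketch of how a genre label is recovered from a file name of the form classical.00000.au (this is the idea used by the software; the exact parsing code in GenreFeatureData.py is shown further below):

fname = "classical.00000.au"
genre = fname.split(".")[0]  # text before the first "." is the class name
print(genre)                 # classical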
#################################################################################################
General description of the Music genre classification demo program in https://github.com/ruohoruotsi/LSTM-Music-Genre-Classification, See
https://sites.google.com/site/hongslinks/tensor_windows/music-genre-classification-tool
· Sound file handling: It uses librosa (librosa.org) to convert sound files (.au format) to MFCC (Mel-frequency cepstral coefficients). Data is saved in ..\gtzan\*.npy format (Python numpy format). When you run lstm_genre_classifier_keras.py, it will search for .npy files under ..\gtzan\ ; if they are not found, it will convert the sound files (.au) from ..\gtzan\_train, ..\gtzan\_test, etc. into the suitable .npy files such as data_train_input.npy and data_test_input.npy, which will be used for training and testing.
· How to encode the class for samples: the class of each input is encoded in the file name. For example, for classical.00000.au the class is classical; the program uses "." to separate the class name from the serial number of the sound file. If you want to change a class name:
o Change line 10 of GenreFeatureData.py, i.e., 'classical' --> 'AnotherClassName', etc. Do the same for the other class names, such as 'hiphop', 'jazz', etc.
o Also change the file names in ..\gtzan\_train, ..\gtzan\_test, etc. accordingly.
· How to change the network architecture: The original system has 2 LSTM layers; you may add more to improve the result. See line 92 of lstm_genre_classifier_keras.py:
model.add(LSTM(units=128, dropout=0.05, recurrent_dropout=0.35, return_sequences=True, input_shape=input_shape))
model.add(LSTM(units=32, dropout=0.05, recurrent_dropout=0.35, return_sequences=False))
model.add(Dense(units=genre_features.train_Y.shape[1], activation="softmax"))
"return_sequences=False" is for the final LSTM layer; change "return_sequences=False" to "return_sequences=True" if the layer is not the last one. Try to figure out how to add layers yourself; a sketch follows.
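A minimal sketch of a 3-LSTM-layer variant, assuming the same imports and variables (input_shape, genre_features) as lstm_genre_classifier_keras.py; the unit counts are illustrative:

model.add(LSTM(units=128, dropout=0.05, recurrent_dropout=0.35, return_sequences=True, input_shape=input_shape))
model.add(LSTM(units=64, dropout=0.05, recurrent_dropout=0.35, return_sequences=True))   # not the last LSTM layer, so return_sequences=True
model.add(LSTM(units=32, dropout=0.05, recurrent_dropout=0.35, return_sequences=False))  # last LSTM layer, so return_sequences=False
model.add(Dense(units=genre_features.train_Y.shape[1], activation="softmax"))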
================= analysis of GenreFeatureData.py=============
from line 53 in GenreFeatureData.py
# compute minimum timeseries length, slow to compute, caching pre-computed value of 1290
# self.precompute_min_timeseries_len()
# print("min(self.timeseries_length_list) ==" + str(min(self.timeseries_length_list)))
# self.timeseries_length = min(self.timeseries_length_list)
self.timeseries_length = (
    256  # original is 128 frames; tested max is 1293; timeseries_length = 1293*(512/22050) = 30.0234 s (classical.0000.au is 30.013 s, confirmed)
)  # sequence length == 128, default fft size == 2048 & hop == 512 @ SR of 22050; one hop (time between two frames) = 512/22050 = 0.0232 s, so 128 frames means 128*0.0232 s ≈ 2.97 s. (I don't understand why the original comment says it is 3.065 s)
# hop == 512 @ SR of 22050; these are preset by librosa. SR = sampling rate = 22050: irrespective of your source sampling rate, librosa resamples to 22050 after loading your file
# ?? equals 128 overlapped windows that cover approx ~3.065 seconds of audio, which is a bit small!
# hop = 512 samples = 512/22050 s ≈ 23.22 ms; sample length is 30 s for an .au file,
# so 30000 ms / 23.22 ms ≈ 1292 frames
# total length of a 30 s .au file = 30/(512/22050) = 1291.99 frames; using 128 is too small
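A quick check of the frame arithmetic in the comments above (plain Python; sr and hop are librosa's presets as stated):

sr = 22050   # librosa sampling rate after resampling
hop = 512    # librosa hop length
print(hop / sr)        # 0.02322... s per hop (time between two frames)
print(128 * hop / sr)  # 2.972... s covered by 128 frames
print(30 * sr / hop)   # 1291.99... frames in a 30 s clip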
def load_preprocess_data(self):
    print("[DEBUG] total number of files: " + str(len(self.timeseries_length_list)))

    # Training set
    self.train_X, self.train_Y = self.extract_audio_features(self.trainfiles_list)
    with open(self.train_X_preprocessed_data, "wb") as f:
        np.save(f, self.train_X)
    with open(self.train_Y_preprocessed_data, "wb") as f:
        self.train_Y = self.one_hot(self.train_Y)
        np.save(f, self.train_Y)

    # Validation set
    self.dev_X, self.dev_Y = self.extract_audio_features(self.devfiles_list)
    with open(self.dev_X_preprocessed_data, "wb") as f:
        np.save(f, self.dev_X)
    with open(self.dev_Y_preprocessed_data, "wb") as f:
        self.dev_Y = self.one_hot(self.dev_Y)
        np.save(f, self.dev_Y)

    # Test set
    self.test_X, self.test_Y = self.extract_audio_features(self.testfiles_list)
    with open(self.test_X_preprocessed_data, "wb") as f:
        np.save(f, self.test_X)
    with open(self.test_Y_preprocessed_data, "wb") as f:
        self.test_Y = self.one_hot(self.test_Y)
        np.save(f, self.test_Y)
def load_deserialize_data(self):
    self.train_X = np.load(self.train_X_preprocessed_data)
    self.train_Y = np.load(self.train_Y_preprocessed_data)
    self.dev_X = np.load(self.dev_X_preprocessed_data)
    self.dev_Y = np.load(self.dev_Y_preprocessed_data)
    self.test_X = np.load(self.test_X_preprocessed_data)
    self.test_Y = np.load(self.test_Y_preprocessed_data)
def precompute_min_timeseries_len(self):
    for file in self.all_files_list:
        print("Loading " + str(file))
        y, sr = librosa.load(file)
        self.timeseries_length_list.append(math.ceil(len(y) / self.hop_length))
def extract_audio_features(self, list_of_audiofiles):
    data = np.zeros(
        (len(list_of_audiofiles), self.timeseries_length, 33), dtype=np.float64
    )
    target = []
    for i, file in enumerate(list_of_audiofiles):
        y, sr = librosa.load(file)  # y = sound data, sr = sampling rate
        # print('661794 by experiment; 661794/22050 = 30.013 sec, ok, len(y)=')
        # print(len(y))
        mfcc = librosa.feature.mfcc(
            y=y, sr=sr, hop_length=self.hop_length, n_mfcc=13
        )
        spectral_center = librosa.feature.spectral_centroid(
            y=y, sr=sr, hop_length=self.hop_length
        )
        chroma = librosa.feature.chroma_stft(y=y, sr=sr, hop_length=self.hop_length)
        spectral_contrast = librosa.feature.spectral_contrast(
            y=y, sr=sr, hop_length=self.hop_length
        )
        splits = re.split("[ .]", file)
        genre = re.split("[ /]", splits[1])[3]
        target.append(genre)

        # use all 13 mfcc features including mfcc0
        data[i, :, 0:13] = mfcc.T[0:self.timeseries_length, :]  # original
        data[i, :, 13:14] = spectral_center.T[0:self.timeseries_length, :]
        data[i, :, 14:26] = chroma.T[0:self.timeseries_length, :]
        data[i, :, 26:33] = spectral_contrast.T[0:self.timeseries_length, :]
#####################################################################################
### if you skip mfcc0, in GenreFeatureData.py ###################################
# data[i, :, 0:12] = mfcc.T[0:self.timeseries_length, 1:13] #khw, skip mfcc0, .T is transpose
# data[i, :, 13-1:14-1] = spectral_center.T[0:self.timeseries_length, :]
# data[i, :, 14-1:26-1] = chroma.T[0:self.timeseries_length, :]
# data[i, :, 26-1:33-1] = spectral_contrast.T[0:self.timeseries_length, :]
#####################################################################################
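A side note (my reading, not from the original comments): with mfcc0 skipped, the features occupy columns 0..31, i.e., 32 features per frame instead of 33, so the data array could be allocated with 32 in its last dimension to match:

# sketch: allocate 32 feature columns instead of 33 when mfcc0 is skipped
data = np.zeros(
    (len(list_of_audiofiles), self.timeseries_length, 32), dtype=np.float64
)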
########### in predict_example.py, change the following ############
features[0, :, 0:12] = mfcc.T[0:timeseries_length,1:13]
features[0, :, 13-1:14-1] = spectral_center.T[0:timeseries_length, :]
features[0, :, 14-1:26-1] = chroma.T[0:timeseries_length, :]
features[0, :, 26-1:33-1] = spectral_contrast.T[0:timeseries_length, :]
# dd1, dd2 = mfcc.shape
# print('number of mfcc features = 13, dd1=')
# print(dd1)
# print('661794 total_samples/512_hop ~= 1292.56, 1293 by experiment, dd2=')
# print(dd2)
# print('data.size=')
# print(data.size)
# d1, d2, d3 = data.shape
# print('d1=')
# print(d1)
# print('d2=')
# print(d2)
# print('d3=')
# print(d3)
# print('len(data[i, :, 0])=')
# print(len(data[i, :, 0]))
# print('i=')
# print(i)
# print('Total num of files in ..\gtzan\_train\ = len(list_of_audiofiles) =')
# print(len(list_of_audiofiles))
# input('pause')