Hypothesis 3:
Different classification methods perform similarly across datasets when sound features are used to predict playlist categories.
Motivation:
In a large pool of song tracks, creating playlists or classifying tracks into genres manually is tedious and inefficient. Machine learning techniques make it possible to classify tracks automatically and accurately. However, the real-world problem is complicated: some methods may be suitable while others are not. To understand which methods are appropriate, we conducted a series of experiments.
Summary:
· Decision Tree
A decision tree is a non-parametric classifier with a hierarchical tree structure; equivalently, it encodes a series of mutually exclusive rules that separate the dataset into classes. Starting from the full dataset, the tree repeatedly splits the data into smaller subsets, with each split conditioned on the outcome of the previous one.
To train a decision tree model, we need rules for splitting nodes, for stopping the splits, and for pruning an overly large tree. Impurity is the key criterion for choosing splits, as sketched below.
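As a concrete illustration, here is a minimal sketch of one common impurity measure, the Gini index; the function names are ours, and the report does not state which impurity criterion was actually used:

```python
# A minimal sketch of the Gini impurity criterion, assuming class labels
# stored in NumPy arrays; names are illustrative, not from the report.
import numpy as np

def gini(labels):
    """Gini impurity: 0 for a pure node, larger for mixed nodes."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def impurity_decrease(parent, left, right):
    """How much a candidate split reduces impurity; splits maximize this."""
    n = len(parent)
    return gini(parent) - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)
```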
· K-Nearest neighbors
KNN is a non-parametric, lazy learning algorithm that makes no assumptions about the dataset. Unlike eager learning classifiers, KNN classifies an instance by its similarity to the K nearest neighbors: we find the K nearest neighbors of the instance (for example, by Euclidean distance) and assign it the majority class among those neighbors, as in the sketch below.
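A from-scratch sketch of this rule, assuming numeric feature vectors in NumPy arrays; the variable names are illustrative:

```python
# Minimal KNN: Euclidean distance plus a majority vote among neighbors.
import numpy as np
from collections import Counter

def knn_predict(x, X_train, y_train, k=5):
    distances = np.linalg.norm(X_train - x, axis=1)       # Euclidean distances
    nearest = np.argsort(distances)[:k]                   # indices of K nearest neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0] # majority vote
```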
· Naïve Bayes
Naive Bayes methods are a set of supervised learning algorithms based on Bayes' theorem. They are suitable for high-dimensional data and assume the input features are independent. A class label is assigned according to the probability P(A|C)P(C), where A is the feature vector and C is the class label; that is, we choose the class C that maximizes P(A|C)P(C), as in the sketch below.
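A sketch of this decision rule, assuming Gaussian per-feature likelihoods (the report does not state which naive Bayes variant was used); names are illustrative:

```python
# Decision rule: argmax over C of P(A|C)P(C), computed in log space.
import numpy as np

def nb_predict(x, class_stats, priors):
    """class_stats[c] = (mean, var) vectors estimated per class from training data."""
    best_c, best_score = None, -np.inf
    for c, (mean, var) in class_stats.items():
        # Independence turns P(A|C) into a product of per-feature Gaussians;
        # in log space the product becomes a sum.
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)
        score = np.log(priors[c]) + log_lik  # log P(C) + log P(A|C)
        if score > best_score:
            best_c, best_score = c, score
    return best_c
```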
· SVM
SVM is a supervised learning method that constructs hyperplanes to separate the data. It can build the classifier with different kernels, such as linear, RBF, and sigmoid. Intuitively, SVM places its classification boundary so as to maximize the margin; for a linear kernel, the margin is the width by which the boundary could be widened before hitting a data point. The sketch below shows how such classifiers are configured.
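A brief sketch of the kernel options, assuming scikit-learn (the report does not name its library):

```python
from sklearn.svm import SVC

linear_svm = SVC(kernel="linear")    # maximum-margin hyperplane in input space
rbf_svm = SVC(kernel="rbf")          # nonlinear boundary via the RBF kernel
sigmoid_svm = SVC(kernel="sigmoid")  # sigmoid kernel
# usage: linear_svm.fit(X_train, y_train); linear_svm.predict(X_test)
```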
· Random Forest:
Random forest is an ensemble learning method for classification. To construct a random forest, a collection of trees is trained, and each tree assigns class labels independently. The output of the random forest is the aggregate of the individual trees' outputs: the mode (majority vote) for classification, or the mean for regression. More advanced techniques can be combined into a random forest, but the model's complexity makes it harder to interpret. A minimal sketch follows.
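A minimal sketch, again assuming scikit-learn; the tree count is illustrative:

```python
# Each tree is grown on a bootstrap sample, with a random subset of features
# considered at each split. Predictions are aggregated across trees (majority
# vote in the classic formulation; scikit-learn averages per-tree class
# probabilities, which amounts to the same idea).
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=100)
# usage: forest.fit(X_train, y_train); forest.predict(X_test)
```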
In the next section, we will test every method above on our dataset.
Experiment:
In the dataset, songs can be assigned to different classes, so we use binary classifiers to predict whether a song track belongs to a certain category. This means the whole dataset is split using the parentCat feature. All the basic classification methods are applied to each genre dataset, so that the performance of every method on every dataset can be compared. For each method, we use cross-validation during training and compute a confusion matrix as well as ROC curves on the test datasets; a sketch of this loop follows. After presenting and discussing the results for each method, we compare all methods at the end. We take the Party dataset as a running example to illustrate all classification techniques and thus display only the results for the Party category. All other confusion matrices and accuracy reports can be found in Classification.txt.
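A sketch of the per-genre evaluation loop, assuming scikit-learn and a pandas DataFrame df holding the audio features and a parentCat column; the variable names, column names, and split sizes are our assumptions based on this report:

```python
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

feature_columns = ["duration_ms", "popularity", "acousticness", "danceability",
                   "energy", "instrumentalness", "key", "liveness", "loudness",
                   "mode", "speechiness", "tempo", "time_signature", "valence"]

classifiers = {
    "Decision Tree": DecisionTreeClassifier(max_depth=4),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes": GaussianNB(),
    "SVM": SVC(kernel="linear"),
    "Random Forest": RandomForestClassifier(),
}

for genre in df["parentCat"].unique():
    y = (df["parentCat"] == genre).astype(int)  # binary label: in this genre or not
    X = df[feature_columns]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
    for name, clf in classifiers.items():
        cv_scores = cross_val_score(clf, X_train, y_train, cv=5)  # cross-validation
        clf.fit(X_train, y_train)
        print(genre, name, cv_scores.mean(),
              confusion_matrix(y_test, clf.predict(X_test)))
```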
1) Decision tree:
Model setting:
To control the depth of the tree, we set the maximum depth to 4. Cross-validation is used during training. Afterwards, we test the model on a held-out dataset and obtain the confusion matrix as well as the ROC curve.
Input Features:
duration_ms, popularity, acousticness, danceability, energy, instrumentalness, key, liveness, loudness, mode, speechiness, tempo, time_signature, valence
Class:
parentCat: 1 indicates the track belongs to the given genre, 0 indicates it does not.
Result:
Tree Presentation:
Since there are 29 datasets after splitting on playlist category (genre), 29 trees have been created. Here we present only the result for the category Party (zoom in for a closer look); all other results can be found in the folder DecisionTreeGraph.
The decision tree model is intuitive and easy to interpret. In the tree for Party tracks, time_signature was chosen as the root node, so this feature gives the most information gain when splitting. According to the tree, a lower time signature makes party music more likely. Moreover, tracks with popularity less than 0.631 are, most of the time, not in a party playlist. The model thus builds exclusive classification rules to separate the data: for example, when time_signature < 0.616 and popularity < 0.51 (normalized data), the Party label is assigned. A plotting sketch follows.
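A brief sketch of how a fitted tree like the Party one can be rendered, assuming scikit-learn and matplotlib; tree_clf and feature_columns are assumed to come from the training step above, and the output path is hypothetical:

```python
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plt.figure(figsize=(16, 8))
plot_tree(tree_clf, feature_names=feature_columns,
          class_names=["not party", "party"], filled=True)
plt.savefig("party_tree.png")  # hypothetical output file
```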
2) K nearest neighbor
Model setting:
For k-nearest neighbors, we use K = 5 (the default).
Input features:
duration_ms, popularity, acousticness, danceability, energy, instrumentalness, key, liveness, loudness, mode, speechiness, tempo, time_signature, valence
Class:
parentCat: 1 indicates the track belongs to the given genre, 0 indicates it does not.
Result:
3) Naive Bayes
Model setting: all features are assumed to be independent.
Input features:
duration_ms, popularity, acousticness, danceability, energy, instrumentalness, key, liveness, loudness, mode, speechiness, tempo, time_signature, valence
Class:
parentCat: 1 indicates the track belongs to the given genre, 0 indicates it does not.
Result:
4) SVM
Model setting: a linear kernel is used.
Input features:
duration_ms, popularity, acousticness, danceability, energy, instrumentalness, key, liveness, loudness, mode, speechiness, tempo, time_signature, valence
Class:
parentCat: 1 indicates the track belongs to the given genre, 0 indicates it does not.
Result:
5) Random Forest
Input features:
duration_ms, popularity, acousticness, danceability, energy, instrumentalness, key, liveness, loudness, mode, speechiness, tempo, time_signature, valence
Class:
parentCat: 1 indicates the track belongs to the given genre, 0 indicates it does not.
Result:
ROC curve comparison and performance discussion:
The ROC curve is a graphical plot that illustrates the diagnostic ability of a binary classifier as its discrimination threshold is varied; the larger the area under the curve (AUC), the better the classifier. Here we compare the performance of the 5 classifiers on the Party dataset using ROC curves; all other ROC plots can be found in the folder ROCCurve. A plotting sketch follows.
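A sketch of the ROC comparison, assuming scikit-learn and matplotlib; classifiers is the fitted dict from the earlier sketch, evaluated on the Party test split (X_test, y_test):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

for name, clf in classifiers.items():
    if hasattr(clf, "predict_proba"):      # probability scores where available
        scores = clf.predict_proba(X_test)[:, 1]
    else:                                  # e.g. SVC without probability=True
        scores = clf.decision_function(X_test)
    fpr, tpr, _ = roc_curve(y_test, scores)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {auc(fpr, tpr):.2f})")

plt.plot([0, 1], [0, 1], linestyle="--")   # chance diagonal
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```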
From the graph, all the classification methods perform well on this dataset, with AUC between 0.85 and 0.9. From the plots and tables above, it is evident that the decision tree worked best among the algorithms we implemented.
Still, all of the algorithms were fairly successful at classifying the genres. What is unexpected is that SVM achieved less satisfactory accuracy. SVM is generally a powerful classifier; its inferiority here may be because a linear kernel does not yield a good separating hyperplane for this dataset.
Overall comparison:
In order to take a closer look at the classification results, we used the Party dataset as an example above. Next, we summarize the 5 classification methods across all genre datasets. To evaluate each classifier's performance, precision on both the training and test datasets is displayed.
In the training and test data, the classification methods perform differently across genres: some genres are classified with high accuracy while others are not. This may stem from the characteristics of each genre. For example, Comedy has the highest accuracy; features such as speechiness distinguish it sharply from the other genres, making it easy to identify, which is consistent with common sense.
Generally, naïve Bayes performs poorly compared with the others. Based on Bayes' theorem, it requires the features to be independent; however, the previous analysis showed that our dataset contains closely related features, which undermines the power of naïve Bayes. The other classification methods are all fairly successful at classifying the genres.
In summary, we conducted a series of binary classifications using 5 methods over 29 datasets; full results are in the attached folders. The classification results are relatively strong, considering that the datasets have balanced classes. All classification methods work well, except that naïve Bayes is less accurate.
Accuracy might be improved further by introducing more advanced techniques; this is left for future work.
Conclusion
Throughout our analysis we have confirmed the predictive power of audio features for music genre and selected suitable classification techniques. Our next step will focus on incorporating music rankings together with audio features; we aim to derive insights about music trends with respect to artist, music genre, and audio features.