It is not computationally feasible to train our models on the entire playlist dataset. With this in mind, we build a pipeline to extract random samples of playlists and songs from the Million Playlist dataset and retrieve the relevant features for each song from the Spotify API (we use the Spotipy library to interact with the API succinctly).
Before any random process, we set the random seed to ensure reproducibility.
We begin with a function, instantiate_api, that instantiates a Spotipy interface to the Spotify API with our credentials. The credentials are purposefully censored.
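A minimal sketch of such a function, assuming Spotipy's client-credentials flow (the placeholder credentials are, of course, hypothetical):

```python
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

def instantiate_api(client_id="CLIENT_ID", client_secret="CLIENT_SECRET"):
    """Return an authenticated Spotipy client using the client-credentials flow."""
    auth = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
    return spotipy.Spotify(client_credentials_manager=auth)
```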
Next we build a master track-feature function, get_train_data. When given a Pandas data frame of tracks (much like what we find in the Million Playlist dataset), this function retrieves the relevant track features and one-hot encodes the necessary categorical variables. There are several technical considerations, such as the fifty-track upper limit on track-feature API calls and the expiration time on API credentials; when necessary, this function instantiates a new Spotify API interface with instantiate_api. Because we often call this function on small collections of songs, we manually ensure that the returned feature matrix contains all possible one-hot encoded levels.
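The sketch below captures the core logic under a few assumptions: the input frame has a track_uri column, audio features are fetched in batches of fifty, and the set of time-signature levels we enforce is illustrative.

```python
import pandas as pd

KEYS = list(range(12))              # Spotify pitch-class keys 0-11
MODES = [0, 1]                      # minor / major
TIME_SIGNATURES = [1, 3, 4, 5, 7]   # assumed set of levels to enforce

def get_train_data(tracks, sp):
    """Fetch audio features for each track and one-hot encode the categoricals."""
    uris = tracks["track_uri"].tolist()
    rows = []
    for i in range(0, len(uris), 50):            # fifty-track API limit per call
        batch = uris[i:i + 50]
        try:
            feats = sp.audio_features(batch)
        except Exception:                        # credentials expired: re-authenticate
            sp = instantiate_api()
            feats = sp.audio_features(batch)
        rows.extend(f for f in feats if f is not None)
    df = pd.DataFrame(rows)
    # Force every categorical level to appear, even for small samples of songs.
    for col, levels in [("key", KEYS), ("mode", MODES), ("time_signature", TIME_SIGNATURES)]:
        df[col] = pd.Categorical(df[col], categories=levels)
    return pd.get_dummies(df, columns=["key", "mode", "time_signature"])
```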
Spotify's Million Playlist dataset is stored in a series of 1000 files, each containing 1000 playlists. Below we define a function get_playlists which takes two arguments: the number of files to collect data from and the number of playlists to collect from each file. The files are chosen randomly, and the playlists are chosen randomly within each file. We augment the playlist id of each song by prefixing it with the file number (from 0 to 999), separated by an underscore. The random seed used to select playlists within each file is iteratively updated so that we do not choose playlists from the same locations in every file. We use this function to collect random subsets of the Million Playlist dataset in order to train models.
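A sketch of this function, assuming each slice file stores its playlists under a "playlists" key with "pid" and "tracks" fields (the directory name and exact field names are illustrative):

```python
import json
import os
import numpy as np
import pandas as pd

def get_playlists(n_files, n_playlists, data_dir="mpd_data", seed=9001):
    """Collect n_playlists random playlists from each of n_files random slice files."""
    np.random.seed(seed)
    all_files = sorted(os.listdir(data_dir))
    file_idx = np.random.choice(len(all_files), n_files, replace=False)
    records = []
    for step, fi in enumerate(file_idx):
        np.random.seed(seed + step)                      # iteratively updated in-file seed
        with open(os.path.join(data_dir, all_files[fi])) as f:
            playlists = json.load(f)["playlists"]
        chosen = np.random.choice(len(playlists), n_playlists, replace=False)
        for idx in chosen:
            pl = playlists[idx]
            for tr in pl["tracks"]:
                records.append({
                    "playlist_id": f"{fi}_{pl['pid']}",  # file number + in-file playlist id
                    "track_uri": tr["track_uri"],
                    "track_name": tr["track_name"],
                })
    return pd.DataFrame(records)
```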
We now define a function get_songs to collect a random subset of songs from the Million Playlist dataset. This function is mechanically similar to get_playlists but returns only the relevant song variables. Again, we iteratively update the in-file random seed.
With our pipeline built, we can collect data, build train and test datasets, and begin to build models.
We first randomly select 20 complete playlists from 50 different files for a total of 1000 complete playlists. These playlists contain a total of 69,516 songs.
get_train_data gives us the song features. Using Sklearn's train_test_split, we split the playlist dataset into training and test sets (using an 80-20 split). Here we also select our baseline response variable: playlist id. The resulting design matrices and response vectors for the train and test data are structured accordingly. Importantly, we stratify on playlist id, which ensures that the songs of each playlist are divided between the train and test datasets.
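A minimal sketch of the split, assuming songs is the one-hot encoded feature matrix from get_train_data with each song's augmented playlist id attached (the random state is illustrative):

```python
from sklearn.model_selection import train_test_split

# 80-20 split, stratified on playlist id so every playlist appears in both sets.
train_songs, test_songs = train_test_split(
    songs, test_size=0.2, stratify=songs["playlist_id"], random_state=9001
)
X_train, y_train = train_songs.drop(columns="playlist_id"), train_songs["playlist_id"]
X_test, y_test = test_songs.drop(columns="playlist_id"), test_songs["playlist_id"]
```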
Now we need a collection of songs from which we will select our recommendations. We use get_songs and get_train_data to build our potential songs dataset.
Before modeling, we build a series of functions to efficiently report and record model results and to display playlist recommendations.
The models below are trained on Spotify track features. They are designed to determine which playlist a particular song belongs to. We build a list of recommended songs for a given playlist by sorting the prediction probabilities for a particular playlist in descending order. Intuitively, the song that a given classification model predicts to be a member of a given playlist with the highest probability is that model's best guess at which song would best belong in that playlist. These models are not properly tuned as their purpose is exploratory.
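Concretely, the recommendation step looks something like the sketch below, where model is any fitted Sklearn classifier, candidate_songs is a candidate feature matrix, and candidate_meta holds the matching track names and URIs (all names are illustrative).

```python
import numpy as np

def recommend(model, playlist_id, candidate_songs, candidate_meta, n=5):
    """Return the n songs the model deems most likely to belong to playlist_id."""
    class_idx = list(model.classes_).index(playlist_id)
    probs = model.predict_proba(candidate_songs)[:, class_idx]
    top = np.argsort(probs)[::-1][:n]        # highest predicted probability first
    return candidate_meta.iloc[top]
```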
These models attempt to learn 1000 classes. This is not a reasonable expectation, and thus we hypothesize that the train and test scores will be too low to support reasonable recommendations. Furthermore, the following models are likely to overfit (when there is sufficient data) to individual playlists and fail to learn the general character of a playlist.
We do not tune the hyper-parameters of our baseline models. These models serve an exploratory purpose rather than a predictive purpose. Moreover, given the number of classes in each model, it is computationally impractical to tune baseline hyper-parameters. We will tune our final models.
In order to have a "human-readable" point of comparison between models, we show the first 5 songs of a random playlist in the dataset. After fitting each model, we show that model's top 5 recommendations for this playlist. It appears to be a playlist centered around hard rock and metal music.
We begin with a cross-validated logistic regression model using Ridge (L2) regularization and 5-fold cross-validation. This model has a train set accuracy score of 0.009 and a test set accuracy score of 0.008. We note that Sklearn threw an error about too few samples per class when training this model; we address this issue in the next section.
Next we consider a simple KNN classification model (we do not tune k). This model has a train set accuracy score of 0.184 and a test set accuracy score of 0.006.
Now we consider a simple Random Forest model (we do not tune the number of estimators nor the complexity of each tree). This model has a train set accuracy score of 0.572 and a test set accuracy score of 0.046.
Finally, we consider a simple Adaboost model (we do not tune the number of estimators nor the learning rate). This model has a train set accuracy score of 0.006 and a test set accuracy score of 0.006.
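For reference, the four untuned baselines can be fit with a sketch like the one below, reusing the split from above (the settings shown are library defaults, not necessarily those in the notebook):

```python
from sklearn.linear_model import LogisticRegressionCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

baselines = {
    "logistic (ridge, 5-fold CV)": LogisticRegressionCV(cv=5, penalty="l2"),
    "knn": KNeighborsClassifier(),
    "random forest": RandomForestClassifier(),
    "adaboost": AdaBoostClassifier(),
}

# Fit each baseline and report train / test accuracy.
for name, clf in baselines.items():
    clf.fit(X_train, y_train)
    print(name, clf.score(X_train, y_train), clf.score(X_test, y_test))
```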
Based solely on each model's accuracy score on the test set, the random forest model best differentiates between the overall scopes of the playlists, implying that a random forest model recommends the most relevant songs. We see some evidence of this, as the random forest model's top 5 recommendations include a few hard rock/metal songs. This is, of course, a preliminary analysis, and below we begin to address potential issues.
All models had very low test set accuracy scores (less than 0.05). With the current structure of our response, our models cannot make consistently relevant recommendations. Our hypothesis seems to be correct.
In our baseline models, the response variable is the playlist id. These models sought to predict the playlist to which a particular song would likely belong. From these models, we used the predicted probability for a given playlist to extract a list of potential songs.
This approach has several glaring disadvantages: with 1000 playlists there are very few songs per class, training a classifier over so many classes is computationally expensive, and the models cannot make recommendations for playlists they were not trained on (in particular, new playlists).
In short, we need a more general response variable in order to satisfy the computational and theoretical demands of classification algorithms as well as the nature of our dataset.
There are several potential strategies we could use to combat these concerns. One potential solution is to fit a model on demand for the playlist we would like recommendations for: the response becomes a binary categorical variable equal to one for all songs already in the playlist, we fit a classification model, and we make recommendations from its predicted probabilities. This does not, however, solve our "cold-start" problem, as a new playlist will not have enough songs for any model to learn from. Another potential solution is to define groups of playlists based on genre or music theory. Below we consider a more data-driven approach.
We will cluster playlists based on aggregate statistics and use the resulting clusters as our response variable. This approach has several advantages relative to our baseline models. Importantly, we greatly reduce the number of classes, which also increases the number of songs per class; these factors increase the statistical power of our models and reduce the computational burden of training our classifiers. Additionally, sufficiently similar playlists will fall in the same cluster and can therefore freely receive the same recommendations. Finally, the unsupervised nature of clustering allows us to recommend songs for playlists the model was not explicitly trained on.
There are several important considerations when clustering our data: which clustering algorithm to use, which song features to cluster on, how to aggregate song-level features to the playlist level, how many clusters to use, and how to validate the resulting clusters.
Below we dive into our implementation and work through these considerations.
We will use Scikit's implementation of K-Means clustering. K-Means is a good general-purpose clustering algorithm which suits our needs.
Clustering simplifies the problem greatly, so the resulting models should have more power to differentiate between groups of playlists; the test-set performance should reflect this.
We use the same song features in our clustering algorithm as in our classification models.
We now have to derive aggregate statistics for each playlist based on its songs' features. The quantitative song features given by Spotify are bounded, so outliers are not a concern and we can aggregate using a simple average.
We need to take care in how we handle the categorical features. There are three to consider: time signature, mode, and key. Initially, one might opt for the mode (most common level) of each categorical variable; however, we can dismiss this option with an example. Suppose a 100-song playlist is composed of 49 songs in minor mode and 51 songs in major mode. We do not want to assign this playlist a major mode: if we did, our algorithm would consider it (based on mode alone) much more similar to a playlist of all major-mode songs than to a playlist with 51 songs in minor and 49 in major. In order to capture such similarities, we again aggregate using a simple mean (after one-hot encoding). In essence, we aggregate categorical song features into proportions.
For efficiency, we define a simple function aggregate_playlists to compute these playlist-level statistics.
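Under the assumption that songs is the one-hot encoded feature matrix with a playlist_id column, the aggregation can be as simple as a grouped mean:

```python
def aggregate_playlists(songs):
    """Average every numeric (and one-hot encoded) feature within each playlist."""
    return songs.groupby("playlist_id").mean(numeric_only=True)
```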
It is difficult to determine an appropriate number of clusters on theory alone. There are over 1200 distinct genres of music (https://www.theguardian.com/music/2014/sep/04/-sp-from-charred-death-to-deep-filthstep-the-1264-genres-that-make-modern-music), and there are likely many more types of playlists. Ideally we would like to identify fewer clusters in order to reduce the dimensionality of our response variable. To this end, we apply data-driven heuristics.
We will use two iterative methods to determine how many clusters to include: the Elbow Method and the Silhouette Method. These methods require us to run the clustering with multiple choices for the number of clusters. The Elbow Method calculates an error metric within clusters (often the sum of squared errors, or distortion, relative to the cluster centers) and plots this error as a function of the number of clusters used. If clustering is not successful, the decrease in error will be roughly linear; if, however, there is an inflection point, then the algorithm is identifying reasonably distinct clusters. This elbow plot will indicate whether or not clustering is appropriate.
We will determine the most appropriate number of clusters with the Silhouette Method. This method is analogous to the Elbow Method, however it only compares neighboring clusters. The Scikit implementation uses the mean ratio of intra-cluster and nearest-cluster distances. Again, this value is plotted against the number of clusters, but now we look for the k which maximizes the silhouette coefficient (a consequence of Scikit's formulation of the statistic).
All cluster model training is done on the train set.
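A sketch of both diagnostics, assuming playlist_stats_train is the aggregated train-set matrix from aggregate_playlists (the range of k and random state are illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

playlist_stats_train = aggregate_playlists(train_songs)

ks = range(4, 41)
inertias, silhouettes = [], []
for k in ks:
    km = KMeans(n_clusters=k, random_state=9001).fit(playlist_stats_train)
    inertias.append(km.inertia_)                                        # elbow metric
    silhouettes.append(silhouette_score(playlist_stats_train, km.labels_))

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.plot(list(ks), inertias)
ax1.set_xlabel("number of clusters k"); ax1.set_ylabel("inertia (elbow)")
ax2.plot(list(ks), silhouettes)
ax2.set_xlabel("number of clusters k"); ax2.set_ylabel("mean silhouette score")
plt.show()
```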
Our Elbow plot is non-linear! This indicates that we have good clustering for reasonable values of k (25 - 40). We now use a Silhouette plot to determine the precise k we want to use. We check possible k values between 4 and the upper bound determined by the Elbow plot.
From the Silhouette plot, we choose k = 34.
If we increased the upper bound on k, we would likely see higher silhouette scores. This highlights the importance of using the Elbow plot and the Silhouette plot in tandem. The Elbow plot indicates that clustering at the level of k = 34 is reasonable in order to capture the core characteristics of playlists in our data; in other words, the Elbow plot gives us a reasonable set of bounds, and the Silhouette plot helps us determine a specific k within those bounds. Adding many more clusters will always reduce error, but since we are trying to reduce the dimension of our response, we select a reasonably low number of clusters.
Here we train our clustering model. This model will determine the cluster a playlist belongs to before our classification models determine recommendations.
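Training the final model is then a single KMeans fit (the random state is illustrative):

```python
from sklearn.cluster import KMeans

# Final clustering model: 34 clusters chosen via the Elbow and Silhouette plots.
kmeans = KMeans(n_clusters=34, random_state=9001).fit(playlist_stats_train)
```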
Armed with our clusters, we now build our new response. We also verify the efficacy of our new response augmentation by looking at the average songs per playlist and the average songs per cluster. As expected, the clusters have many more songs (on average) than the playlists. This concentration of data and decrease in class number bolsters the statistical power of our models.
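One way to sketch the response construction, assuming the aggregated test-set statistics are built with aggregate_playlists and test playlists are assigned to their nearest learned cluster:

```python
# Each song inherits the cluster of the playlist it belongs to.
train_cluster_of = dict(zip(playlist_stats_train.index, kmeans.labels_))

playlist_stats_test = aggregate_playlists(test_songs)
test_cluster_of = dict(zip(playlist_stats_test.index, kmeans.predict(playlist_stats_test)))

y_train_cluster = train_songs["playlist_id"].map(train_cluster_of)
y_test_cluster = test_songs["playlist_id"].map(test_cluster_of)
```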
The train and test playlists have average song counts of 56 and 14 respectively. The train and test clusters have average song counts of 1636 and 409 respectively. The consistent discrepancy between train and test values is due to stratification.
We verify the validity of our clusters with PCA. If our clusters are capturing meaningful variability in the data, then we should see defined clusters on the plot of the first two principal components.
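A sketch of this check, reusing the cluster model and the aggregated playlist statistics from above:

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
scores = pca.fit_transform(playlist_stats_train)

# Color each playlist by its assigned cluster on the first two principal components.
plt.scatter(scores[:, 0], scores[:, 1], c=kmeans.labels_, cmap="tab20", s=10)
plt.xlabel("PC 1"); plt.ylabel("PC 2")
plt.title("Train-set playlist clusters on the first two principal components")
plt.show()
```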
Note that clustering is unsupervised, so we have no labels that could serve as a meaningful external reference for the resulting clusters.
In both the train and test sets, the PCA plot shows distinct, connected clusters. Furthermore, clusters vary little in the direction of the first principal axis. This is to be expected, as the first principal axis captures the most variability across the playlist statistics; we expect playlists in the same cluster to have similar first principal component scores. Overall, we have reasonable clusters and can now begin to model.
Our augmented response requires a different model structure. Given a playlist, we first determine which cluster it belongs to using the clustering model trained on the train data. Then our classification models will determine the relative probability that any given song belongs in the aforementioned cluster. The most probable song (as long as it is not already in the playlist) is our top recommendation.
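The augmented flow, sketched with illustrative names (clf is any fitted cluster classifier; cand_feats and cand_meta describe the candidate songs):

```python
def recommend_for_playlist(playlist_feats, playlist_uris, clf, kmeans, cand_feats, cand_meta, n=5):
    """Aggregate the playlist, find its cluster, then rank candidates by cluster probability."""
    stats = playlist_feats.mean().to_frame().T                 # playlist-level aggregate statistics
    cluster = kmeans.predict(stats)[0]                         # cluster assignment for this playlist
    cluster_idx = list(clf.classes_).index(cluster)
    probs = clf.predict_proba(cand_feats)[:, cluster_idx]
    ranked = cand_meta.assign(prob=probs).sort_values("prob", ascending=False)
    ranked = ranked[~ranked["track_uri"].isin(playlist_uris)]  # skip songs already in the playlist
    return ranked.head(n)
```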
As with our baseline models, we first define a few quality of life functions. The first reports the results of each model and the second returns a table of recommendations.
When computationally feasible, our models are tuned with an exhaustive grid search of relevant hyper-parameters. When this is not possible, they are tuned with a randomized search of relevant hyper-parameters.
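An illustrative tuning setup for one of the models (the grid shown is an example, not the exact one used):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(
    RandomForestClassifier(random_state=9001),
    param_grid={"n_estimators": [100, 200, 400], "max_features": ["sqrt", "log2"]},
    cv=5,
    n_jobs=-1,
)
grid.fit(X_train, y_train_cluster)
print(grid.best_params_, grid.best_score_)
```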
Here we select a random test playlist which will serve as a manual sanity check on our recommendations. We did the same for our baseline models; however, due to the nature of our baseline response, we were then forced to use a playlist from our train dataset. The test playlist below is selected completely at random from the Million Playlist dataset. Below we display its first 5 songs; it appears to contain mostly classic rock.
As in our baseline models, we begin with logistic regression. We tune the penalty term (l1 vs l2) as well as the regularization parameter. We find that the optimal regularization is l1 with a strong regularization parameter (C = 1). This model has an accuracy score of 0.085 on the train set and an accuracy score of 0.071 on the test set.
In our KNN model, we tune the number of neighbors and find the optimal number to be 12. This model has an accuracy score of 0.214 on the train set and an accuracy score of 0.070 on the test set.
For the random forest, we tune the number of trees as well as the rule for how many features are considered at each split. We find that the optimal model has 200 trees and considers the square root of the number of predictors at each split. This model has an accuracy score of 0.637 on the train set and an accuracy score of 0.081 on the test set.
For AdaBoost, we tune the base estimator (the depth of the base decision tree) as well as the learning rate. We find that the optimal model has a base decision tree of depth 2 and a learning rate of 0.5. This model has an accuracy score of 0.101 on the train set and an accuracy score of 0.075 on the test set.
In our baseline model structure, it was not computationally practical to train a neural network; our clustered response allows us to do so. Below we construct a neural network with 3 hidden layers, each with 100 nodes. These layers use the relu activation function, and the output layer uses the softmax activation function. This model has an accuracy score of 0.053 on the train set and an accuracy score of 0.054 on the test set.
The similarity of performance between the train and test sets indicates that this network is not overfit and thus does not require regularization. With a higher learning rate, the same model can achieve upwards of 0.08 test set accuracy, but only because it always predicts the most populous cluster. This is, of course, not the desired outcome, so we manually lowered the learning rate.
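One way to realize this architecture, sketched with Sklearn's MLPClassifier (the exact framework, learning rate, and iteration budget used in the notebook may differ):

```python
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(
    hidden_layer_sizes=(100, 100, 100),  # three hidden layers of 100 relu units each
    activation="relu",                   # softmax output is applied automatically for multi-class
    learning_rate_init=1e-4,             # manually lowered to avoid collapsing onto the majority cluster
    max_iter=500,
    random_state=9001,
)
mlp.fit(X_train, y_train_cluster)
print(mlp.score(X_train, y_train_cluster), mlp.score(X_test, y_test_cluster))
```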
Below we organize the test and train scores of our models. The cluster-based models performed much better than our baseline models (often by an entire order of magnitude). According to test-set performance alone, the random forest model still performs the best. Again, the recommendations of the random forest model seem to be most relevant.