playlist_dict: A dictionary with playlist ids as keys. Each value is a list holding the playlist name and a list of [track id, track name, artist name] entries for the tracks in that playlist.

import json
import random

import numpy as np

playlist_dict = {}
# The data is stored as 100 slice files of 1,000 playlists each
for i in range(100):
    path = "mpd.v1/data/mpd.slice.{}-{}.json".format(i * 1000, (i + 1) * 1000 - 1)
    with open(path, 'r') as dt:
        data = json.load(dt)
    for playlist in data['playlists']:
        playlist_dict[playlist['pid']] = [
            playlist['name'],
            [[track['track_uri'], track['track_name'], track['artist_name']]
             for track in playlist['tracks']],
        ]

idx_playlist: A dictionary with playlist ids as keys and playlist names as values.

idx_playlist = {playlist: playlist_dict[playlist][0] for playlist in playlist_dict}
n_playlist = len(idx_playlist)

track_name: A dictionary with track ids as keys and track names as values.
track_artist: A dictionary with track ids as keys and the artists of the tracks as values.

track_name = {}
track_artist = {}
for _, tracks in playlist_dict.values():
    for track_id, name, artist in tracks:
        track_name[track_id] = name
        track_artist[track_id] = artist

idx_track: A dictionary with track indexes as keys and track ids as values.
track_idx: A dictionary with track ids as keys and track indexes as values, the reverse of idx_track.

# Every track occurrence across all playlists (with repeats)
trackids = [track[0] for _, tracks in playlist_dict.values() for track in tracks]
n_trackids = len(trackids)
unique_trackids = set(trackids)
n_unique_trackids = len(unique_trackids)
idx_track = {idx: trackid for (idx, trackid) in enumerate(unique_trackids)}
track_idx = {trackid: idx for (idx, trackid) in idx_track.items()}

playlist_song_pair_train: 4744137 unique (playlist index, track index) pairs in our training dataset.
playlist_song_pair_val: 527126 unique (playlist index, track index) pairs in our validation dataset.
playlist_song_pair_test: 1317816 unique (playlist index, track index) pairs in our test dataset.

The unique pairs are shuffled; 20% are held out for testing, and 10% of the remaining 80% is used for validation. This gives a split of roughly 72% / 8% / 20%.

playlist_song_pair = []
for pid, (_, tracks) in playlist_dict.items():
    playlist_song_pair.extend((pid, track_idx[track[0]]) for track in tracks)
playlist_song_pair_set = set(playlist_song_pair)
playlist_song_pair_uniq = list(playlist_song_pair_set)
random.shuffle(playlist_song_pair_uniq)

n_train = int(0.8 * len(playlist_song_pair_uniq))
n_val = int(0.1 * n_train)
playlist_song_pair_train = playlist_song_pair_uniq[:(n_train - n_val)]
playlist_song_pair_val = playlist_song_pair_uniq[(n_train - n_val):n_train]
playlist_song_pair_test = playlist_song_pair_uniq[n_train:]

The generator function generate_batch is designed to generate batches of training, validation and test data.
generate_batch takes:
pairs_gen : the training, validation, or test set of (playlist, song) pairs
pairs : the set of all (playlist, song) pairs
n_positives : number of positive responses in a batch
negative_ratio : number of negative samples drawn per positive sample
n_playlists : number of unique playlists
n_tracks : number of unique tracks
and returns:
A batch X that mixes pairs drawn from pairs_gen (positives) with randomly drawn pairs that are not in pairs (negatives), and a batch of corresponding response variables y (1 = the playlist contains the track, 0 = the playlist does not contain the track).
import random
import threading

import numpy as np
# The exact Keras import path is an assumption; the summary output below matches tf.keras
from tensorflow.keras.layers import Input, Embedding, Dot, Reshape, Dense
from tensorflow.keras.models import Model
from sklearn.metrics import accuracy_score

def locked_iter(it):
    """Wrap an iterator so that only one thread at a time can advance it."""
    it = iter(it)
    lock = threading.Lock()
    while True:
        try:
            with lock:
                value = next(it)
        except StopIteration:
            return
        yield value

def generate_batch(pairs_gen, pairs, n_positives, negative_ratio, n_playlists, n_tracks):
    """Yield (X, y) batches of positive pairs from pairs_gen plus random negative pairs."""
    batch_size = int(n_positives * (1 + negative_ratio))
    batch = np.zeros((batch_size, 3))
    lock = threading.Lock()
    while True:
        with lock:
            # Positive samples: pairs that really occur in the data
            for i, (playlist, song) in enumerate(random.sample(pairs_gen, n_positives)):
                batch[i, :] = (playlist, song, 1)
            # Negative samples: random pairs that do not occur anywhere in `pairs`
            i = n_positives
            while i < batch_size:
                random_playlist = random.randrange(n_playlists)
                random_track = random.randrange(n_tracks)
                if (random_playlist, random_track) not in pairs:
                    batch[i, :] = (random_playlist, random_track, 0)
                    i += 1
            np.random.shuffle(batch)
            X = {'playlist': batch[:, 0], 'track': batch[:, 1]}
            y = batch[:, 2]
        yield X, y

def playlist_embedding_model(embedding_size = 75, classification = True):
    """Model to embed playlists and tracks using the functional API.
    Trained to discern whether a track is present in a playlist."""
    # Both inputs are 1-dimensional
    playlist = Input(name = 'playlist', shape = [1])
    track = Input(name = 'track', shape = [1])
    # Embedding the playlist (shape will be (None, 1, embedding_size))
    playlist_embedding = Embedding(name = 'playlist_embedding', input_dim = n_playlist,
                                   output_dim = embedding_size)(playlist)
    # Embedding the track (shape will be (None, 1, embedding_size))
    track_embedding = Embedding(name = 'track_embedding', input_dim = n_unique_trackids,
                                output_dim = embedding_size)(track)
    # Merge the layers with a normalized dot product along the second axis
    # (shape will be (None, 1, 1))
    merged = Dot(name = 'dot_product', normalize = True, axes = 2)([playlist_embedding, track_embedding])
    # Reshape to be a single number (shape will be (None, 1))
    merged = Reshape(target_shape = [1])(merged)
    # If classification, add an extra layer; the loss function is binary cross entropy
    if classification:
        merged = Dense(1, activation = 'sigmoid')(merged)
        model = Model(inputs = [playlist, track], outputs = merged)
        model.compile(optimizer = 'Adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
    # Otherwise the loss function is mean squared error
    else:
        model = Model(inputs = [playlist, track], outputs = merged)
        model.compile(optimizer = 'Adam', loss = 'mse')
    return model

# Instantiate model and show parameters
model = playlist_embedding_model()
model.summary()

Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
playlist (InputLayer)           [(None, 1)]          0
track (InputLayer)              [(None, 1)]          0
playlist_embedding (Embedding)  (None, 1, 75)        7500000     playlist[0][0]
track_embedding (Embedding)     (None, 1, 75)        51135375    track[0][0]
dot_product (Dot)               (None, 1, 1)         0           playlist_embedding[0][0]
                                                                 track_embedding[0][0]
reshape_1 (Reshape)             (None, 1)            0           dot_product[0][0]
dense_1 (Dense)                 (None, 1)            2           reshape_1[0][0]
==================================================================================================
Total params: 58,635,377
Trainable params: 58,635,377
Non-trainable params: 0
__________________________________________________________________________________________________

# One batch containing every test pair plus two negative samples per positive
test_mod1 = next(generate_batch(playlist_song_pair_test, playlist_song_pair_set,
                                len(playlist_song_pair_test), 2, n_playlist, n_unique_trackids))
y_true = test_mod1[1]
y_score = model.predict(test_mod1[0])
y_pred = y_score > 0.5
accuracy_score(y_true, y_pred)
-----------------------------------------------------------------------------------------------------------------------------
0.9244788346779823

The model reaches an accuracy of 0.924 on the test set at distinguishing whether a track belongs to a playlist, which is not bad. However, the plots of model accuracy and model loss against epoch show that there is still some overfitting.
confusion_matrix(y_true, y_pred)
-----------------------------------------------------------------------------------------------------------------------------
array([[2596777,   38855],
       [ 259714, 1058102]])

With scikit-learn's convention (rows are true labels, columns are predicted labels), the model produced 38,855 false positives and 259,714 false negatives, so the false negative rate (about 19.7%) is higher than the false positive rate (about 1.5%). The false positives still deserve a charitable reading: a true value of 0 does not necessarily mean that the song is unsuited to the playlist. A false positive can instead be interpreted as a track that is not yet in the playlist but has the potential to be added to it.
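To make the error rates explicit, they can be computed directly from the four counts in the matrix. This is a minimal sketch that assumes scikit-learn's layout (rows are true labels, columns are predicted labels); the variable names are illustrative and not part of the original notebook.

```python
# Counts copied from the confusion matrix above.
# Assumed layout (scikit-learn convention): rows = true label, columns = predicted label.
matrix = [[2596777, 38855],
          [259714, 1058102]]

(tn, fp), (fn, tp) = matrix

# Share of true negatives (label 0) that were predicted as 1
false_positive_rate = fp / (fp + tn)
# Share of true positives (label 1) that were predicted as 0
false_negative_rate = fn / (fn + tp)
# Share of all pairs that were classified correctly
accuracy = (tn + tp) / (tn + fp + fn + tp)

print(f"FPR = {false_positive_rate:.4f}")
print(f"FNR = {false_negative_rate:.4f}")
print(f"Accuracy = {accuracy:.4f}")
```

Because the test batch contains two negatives per positive, the raw counts are not directly comparable; dividing by the row totals, as above, removes that imbalance.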
# The exact Keras import path is an assumption; the summary output below matches tf.keras
from tensorflow.keras.layers import Input, Embedding, Dot, Dense, Dropout, Flatten, concatenate
from tensorflow.keras.models import Model

def modified_playlist_embedding_model(embedding_size = 75, classification = True, bias = 1):
    """Embedding model with additional per-playlist and per-track bias terms.
    Trained to discern whether a track is present in a playlist."""
    # Both inputs are 1-dimensional
    playlist = Input(name = 'playlist', shape = [1])
    track = Input(name = 'track', shape = [1])
    # Embedding the playlist (shape will be (None, 1, embedding_size))
    playlist_embedding = Embedding(name = 'playlist_embedding', input_dim = n_playlist,
                                   output_dim = embedding_size)(playlist)
    playlist_bias = Embedding(n_playlist, bias, name = "playlist_bias")(playlist)
    # Embedding the track (shape will be (None, 1, embedding_size))
    track_embedding = Embedding(name = 'track_embedding', input_dim = n_unique_trackids,
                                output_dim = embedding_size)(track)
    track_bias = Embedding(n_unique_trackids, bias, name = "track_bias")(track)
    # Merge the layers with a normalized dot product along the second axis
    # (shape will be (None, 1, 1))
    merged = Dot(name = 'dot_product', normalize = True, axes = 2)([playlist_embedding, track_embedding])
    # Concatenate the dot product with the two bias terms and flatten to (None, 3)
    input_terms = concatenate([merged, playlist_bias, track_bias])
    input_terms = Flatten()(input_terms)
    # If classification, add extra layers; the loss function is binary cross entropy
    if classification:
        dense_1 = Dense(20, activation = "relu", name = "Dense1")(input_terms)
        dense_1 = Dropout(0.2)(dense_1)
        merged = Dense(1, activation = 'sigmoid')(dense_1)
        model = Model(inputs = [playlist, track], outputs = merged)
        model.compile(optimizer = 'Adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
    # Otherwise the loss function is mean squared error
    else:
        model = Model(inputs = [playlist, track], outputs = merged)
        model.compile(optimizer = 'Adam', loss = 'mse')
    return model

# Instantiate model and show parameters
modified_model = modified_playlist_embedding_model()
modified_model.summary()

Model: "model_2"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
playlist (InputLayer)           [(None, 1)]          0
track (InputLayer)              [(None, 1)]          0
playlist_embedding (Embedding)  (None, 1, 75)        7500000     playlist[0][0]
track_embedding (Embedding)     (None, 1, 75)        51135375    track[0][0]
dot_product (Dot)               (None, 1, 1)         0           playlist_embedding[0][0]
                                                                 track_embedding[0][0]
playlist_bias (Embedding)       (None, 1, 1)         100000      playlist[0][0]
track_bias (Embedding)          (None, 1, 1)         681805      track[0][0]
concatenate (Concatenate)       (None, 1, 3)         0           dot_product[0][0]
                                                                 playlist_bias[0][0]
                                                                 track_bias[0][0]
flatten (Flatten)               (None, 3)            0           concatenate[0][0]
Dense1 (Dense)                  (None, 20)           80          flatten[0][0]
dropout (Dropout)               (None, 20)           0           Dense1[0][0]
dense_2 (Dense)                 (None, 1)            21          dropout[0][0]
==================================================================================================
Total params: 59,417,281
Trainable params: 59,417,281
Non-trainable params: 0
__________________________________________________________________________________________________

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# y_score_mod holds the modified model's scores on the same test batch as y_score
roc1 = roc_curve(y_true, y_score)
roc2 = roc_curve(y_true, y_score_mod)
auc_score1 = roc_auc_score(y_true, y_score)
auc_score2 = roc_auc_score(y_true, y_score_mod)

figure = plt.figure(figsize = (10, 8))
plt.plot(roc1[0], roc1[1], label = "Model1, AUC score ={}".format(auc_score1))
plt.plot(roc2[0], roc2[1], label = "Model2, AUC score ={}".format(auc_score2))
plt.title("ROC curve")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
# Diagonal reference line for a random classifier
plt.plot([0, 1], [0, 1], "-.")
plt.legend()

The AUC score for the first model is 0.9649 and the AUC score for the second model is 0.9316.
Also, the first model generalizes better to the validation and test sets, indicating that it performs slightly better than the modified second model.
Although the second model incorporates more terms and has more layers than the first one, it doesn't improve the predictive power. Including the bias layers doesn't help performance in this case.
Therefore, we decided to adopt the first unmodified model for further analysis.
The neural network model with embeddings gives us more than a prediction of whether a song belongs in a playlist: it also gives us the learned embeddings of playlists and songs in a dimension-reduced space. More importantly, once the embedding vectors are normalized to unit length, the dot product of the vectors of two items is their cosine similarity. Therefore, given a playlist (or song), we can find the most similar playlists (or songs) in the full set and recommend them to the user.
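The claim that a dot product of normalized vectors equals the cosine similarity can be checked on a small example. The sketch below uses a made-up 3 x 2 embedding matrix; `normalize_rows` and `cosine_similarity` are illustrative helper names, not functions from the original notebook.

```python
import numpy as np

def normalize_rows(weights):
    """Scale every row of the matrix to unit L2 norm."""
    return weights / np.linalg.norm(weights, axis=1, keepdims=True)

def cosine_similarity(u, v):
    """Textbook cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy embedding matrix (made-up values): 3 items, 2 dimensions
toy_weights = np.array([[3.0, 4.0],
                        [6.0, 8.0],
                        [-4.0, 3.0]])

unit = normalize_rows(toy_weights)

# One matrix-vector product gives the similarity of item 0 to every item
sims_to_item0 = unit @ unit[0]

# The same values computed pair by pair with the textbook formula
pairwise = [cosine_similarity(toy_weights[i], toy_weights[0]) for i in range(3)]

print(sims_to_item0)
print(pairwise)
```

Normalizing once up front is what makes the search cheap: after that, ranking all items against a query is a single matrix-vector product.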
For this section, we used the weights obtained from model 1 above:
playlist_layer = model.get_layer('playlist_embedding')
track_layer = model.get_layer('track_embedding')
playlist_weights = playlist_layer.get_weights()[0]
track_weights = track_layer.get_weights()[0]

# Normalize every embedding vector to unit length
playlist_weights = playlist_weights / np.linalg.norm(playlist_weights, axis = 1).reshape((-1, 1))
track_weights = track_weights / np.linalg.norm(track_weights, axis = 1).reshape((-1, 1))

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight')
plt.rcParams['font.size'] = 15

def find_similar(idx, weights, index_name = 'playlist', n = 10, least = False,
                 return_dist = False, plot = False):
    """Find the n most (or least) similar items to item `idx` based on embeddings.
    Option to plot the results instead."""
    # Select the index that maps embedding rows to playlist names or track ids
    if index_name == 'playlist':
        index = idx_playlist
    elif index_name == 'track':
        index = idx_track

    def label(c):
        # Readable name: the playlist name, or the track name looked up by track id
        return index[c] if index_name == 'playlist' else track_name[index[c]]

    # Check to make sure `idx` is a valid row of the embedding matrix
    try:
        dists = np.dot(weights, weights[idx])
    except (KeyError, IndexError):
        print(f'{idx} Not Found.')
        return

    # Sort indexes from least similar to most similar
    sorted_dists = np.argsort(dists)

    # Plot results if specified
    if plot:
        # Find furthest and closest items; the last entry is skipped because
        # it is normally the item itself
        furthest = sorted_dists[:(n // 2)]
        closest = sorted_dists[-n - 1: len(dists) - 1]
        items = [label(c) for c in furthest] + [label(c) for c in closest]
        distances = [dists[c] for c in furthest] + [dists[c] for c in closest]
        colors = ['r'] * (n // 2) + ['g'] * n
        data = pd.DataFrame({'distance': distances}, index = items)
        # Horizontal bar chart
        data['distance'].plot.barh(color = colors, figsize = (10, 8),
                                   edgecolor = 'k', linewidth = 2)
        plt.xlabel('Cosine Similarity')
        plt.axvline(x = 0, color = 'k')
        # Title uses LaTeX to italicize the item name
        name_str = f'{index_name.capitalize()}s Most and Least Similar to'
        for word in label(idx).split():
            name_str += r' $\it{' + word + '}$'
        plt.title(name_str, x = 0.2, size = 28, y = 1.05)
        return None

    if least:
        # Take the first n from the sorted similarities
        closest = sorted_dists[:n]
        print(f'{index_name.capitalize()}s furthest from {label(idx)}.\n')
    else:
        # Otherwise take the last n, which are the most similar
        closest = sorted_dists[-n:]
        # Return early, without printing, if the caller wants the raw values
        if return_dist:
            return dists, closest
        print(f'{index_name.capitalize()}s closest to {label(idx)}.\n')

    # Need distances later on
    if return_dist:
        return dists, closest

    # Print the most similar items and their similarities, aligned by name width
    max_width = max(len(label(c)) for c in closest)
    for c in reversed(closest):
        print(f'{index_name.capitalize()}: {label(c):{max_width + 2}}({c}) Similarity: {dists[c]:.{2}}')

Now we would like to see how this item-to-item (or user-to-user) recommendation method works by exploring the Million Playlist Dataset with our search function find_similar().
First, we do playlist-to-playlist recommendation:
# Top 10 most common playlist names
list(count_items(idx_playlist.values()))[:10]
-----------------------------------------------------------------------------------------------------------------------------
['Country', 'Chill', 'country', 'chill', 'Christmas', 'Rap', 'Workout', 'Rock', 'Oldies', 'workout']

Good, let's check what we get if we look for the playlists most similar to the first "chill" playlist (with playlist_id = 67):
find_similar(67, playlist_weights, "playlist")
-----------------------------------------------------------------------------------------------------------------------------
Playlists closest to chill.

Playlist: chill       (67)    Similarity: 1.0
Playlist: Sleep       (60872) Similarity: 0.73
Playlist: chilllllll  (3155)  Similarity: 0.72
Playlist: love        (263)   Similarity: 0.72
Playlist: my style    (79028) Similarity: 0.72
Playlist: Vibes       (83919) Similarity: 0.72
Playlist: deep        (10822) Similarity: 0.71
Playlist: slow        (95088) Similarity: 0.71
Playlist: cameron.    (79791) Similarity: 0.71
Playlist: Ocean       (95034) Similarity: 0.71

Playlist names are user-defined, so a name does not always reflect the true content of the playlist. Even so, we can see a pattern here: the playlists most similar to "chill" tend to have "softer" names such as "Sleep", "chilllllll", "deep" and "slow".
What about playlist "rock" (playlist_id = 580)?
find_similar(580, playlist_weights, "playlist")
-----------------------------------------------------------------------------------------------------------------------------
Playlists closest to rock.

Playlist: rock              (580)   Similarity: 1.0
Playlist: Classic Rock      (51536) Similarity: 0.91
Playlist: Classic Rock!!!!  (89066) Similarity: 0.9
Playlist: Oldy              (31379) Similarity: 0.9
Playlist: Oldies            (54852) Similarity: 0.9
Playlist: rock              (40893) Similarity: 0.9
Playlist: Rock Out          (44510) Similarity: 0.89
Playlist: classics          (98893) Similarity: 0.89
Playlist: Party             (80299) Similarity: 0.89
Playlist: classic rock      (38917) Similarity: 0.89

By comparing the distribution of similarities between the first "rock" playlist and the other "rock" playlists with the distribution of similarities between the first "rock" playlist and the "chill" playlists, we can clearly see that the "rock" playlist is more similar to most "rock" playlists than to "chill" playlists.
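One way such a comparison could be carried out is sketched below on made-up toy data. The helper `group_similarities` is hypothetical and not part of the original notebook; it assumes that the keys of the name dictionary are row indexes of a row-normalized embedding matrix, and that a playlist belongs to a group when its lower-cased name contains the keyword.

```python
import numpy as np

def group_similarities(query_idx, weights, names, keyword):
    """Cosine similarities between one playlist and every other playlist
    whose lower-cased name contains `keyword`.
    `weights` is assumed to be row-normalized."""
    members = [i for i, name in names.items()
               if keyword in name.lower() and i != query_idx]
    return weights[members] @ weights[query_idx]

# Made-up toy data: 5 playlists with 2-dimensional embeddings
toy_names = {0: "rock", 1: "Classic Rock", 2: "rock out", 3: "chill", 4: "Chill vibes"}
toy_weights = np.array([[1.0, 0.1],
                        [0.9, 0.2],
                        [0.8, 0.3],
                        [0.1, 1.0],
                        [0.2, 0.9]])
toy_weights = toy_weights / np.linalg.norm(toy_weights, axis=1, keepdims=True)

rock_sims = group_similarities(0, toy_weights, toy_names, "rock")
chill_sims = group_similarities(0, toy_weights, toy_names, "chill")

print("mean similarity to rock playlists: ", rock_sims.mean())
print("mean similarity to chill playlists:", chill_sims.mean())
```

On the real data, the two arrays could be passed to a histogram plot to compare the full distributions rather than only their means.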
Example: We want to recommend songs to users who love Halo by Beyoncé:
find_similar(track_idx["spotify:track:2CvOqDpQIMw69cCzWqr5yr"], track_weights, "track", 20)
-----------------------------------------------------------------------------------------------------------------------------
Tracks closest to Halo.

Track: Halo (222585) Similarity: 1.0
Track: Irreplaceable (457932) Similarity: 0.93
Track: No One (70271) Similarity: 0.92
Track: Best Thing I Never Had (221610) Similarity: 0.92
Track: If I Were a Boy (608553) Similarity: 0.92
Track: Take A Bow - Main (117995) Similarity: 0.91
Track: Bleeding Love (222284) Similarity: 0.91
Track: Love On Top (261244) Similarity: 0.91
Track: No Air (673503) Similarity: 0.9
Track: Girl On Fire (255886) Similarity: 0.9
Track: Stay - Album Version (Edited) (86096) Similarity: 0.9
Track: Rolling in the Deep (124625) Similarity: 0.9
Track: Unwritten (279580) Similarity: 0.9
Track: Big Girls Don't Cry (Personal) (151650) Similarity: 0.9
Track: Too Little, Too Late - Radio Version (307074) Similarity: 0.89
Track: Just Give Me a Reason (529274) Similarity: 0.89
Track: No Scrubs (115642) Similarity: 0.89
Track: When I Was Your Man (143176) Similarity: 0.89
Track: If I Ain't Got You (433781) Similarity: 0.89
Track: Crazy In Love (645225) Similarity: 0.89

We can also plot the most similar and the least similar songs for a specific song, say 24K Magic by Bruno Mars:
find_similar(track_idx["spotify:track:6b8Be6ljOzmkOmFslEb23P"], track_weights, "track", 10, plot = True)

By searching for similar playlists and songs based on the cosine similarity derived from the NN embedding model, we get a general idea of how well this crude recommendation strategy works. As an approach based on collaborative filtering, the embedding network relates songs to each other if they are distributed similarly across the playlists, and relates playlists to each other if they have similar song compositions. This works well for certain groups of users, especially those who listen to the trendiest music, because it is easy to find similar trendy songs.

However, this approach has a major drawback. Less popular music, including many excellent independent tracks, appears in fewer playlists and therefore has a lower chance of being recommended. If we depend solely on collaborative filtering, the system enters a filter bubble in which popular songs become more popular while less popular music becomes less popular. One way to overcome this is to introduce randomness, so that we do not always recommend the most relevant song. To further improve the model, we need to incorporate more content-based information, for example the audio features and genres of the songs, so that the similarity of two songs depends not only on their distribution across playlists but also on their shared audio features.
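The idea of introducing randomness could be sketched as follows. This is only one possible scheme and not something the original notebook implements: instead of always returning the top k items, it samples k items from a larger pool of similar candidates, with probabilities given by a softmax over the similarities. The function name `sample_recommendations` and its parameters (`pool_size`, `temperature`) are hypothetical.

```python
import numpy as np

def sample_recommendations(sims, k=5, pool_size=20, temperature=0.1, rng=None):
    """Pick k distinct items at random from the pool_size most similar ones.
    Higher similarity means higher probability (softmax with a temperature)."""
    rng = np.random.default_rng() if rng is None else rng
    # Indexes of the pool_size most similar items
    pool = np.argsort(sims)[-pool_size:]
    # Softmax over the pool; subtracting the max keeps the exponentials stable
    logits = sims[pool] / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return rng.choice(pool, size=k, replace=False, p=probs)

# Made-up similarity scores for 100 candidate tracks
rng = np.random.default_rng(0)
toy_sims = rng.uniform(-1.0, 1.0, size=100)

picks = sample_recommendations(toy_sims, k=5, pool_size=20, rng=rng)
print(picks)
```

A lower temperature keeps the picks close to the plain top-k ranking, while a higher one spreads the probability more evenly over the pool. In practice the query item itself would need to be removed from the candidates first.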