playlist_dict: A dictionary with playlist ids as the keys, and as values a list containing the playlist name together with the track id, track name, and artist name of every track in the playlist.

playlist_dict = {}
for i in range(100):
    with open("mpd.v1/data/mpd.slice.{}-{}.json".format(i*1000, (i+1)*1000-1), 'r') as dt:
        data = json.load(dt)
    for playlist in data['playlists']:
        playlist_dict[playlist['pid']] = [playlist['name'],
                                          [[tracks['track_uri'], tracks['track_name'], tracks['artist_name']]
                                           for tracks in playlist['tracks']]]
idx_playlist: A dictionary with playlist ids as the keys, and playlist names as the values.

idx_playlist = {playlist: playlist_dict[playlist][0] for playlist in playlist_dict}
n_playlist = len(idx_playlist)
track_name: A dictionary with track ids as keys, and track names as values.
track_artist: A dictionary with track ids as keys, and artists of the tracks as values.

track_name_list = []
track_artist_list = []
for i, v in playlist_dict.items():
    track_name_list += (np.array(v[1])[:, 0:2]).tolist()
    track_artist_list += (np.array(v[1])[:, (0, 2)]).tolist()
track_name = {tup[0]: tup[1] for tup in track_name_list}
track_artist = {tup[0]: tup[1] for tup in track_artist_list}
idx_track: A dictionary with track indices as keys, and track ids as values.
track_idx: A dictionary with track ids as keys, and track indices as values, the reverse of idx_track.

trackids = []
for i in range(n_playlist):
    trackids += list(chain(np.array(playlist_dict[i][1])[:, 0]))
n_trackids = len(trackids)
unique_trackids = set(trackids)
n_unique_trackids = len(unique_trackids)
idx_track = {idx: trackid for (idx, trackid) in enumerate(unique_trackids)}
track_idx = {trackid: idx for (idx, trackid) in idx_track.items()}
playlist_song_pair_train: 4744137 unique (playlist index, track index) pairs in our training dataset.
playlist_song_pair_val: 527126 unique (playlist index, track index) pairs in our validation dataset.
playlist_song_pair_test: 1317816 unique (playlist index, track index) pairs in our test dataset.

playlist_song_pair = []
for i, v in playlist_dict.items():
    playlist_song_pair.extend((i, track_idx[song]) for song in np.array(v[1])[:, 0])
playlist_song_pair_set = set(playlist_song_pair)
playlist_song_pair_uniq = list(playlist_song_pair_set)
random.shuffle(playlist_song_pair_uniq)
n_train = int(0.8 * len(playlist_song_pair_uniq))
n_val = int(0.1 * n_train)
playlist_song_pair_train = playlist_song_pair_uniq[:(n_train - n_val)]
playlist_song_pair_val = playlist_song_pair_uniq[(n_train - n_val):n_train]
playlist_song_pair_test = playlist_song_pair_uniq[n_train:]
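As a quick sanity check, the three splits have exactly the sizes quoted above (an effective 72% / 8% / 20% split of the 6,589,079 unique pairs):

print(len(playlist_song_pair_train), len(playlist_song_pair_val), len(playlist_song_pair_test))
# 4744137 527126 1317816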
The generator function generate_batch is designed to generate batches of training, validation, and test data.

generate_batch takes:
pairs_gen: the train, validation, or test playlist-song pair set
pairs: the set of all playlist-song pairs
n_positives: the number of positive responses in a batch
negative_ratio: the number of negative samples drawn per positive sample
n_playlists: the number of unique playlists
n_tracks: the number of unique tracks

and returns:
A batch X mixing pairs from pairs_gen with randomly drawn pairs not in pairs, and a batch of corresponding response variables y (1 = the playlist contains the track, 0 = the playlist does not contain the track).
import threading

def locked_iter(it):
    """Wrap an iterator so that advancing it is protected by a lock (thread-safe iteration)."""
    it = iter(it)
    lock = threading.Lock()
    while True:
        try:
            with lock:
                value = next(it)
        except StopIteration:
            return
        yield value

def generate_batch(pairs_gen, pairs, n_positives, negative_ratio, n_playlists, n_tracks):
    batch_size = int(n_positives * (1 + negative_ratio))
    batch = np.zeros((batch_size, 3))
    cnt = 0
    lock = threading.Lock()
    while True:
        with lock:
            start_time = time.time()
            # Fill the first n_positives rows with true (playlist, track) pairs
            for i, (playlist, song) in enumerate(random.sample(pairs_gen, n_positives)):
                batch[i, :] = (playlist, song, 1)
            # Fill the remaining rows with random pairs that do not appear in the data
            i = n_positives
            while i < batch_size:
                random_playlist = random.randrange(n_playlists)
                random_track = random.randrange(n_tracks)
                if (random_playlist, random_track) not in pairs:
                    batch[i, :] = (random_playlist, random_track, 0)
                    i += 1
            np.random.shuffle(batch)
            X = {'playlist': batch[:, 0], 'track': batch[:, 1]}
            y = batch[:, 2]
            cnt += 1
            yield X, y
def playlist_embedding_model(embedding_size=75, classification=True):
    """Model to embed playlists and tracks using the Keras functional API.
    Trained to discern whether a track is present in a playlist."""
    # Both inputs are 1-dimensional
    playlist = Input(name='playlist', shape=[1])
    track = Input(name='track', shape=[1])
    # Embedding the playlist (shape will be (None, 1, embedding_size))
    playlist_embedding = Embedding(name='playlist_embedding',
                                   input_dim=n_playlist,
                                   output_dim=embedding_size)(playlist)
    # Embedding the track (shape will be (None, 1, embedding_size))
    track_embedding = Embedding(name='track_embedding',
                                input_dim=n_unique_trackids,
                                output_dim=embedding_size)(track)
    # Merge the layers with a dot product along the second axis (shape will be (None, 1, 1))
    merged = Dot(name='dot_product', normalize=True, axes=2)([playlist_embedding, track_embedding])
    # Reshape to be a single number (shape will be (None, 1))
    merged = Reshape(target_shape=[1])(merged)
    # If classification, add an extra layer and use binary cross-entropy loss
    if classification:
        merged = Dense(1, activation='sigmoid')(merged)
        model = Model(inputs=[playlist, track], outputs=merged)
        model.compile(optimizer='Adam', loss='binary_crossentropy', metrics=['accuracy'])
    # Otherwise the loss function is mean squared error
    else:
        model = Model(inputs=[playlist, track], outputs=merged)
        model.compile(optimizer='Adam', loss='mse')
    return model
# Instantiate model and show parameters
model = playlist_embedding_model()
model.summary()
Model: "model_1"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
playlist (InputLayer) [(None, 1)] 0
__________________________________________________________________________________________________
track (InputLayer) [(None, 1)] 0
__________________________________________________________________________________________________
playlist_embedding (Embedding) (None, 1, 75) 7500000 playlist[0][0]
__________________________________________________________________________________________________
track_embedding (Embedding) (None, 1, 75) 51135375 track[0][0]
__________________________________________________________________________________________________
dot_product (Dot) (None, 1, 1) 0 playlist_embedding[0][0]
track_embedding[0][0]
__________________________________________________________________________________________________
reshape_1 (Reshape) (None, 1) 0 dot_product[0][0]
__________________________________________________________________________________________________
dense_1 (Dense) (None, 1) 2 reshape_1[0][0]
==================================================================================================
Total params: 58,635,377
Trainable params: 58,635,377
Non-trainable params: 0
__________________________________________________________________________________________________
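The training run itself is not reproduced here, but for reference a minimal sketch of how the model could be fit with the batch generator is shown below; the n_positives, negative_ratio, steps, and epoch values are illustrative assumptions, not the hyperparameters of the original run.

# Sketch only: hypothetical training setup using the generator defined above
train_gen = generate_batch(playlist_song_pair_train, playlist_song_pair_set,
                           n_positives=1024, negative_ratio=2,
                           n_playlists=n_playlist, n_tracks=n_unique_trackids)
val_gen = generate_batch(playlist_song_pair_val, playlist_song_pair_set,
                         n_positives=1024, negative_ratio=2,
                         n_playlists=n_playlist, n_tracks=n_unique_trackids)
history = model.fit(train_gen,
                    steps_per_epoch=len(playlist_song_pair_train) // 1024,
                    validation_data=val_gen,
                    validation_steps=len(playlist_song_pair_val) // 1024,
                    epochs=10)  # epoch count is an assumption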
test_mod1 = next(generate_batch(playlist_song_pair_test, playlist_song_pair_set, len(playlist_song_pair_test), 2, n_playlist, n_unique_trackids))
y_true = test_mod1[1]
y_score = model.predict(test_mod1[0])
y_pred = y_score > 0.5
accuracy_score(y_true,y_pred)
-----------------------------------------------------------------------------------------------------------------------------
0.9244788346779823
The model's accuracy on the test set is 0.924 at correctly distinguishing whether a playlist contains a track, which is not bad. However, the plots of model accuracy and model loss against epoch show that there is still some overfitting.
confusion_matrix(y_true,y_pred)
-----------------------------------------------------------------------------------------------------------------------------
array([[2596777, 38855],
[ 259714, 1058102]])
From the confusion matrix, the false negative count is considerably higher than the false positive count: many pairs that truly belong to a playlist are predicted as absent, while relatively few absent pairs are predicted as present. Neither type of error is as problematic as it may appear, because a true value of 0 only means the track is not already in the playlist, not that it is unsuitable for recommendation. The false positives can therefore be interpreted as tracks that are not yet in the playlist but have the potential to be added to it.
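As a quick check, the per-class error rates follow directly from the confusion matrix (in scikit-learn's convention, rows are true labels and columns are predictions):

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
fpr = fp / (fp + tn)  # fraction of true negatives predicted as positive, about 0.015
fnr = fn / (fn + tp)  # fraction of true positives predicted as negative, about 0.197
print(f"False positive rate: {fpr:.3f}, false negative rate: {fnr:.3f}")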
def modified_playlist_embedding_model(embedding_size=75, classification=True, bias=1):
    """Model to embed playlists and tracks using the Keras functional API,
    with additional per-playlist and per-track bias embeddings."""
    # Both inputs are 1-dimensional
    playlist = Input(name='playlist', shape=[1])
    track = Input(name='track', shape=[1])
    # Embedding the playlist (shape will be (None, 1, embedding_size))
    playlist_embedding = Embedding(name='playlist_embedding',
                                   input_dim=n_playlist,
                                   output_dim=embedding_size)(playlist)
    playlist_bias = Embedding(n_playlist, bias, name="playlist_bias")(playlist)
    # Embedding the track (shape will be (None, 1, embedding_size))
    track_embedding = Embedding(name='track_embedding',
                                input_dim=n_unique_trackids,
                                output_dim=embedding_size)(track)
    track_bias = Embedding(n_unique_trackids, bias, name="track_bias")(track)
    # Merge the layers with a dot product along the second axis (shape will be (None, 1, 1))
    merged = Dot(name='dot_product', normalize=True, axes=2)([playlist_embedding, track_embedding])
    # Concatenate the dot product with the two bias terms and flatten to (None, 3)
    input_terms = concatenate([merged, playlist_bias, track_bias])
    input_terms = Flatten()(input_terms)
    # If classification, add a hidden layer with dropout and use binary cross-entropy loss
    if classification:
        dense_1 = Dense(20, activation="relu", name="Dense1")(input_terms)
        dense_1 = Dropout(0.2)(dense_1)
        merged = Dense(1, activation='sigmoid')(dense_1)
        model = Model(inputs=[playlist, track], outputs=merged)
        model.compile(optimizer='Adam', loss='binary_crossentropy', metrics=['accuracy'])
    # Otherwise the loss function is mean squared error
    else:
        model = Model(inputs=[playlist, track], outputs=merged)
        model.compile(optimizer='Adam', loss='mse')
    return model
# Instantiate model and show parameters
modified_model = modified_playlist_embedding_model()
modified_model.summary()
Model: "model_2"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
playlist (InputLayer) [(None, 1)] 0
__________________________________________________________________________________________________
track (InputLayer) [(None, 1)] 0
__________________________________________________________________________________________________
playlist_embedding (Embedding) (None, 1, 75) 7500000 playlist[0][0]
__________________________________________________________________________________________________
track_embedding (Embedding) (None, 1, 75) 51135375 track[0][0]
__________________________________________________________________________________________________
dot_product (Dot) (None, 1, 1) 0 playlist_embedding[0][0]
track_embedding[0][0]
__________________________________________________________________________________________________
playlist_bias (Embedding) (None, 1, 1) 100000 playlist[0][0]
__________________________________________________________________________________________________
track_bias (Embedding) (None, 1, 1) 681805 track[0][0]
__________________________________________________________________________________________________
concatenate (Concatenate) (None, 1, 3) 0 dot_product[0][0]
playlist_bias[0][0]
track_bias[0][0]
__________________________________________________________________________________________________
flatten (Flatten) (None, 3) 0 concatenate[0][0]
__________________________________________________________________________________________________
Dense1 (Dense) (None, 20) 80 flatten[0][0]
__________________________________________________________________________________________________
dropout (Dropout) (None, 20) 0 Dense1[0][0]
__________________________________________________________________________________________________
dense_2 (Dense) (None, 1) 21 dropout[0][0]
==================================================================================================
Total params: 59,417,281
Trainable params: 59,417,281
Non-trainable params: 0
__________________________________________________________________________________________________
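The ROC comparison below uses y_score_mod, the modified model's predicted scores on the same held-out batch. Its computation is not shown in the original output; presumably it mirrors the first model's evaluation, roughly as follows (a sketch, assuming modified_model has already been trained with the same generator):

# Sketch (assumption): score the same test batch with the trained modified model
y_score_mod = modified_model.predict(test_mod1[0])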
roc1 = roc_curve(y_true, y_score)
roc2 = roc_curve(y_true, y_score_mod)
auc_score1 = roc_auc_score(y_true, y_score)
auc_score2 = roc_auc_score(y_true, y_score_mod)
figure = plt.figure(figsize=(10, 8))
plt.plot(roc1[0], roc1[1], label="Model 1, AUC score = {}".format(auc_score1))
plt.plot(roc2[0], roc2[1], label="Model 2, AUC score = {}".format(auc_score2))
plt.title("ROC curve")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.plot([0, 1], [0, 1], "-.")
plt.legend()
The AUC score for the first model is 0.9649 and the AUC score for the second model is 0.9316.
The first model also generalizes better to the validation and test sets, indicating that it performs slightly better than the modified model.
Although the second model incorporates more terms and has more layers than the first, it does not improve predictive power; the added bias layers do not help in this case.
Therefore, we decided to adopt the first, unmodified model for further analysis.
From the neural network model with embeddings, we not only get predictions of whether a song belongs in a playlist, but also the learned embeddings of playlists and songs in a dimension-reduced space. More importantly, after normalization, the dot product of the weight vectors of two items is the cosine similarity of those items. In other words, given a playlist (or song), we can find the most similar playlists (or songs) in the full set and recommend them to the user.
For this section, we used the weights obtained from model 1 above:
playlist_layer = model.get_layer('playlist_embedding')
track_layer = model.get_layer('track_embedding')
playlist_weights = playlist_layer.get_weights()[0]
track_weights = track_layer.get_weights()[0]
playlist_weights = playlist_weights / np.linalg.norm(playlist_weights, axis = 1).reshape((-1, 1))
track_weights = track_weights / np.linalg.norm(track_weights, axis = 1).reshape((-1, 1))
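Since each row is now a unit vector, the dot product of any two rows is exactly their cosine similarity. For example, using playlist indices 67 and 580 (the "chill" and "rock" playlists queried below):

# Cosine similarity between two playlists after row-normalization
print(np.dot(playlist_weights[67], playlist_weights[580]))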
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight')
plt.rcParams['font.size'] = 15
def find_similar(idx, weights, index_name='playlist', n=10, least=False, return_dist=False, plot=False):
    """Find the n most (or least) similar items to the item at index idx based on embeddings.
    Option to also plot the results."""
    # Select the index mapping
    if index_name == 'playlist':
        index = idx_playlist
    elif index_name == 'track':
        index = idx_track
    # Check to make sure idx is in the index
    try:
        dists = np.dot(weights, weights[idx])
    except KeyError:
        print(f'{idx} Not Found.')
        return
    # Sort distance indexes from smallest to largest
    sorted_dists = np.argsort(dists)
    # Plot results if specified
    if plot:
        # Find furthest and closest items
        furthest = sorted_dists[:(n // 2)]
        closest = sorted_dists[-n - 1: len(dists) - 1]
        if index_name == "track":
            items = [track_name[index[c]] for c in furthest]
            items.extend(track_name[index[c]] for c in closest)
        else:
            items = [index[c] for c in furthest]
            items.extend(index[c] for c in closest)
        # Find furthest and closest distances
        distances = [dists[c] for c in furthest]
        distances.extend(dists[c] for c in closest)
        colors = ['r' for _ in range(n // 2)]
        colors.extend('g' for _ in range(n))
        data = pd.DataFrame({'distance': distances}, index=items)
        if index_name == "track":
            # Horizontal bar chart
            data['distance'].plot.barh(color=colors, figsize=(10, 8),
                                       edgecolor='k', linewidth=2)
            plt.xlabel('Cosine Similarity')
            plt.axvline(x=0, color='k')
            # Formatting for italicized title
            name_str = f'{index_name.capitalize()}s Most and Least Similar to'
            for word in track_name[index[idx]].split():
                # Title uses LaTeX to italicize
                name_str += ' $\it{' + word + '}$'
            plt.title(name_str, x=0.2, size=28, y=1.05)
            return None
        elif index_name == "playlist":
            data['distance'].plot.barh(color=colors, figsize=(10, 8),
                                       edgecolor='k', linewidth=2)
            plt.xlabel('Cosine Similarity')
            plt.axvline(x=0, color='k')
            name_str = f'{index_name.capitalize()}s Most and Least Similar to'
            for word in index[idx].split():
                name_str += ' $\it{' + word + '}$'
            plt.title(name_str, x=0.2, size=28, y=1.05)
    if least:
        # Take the first n from sorted distances
        closest = sorted_dists[:n]
        print(f'{index_name.capitalize()}s furthest from {index[idx]}.\n')
    # Otherwise find the most similar
    else:
        # Take the last n sorted distances
        closest = sorted_dists[-n:]
        # Need distances later on
        if return_dist:
            return dists, closest
        if index_name == "playlist":
            print(f'{index_name.capitalize()}s closest to {index[idx]}.\n')
        else:
            print(f'{index_name.capitalize()}s closest to {track_name[index[idx]]}.\n')
    # Need distances later on
    if return_dist:
        return dists, closest
    # Print formatting
    max_width = max([len(index[c]) for c in closest])
    # Print the most similar items and their distances
    for c in reversed(closest):
        if index_name == "playlist":
            print(f'{index_name.capitalize()}: {index[c]:{max_width + 2}}({c}) Similarity: {dists[c]:.{2}}')
        else:
            print(f'{index_name.capitalize()}: {track_name[index[c]]:{max_width + 2}}({c}) Similarity: {dists[c]:.{2}}')
Now we would like to see how this method of item-to-item and user-to-user recommendation works by exploring the Million Playlist Dataset with our search function find_similar().
First, we do playlist-to-playlist recommendation:
# Top 10 most common playlist names
list(count_items(idx_playlist.values()))[:10]
-----------------------------------------------------------------------------------------------------------------------------
['Country',
'Chill',
'country',
'chill',
'Christmas',
'Rap',
'Workout',
'Rock',
'Oldies',
'workout']
Good. Let's check what we get if we look for the playlists most similar to the first "chill" playlist (with playlist_id = 67):
find_similar(67, playlist_weights, "playlist")
-----------------------------------------------------------------------------------------------------------------------------
Playlists closest to chill.
Playlist: chill (67) Similarity: 1.0
Playlist: Sleep (60872) Similarity: 0.73
Playlist: chilllllll (3155) Similarity: 0.72
Playlist: love (263) Similarity: 0.72
Playlist: my style (79028) Similarity: 0.72
Playlist: Vibes (83919) Similarity: 0.72
Playlist: deep (10822) Similarity: 0.71
Playlist: slow (95088) Similarity: 0.71
Playlist: cameron. (79791) Similarity: 0.71
Playlist: Ocean (95034) Similarity: 0.71
Although playlist names are user-defined and a name does not always reflect the playlist's actual content, we can still see a pattern here: the playlists most similar to the "chill" playlist tend to have "softer" names, such as "Sleep", "chilllllll", "deep", and "slow".
What about playlist "rock" (playlist_id = 580)?
find_similar(580, playlist_weights, "playlist")
-----------------------------------------------------------------------------------------------------------------------------
Playlists closest to rock.
Playlist: rock (580) Similarity: 1.0
Playlist: Classic Rock (51536) Similarity: 0.91
Playlist: Classic Rock!!!! (89066) Similarity: 0.9
Playlist: Oldy (31379) Similarity: 0.9
Playlist: Oldies (54852) Similarity: 0.9
Playlist: rock (40893) Similarity: 0.9
Playlist: Rock Out (44510) Similarity: 0.89
Playlist: classics (98893) Similarity: 0.89
Playlist: Party (80299) Similarity: 0.89
Playlist: classic rock (38917) Similarity: 0.89
Comparing the distribution of similarities between the first "rock" playlist and other "rock" playlists with its similarities to "chill" playlists, we can clearly see that the "rock" playlist is more similar to other "rock" playlists than to "chill" playlists.
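That comparison is not reproduced in the output above; a sketch of how the two similarity distributions could be computed from the objects defined earlier (our own illustration, not code from the original run):

# Similarities of the first "rock" playlist (index 580) to playlists named "rock" vs. "chill"
rock_sims = [np.dot(playlist_weights[580], playlist_weights[p])
             for p, name in idx_playlist.items() if name.lower() == 'rock' and p != 580]
chill_sims = [np.dot(playlist_weights[580], playlist_weights[p])
              for p, name in idx_playlist.items() if name.lower() == 'chill']
print(np.mean(rock_sims), np.mean(chill_sims))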
Example: We want to recommend songs to users who love Halo by Beyoncé:
find_similar(track_idx["spotify:track:2CvOqDpQIMw69cCzWqr5yr"],track_weights, "track",20)
-----------------------------------------------------------------------------------------------------------------------------
Tracks closest to Halo.
Track: Halo (222585) Similarity: 1.0
Track: Irreplaceable (457932) Similarity: 0.93
Track: No One (70271) Similarity: 0.92
Track: Best Thing I Never Had (221610) Similarity: 0.92
Track: If I Were a Boy (608553) Similarity: 0.92
Track: Take A Bow - Main (117995) Similarity: 0.91
Track: Bleeding Love (222284) Similarity: 0.91
Track: Love On Top (261244) Similarity: 0.91
Track: No Air (673503) Similarity: 0.9
Track: Girl On Fire (255886) Similarity: 0.9
Track: Stay - Album Version (Edited) (86096) Similarity: 0.9
Track: Rolling in the Deep (124625) Similarity: 0.9
Track: Unwritten (279580) Similarity: 0.9
Track: Big Girls Don't Cry (Personal) (151650) Similarity: 0.9
Track: Too Little, Too Late - Radio Version (307074) Similarity: 0.89
Track: Just Give Me a Reason (529274) Similarity: 0.89
Track: No Scrubs (115642) Similarity: 0.89
Track: When I Was Your Man (143176) Similarity: 0.89
Track: If I Ain't Got You (433781) Similarity: 0.89
Track: Crazy In Love (645225) Similarity: 0.89
We can also plot the most and least similar songs for a specific track, say 24K Magic by Bruno Mars:
find_similar(track_idx["spotify:track:6b8Be6ljOzmkOmFslEb23P"],track_weights, "track", 10, plot = True)
By searching for similar playlists and songs based on the cosine similarity derived from the embedding model, we get a general idea of how well this crude recommendation strategy works. As a collaborative filtering approach, the embedding neural network relates songs that are distributed similarly across playlists, and relates playlists that have similar song compositions. This approach works well for certain populations of users, especially those who mostly listen to trendy music, since similar trendy songs are easy to find. However, it has a major drawback: less popular music, including much excellent independent music, appears in fewer playlists and therefore has a lower chance of being recommended. If we rely solely on collaborative filtering, the system drifts into a filter bubble in which popular songs become ever more popular while less popular music becomes even less visible.

One way to counter this is to introduce randomness so that we do not always recommend the most relevant songs, as sketched below. To further improve the model, we would also incorporate content-based information such as the audio features and genres of the songs, so that song similarity depends not only on co-occurrence across playlists but also on shared audio features.
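A minimal sketch of that idea (our own illustration, not part of the original pipeline; the function name and parameters are hypothetical): instead of always returning the top-n most similar tracks, sample n tracks from a larger top-k pool with probabilities weighted by similarity.

def sample_recommendations(idx, weights, n=10, pool_size=100, rng=None):
    """Sample n recommendations from the pool_size most similar items, weighted by similarity."""
    rng = np.random.default_rng() if rng is None else rng
    dists = np.dot(weights, weights[idx])           # cosine similarities (weights are row-normalized)
    pool = np.argsort(dists)[-(pool_size + 1):-1]   # top pool_size candidates, excluding the query itself
    sims = dists[pool] - dists[pool].min() + 1e-12  # shift so all sampling weights are positive
    probs = sims / sims.sum()
    return rng.choice(pool, size=n, replace=False, p=probs)

# Example: a randomized alternative to find_similar for the Halo track queried above
sample_recommendations(track_idx["spotify:track:2CvOqDpQIMw69cCzWqr5yr"], track_weights)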