playlist_dict: A dictionary with playlist ids as the keys, and as values a list containing the playlist name together with the track id, track name, and artist name of every track in the playlist.

playlist_dict = {}
for i in range(100):
    with open("mpd.v1/data/mpd.slice.{}-{}.json".format(i*1000, (i+1)*1000-1), 'r') as dt:
        data = json.load(dt)
    for playlist in data['playlists']:
        playlist_dict[playlist['pid']] = [playlist['name'],
                                          [[tracks['track_uri'], tracks['track_name'], tracks['artist_name']]
                                           for tracks in playlist['tracks']]]
idx_playlist: A dictionary with playlist ids as the keys, and playlist names as the values.

idx_playlist = {playlist: playlist_dict[playlist][0] for playlist in playlist_dict}
n_playlist = len(idx_playlist)
track_name: A dictionary with track ids as keys, and track names as values.
track_artist: A dictionary with track ids as keys, and artists of the tracks as values.

track_name_list = []
track_artist_list = []
for i, v in playlist_dict.items():
    track_name_list += (np.array(v[1])[:, 0:2]).tolist()
    track_artist_list += (np.array(v[1])[:, (0, 2)]).tolist()
track_name = {tup[0]: tup[1] for tup in track_name_list}
track_artist = {tup[0]: tup[1] for tup in track_artist_list}
idx_track: A dictionary with track indices as keys, and track ids as values.
track_idx: A dictionary with track ids as keys, and track indices as values, the reverse of idx_track.

trackids = []
for i in range(n_playlist):
    trackids += list(chain(np.array(playlist_dict[i][1])[:, 0]))
n_trackids = len(trackids)
unique_trackids = set(trackids)
n_unique_trackids = len(unique_trackids)
idx_track = {idx: trackid for (idx, trackid) in enumerate(unique_trackids)}
track_idx = {trackid: idx for (idx, trackid) in idx_track.items()}
playlist_song_pair_train: 4744137 unique (playlist index, track index) pairs in our training dataset.
playlist_song_pair_val: 527126 unique (playlist index, track index) pairs in our validation dataset.
playlist_song_pair_test: 1317816 unique (playlist index, track index) pairs in our test dataset.

playlist_song_pair = []
for i, v in playlist_dict.items():
    playlist_song_pair.extend((i, track_idx[song]) for song in np.array(v[1])[:, 0])
playlist_song_pair_set = set(playlist_song_pair)
playlist_song_pair_uniq = list(playlist_song_pair_set)
random.shuffle(playlist_song_pair_uniq)
n_train = int(0.8 * len(playlist_song_pair_uniq))
n_val = int(0.1 * n_train)
playlist_song_pair_train = playlist_song_pair_uniq[:(n_train - n_val)]
playlist_song_pair_val = playlist_song_pair_uniq[(n_train - n_val):n_train]
playlist_song_pair_test = playlist_song_pair_uniq[n_train:]
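As a quick sanity check, the three splits have exactly the sizes quoted above (an effective 72% / 8% / 20% split of the 6,589,079 unique pairs):

print(len(playlist_song_pair_train), len(playlist_song_pair_val), len(playlist_song_pair_test))
# 4744137 527126 1317816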
The generator function generate_batch is designed to generate batches of training, validation, and test data.

generate_batch takes:
pairs_gen: the train, validation, or test playlist-song pair set
pairs: the set of all playlist-song pairs
n_positives: the number of positive responses in a batch
negative_ratio: the number of negative samples drawn per positive sample
n_playlists: the number of unique playlists
n_tracks: the number of unique tracks

and returns:
A batch X mixing pairs from pairs_gen with randomly drawn pairs not in pairs, and a batch of corresponding response variables y (1 = the playlist contains the track, 0 = the playlist does not contain the track).
import threading

def locked_iter(it):
    """Wrap an iterator so that advancing it is protected by a lock (thread-safe iteration)."""
    it = iter(it)
    lock = threading.Lock()
    while True:
        try:
            with lock:
                value = next(it)
        except StopIteration:
            return
        yield value

def generate_batch(pairs_gen, pairs, n_positives, negative_ratio, n_playlists, n_tracks):
    batch_size = int(n_positives * (1 + negative_ratio))
    batch = np.zeros((batch_size, 3))
    cnt = 0
    lock = threading.Lock()
    while True:
        with lock:
            start_time = time.time()
            # Fill the first n_positives rows with true (playlist, track) pairs
            for i, (playlist, song) in enumerate(random.sample(pairs_gen, n_positives)):
                batch[i, :] = (playlist, song, 1)
            # Fill the remaining rows with random pairs that do not appear in the data
            i = n_positives
            while i < batch_size:
                random_playlist = random.randrange(n_playlists)
                random_track = random.randrange(n_tracks)
                if (random_playlist, random_track) not in pairs:
                    batch[i, :] = (random_playlist, random_track, 0)
                    i += 1
            np.random.shuffle(batch)
            X = {'playlist': batch[:, 0], 'track': batch[:, 1]}
            y = batch[:, 2]
            cnt += 1
            yield X, y
def playlist_embedding_model(embedding_size=75, classification=True):
    """Model to embed playlists and tracks using the Keras functional API.
    Trained to discern whether a track is present in a playlist."""
    # Both inputs are 1-dimensional
    playlist = Input(name='playlist', shape=[1])
    track = Input(name='track', shape=[1])
    # Embedding the playlist (shape will be (None, 1, embedding_size))
    playlist_embedding = Embedding(name='playlist_embedding',
                                   input_dim=n_playlist,
                                   output_dim=embedding_size)(playlist)
    # Embedding the track (shape will be (None, 1, embedding_size))
    track_embedding = Embedding(name='track_embedding',
                                input_dim=n_unique_trackids,
                                output_dim=embedding_size)(track)
    # Merge the layers with a dot product along the second axis (shape will be (None, 1, 1))
    merged = Dot(name='dot_product', normalize=True, axes=2)([playlist_embedding, track_embedding])
    # Reshape to be a single number (shape will be (None, 1))
    merged = Reshape(target_shape=[1])(merged)
    # If classification, add an extra layer and use binary cross-entropy loss
    if classification:
        merged = Dense(1, activation='sigmoid')(merged)
        model = Model(inputs=[playlist, track], outputs=merged)
        model.compile(optimizer='Adam', loss='binary_crossentropy', metrics=['accuracy'])
    # Otherwise the loss function is mean squared error
    else:
        model = Model(inputs=[playlist, track], outputs=merged)
        model.compile(optimizer='Adam', loss='mse')
    return model
# Instantiate model and show parameters
model = playlist_embedding_model()
model.summary()
Model: "model_1"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
playlist (InputLayer) [(None, 1)] 0
__________________________________________________________________________________________________
track (InputLayer) [(None, 1)] 0
__________________________________________________________________________________________________
playlist_embedding (Embedding) (None, 1, 75) 7500000 playlist[0][0]
__________________________________________________________________________________________________
track_embedding (Embedding) (None, 1, 75) 51135375 track[0][0]
__________________________________________________________________________________________________
dot_product (Dot) (None, 1, 1) 0 playlist_embedding[0][0]
track_embedding[0][0]
__________________________________________________________________________________________________
reshape_1 (Reshape) (None, 1) 0 dot_product[0][0]
__________________________________________________________________________________________________
dense_1 (Dense) (None, 1) 2 reshape_1[0][0]
==================================================================================================
Total params: 58,635,377
Trainable params: 58,635,377
Non-trainable params: 0
__________________________________________________________________________________________________
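The training run itself is not reproduced here, but for reference a minimal sketch of how the model could be fit with the batch generator is shown below; the n_positives, negative_ratio, steps, and epoch values are illustrative assumptions, not the hyperparameters of the original run.

# Sketch only: hypothetical training setup using the generator defined above
train_gen = generate_batch(playlist_song_pair_train, playlist_song_pair_set,
                           n_positives=1024, negative_ratio=2,
                           n_playlists=n_playlist, n_tracks=n_unique_trackids)
val_gen = generate_batch(playlist_song_pair_val, playlist_song_pair_set,
                         n_positives=1024, negative_ratio=2,
                         n_playlists=n_playlist, n_tracks=n_unique_trackids)
history = model.fit(train_gen,
                    steps_per_epoch=len(playlist_song_pair_train) // 1024,
                    validation_data=val_gen,
                    validation_steps=len(playlist_song_pair_val) // 1024,
                    epochs=10)  # epoch count is an assumption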
test_mod1 = next(generate_batch(playlist_song_pair_test, playlist_song_pair_set, len(playlist_song_pair_test), 2, n_playlist, n_unique_trackids))
y_true = test_mod1[1]
y_score = model.predict(test_mod1[0])
y_pred = y_score > 0.5
accuracy_score(y_true,y_pred)
-----------------------------------------------------------------------------------------------------------------------------
0.9244788346779823
The model's accuracy on the test set is 0.924 at correctly distinguishing whether a playlist contains a track, which is not bad. However, the plots of model accuracy and model loss against epoch show that there is still some overfitting.
confusion_matrix(y_true,y_pred)
-----------------------------------------------------------------------------------------------------------------------------
array([[2596777, 38855],
[ 259714, 1058102]])
From the confusion matrix, the false negative count is considerably higher than the false positive count: many pairs that truly belong to a playlist are predicted as absent, while relatively few absent pairs are predicted as present. Neither type of error is as problematic as it may appear, because a true value of 0 only means the track is not already in the playlist, not that it is unsuitable for recommendation. The false positives can therefore be interpreted as tracks that are not yet in the playlist but have the potential to be added to it.
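As a quick check, the per-class error rates follow directly from the confusion matrix (in scikit-learn's convention, rows are true labels and columns are predictions):

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
fpr = fp / (fp + tn)  # fraction of true negatives predicted as positive, about 0.015
fnr = fn / (fn + tp)  # fraction of true positives predicted as negative, about 0.197
print(f"False positive rate: {fpr:.3f}, false negative rate: {fnr:.3f}")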
def modified_playlist_embedding_model(embedding_size=75, classification=True, bias=1):
    """Model to embed playlists and tracks using the Keras functional API,
    with additional per-playlist and per-track bias embeddings."""
    # Both inputs are 1-dimensional
    playlist = Input(name='playlist', shape=[1])
    track = Input(name='track', shape=[1])
    # Embedding the playlist (shape will be (None, 1, embedding_size))
    playlist_embedding = Embedding(name='playlist_embedding',
                                   input_dim=n_playlist,
                                   output_dim=embedding_size)(playlist)
    playlist_bias = Embedding(n_playlist, bias, name="playlist_bias")(playlist)
    # Embedding the track (shape will be (None, 1, embedding_size))
    track_embedding = Embedding(name='track_embedding',
                                input_dim=n_unique_trackids,
                                output_dim=embedding_size)(track)
    track_bias = Embedding(n_unique_trackids, bias, name="track_bias")(track)
    # Merge the layers with a dot product along the second axis (shape will be (None, 1, 1))
    merged = Dot(name='dot_product', normalize=True, axes=2)([playlist_embedding, track_embedding])
    # Concatenate the dot product with the two bias terms and flatten to (None, 3)
    input_terms = concatenate([merged, playlist_bias, track_bias])
    input_terms = Flatten()(input_terms)
    # If classification, add a hidden layer with dropout and use binary cross-entropy loss
    if classification:
        dense_1 = Dense(20, activation="relu", name="Dense1")(input_terms)
        dense_1 = Dropout(0.2)(dense_1)
        merged = Dense(1, activation='sigmoid')(dense_1)
        model = Model(inputs=[playlist, track], outputs=merged)
        model.compile(optimizer='Adam', loss='binary_crossentropy', metrics=['accuracy'])
    # Otherwise the loss function is mean squared error
    else:
        model = Model(inputs=[playlist, track], outputs=merged)
        model.compile(optimizer='Adam', loss='mse')
    return model
# Instantiate model and show parameters
modified_model = modified_playlist_embedding_model()
modified_model.summary()
Model: "model_2"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
playlist (InputLayer) [(None, 1)] 0
__________________________________________________________________________________________________
track (InputLayer) [(None, 1)] 0
__________________________________________________________________________________________________
playlist_embedding (Embedding) (None, 1, 75) 7500000 playlist[0][0]
__________________________________________________________________________________________________
track_embedding (Embedding) (None, 1, 75) 51135375 track[0][0]
__________________________________________________________________________________________________
dot_product (Dot) (None, 1, 1) 0 playlist_embedding[0][0]
track_embedding[0][0]
__________________________________________________________________________________________________
playlist_bias (Embedding) (None, 1, 1) 100000 playlist[0][0]
__________________________________________________________________________________________________
track_bias (Embedding) (None, 1, 1) 681805 track[0][0]
__________________________________________________________________________________________________
concatenate (Concatenate) (None, 1, 3) 0 dot_product[0][0]
playlist_bias[0][0]
track_bias[0][0]
__________________________________________________________________________________________________
flatten (Flatten) (None, 3) 0 concatenate[0][0]
__________________________________________________________________________________________________
Dense1 (Dense) (None, 20) 80 flatten[0][0]
__________________________________________________________________________________________________
dropout (Dropout) (None, 20) 0 Dense1[0][0]
__________________________________________________________________________________________________
dense_2 (Dense) (None, 1) 21 dropout[0][0]
==================================================================================================
Total params: 59,417,281
Trainable params: 59,417,281
Non-trainable params: 0
__________________________________________________________________________________________________
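The ROC comparison below uses y_score_mod, the modified model's predicted scores on the same held-out batch. Its computation is not shown in the original output; presumably it mirrors the first model's evaluation, roughly as follows (a sketch, assuming modified_model has already been trained with the same generator):

# Sketch (assumption): score the same test batch with the trained modified model
y_score_mod = modified_model.predict(test_mod1[0])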
roc1 = roc_curve(y_true, y_score)
roc2 = roc_curve(y_true, y_score_mod)
auc_score1 = roc_auc_score(y_true, y_score)
auc_score2 = roc_auc_score(y_true, y_score_mod)
figure = plt.figure(figsize=(10, 8))
plt.plot(roc1[0], roc1[1], label="Model 1, AUC score = {}".format(auc_score1))
plt.plot(roc2[0], roc2[1], label="Model 2, AUC score = {}".format(auc_score2))
plt.title("ROC curve")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.plot([0, 1], [0, 1], "-.")
plt.legend()
The AUC score for the first model is 0.9649 and the AUC score for the second model is 0.9316.
The first model also generalizes better to the validation and test sets, indicating that it performs slightly better than the modified model.
Although the second model incorporates more terms and has more layers than the first, it does not improve predictive power; the added bias layers do not help in this case.
Therefore, we decided to adopt the first, unmodified model for further analysis.
From the neural network model with embeddings, we not only get predictions of whether a song belongs in a playlist, but also the learned embeddings of playlists and songs in a dimension-reduced space. More importantly, after normalization, the dot product of the weight vectors of two items is the cosine similarity of those items. In other words, given a playlist (or song), we can find the most similar playlists (or songs) in the full set and recommend them to the user.
For this section, we used the weights obtained from model 1 above:
playlist_layer = model.get_layer('playlist_embedding')
track_layer = model.get_layer('track_embedding')
playlist_weights = playlist_layer.get_weights()[0]
track_weights = track_layer.get_weights()[0]
playlist_weights = playlist_weights / np.linalg.norm(playlist_weights, axis = 1).reshape((-1, 1))
track_weights = track_weights / np.linalg.norm(track_weights, axis = 1).reshape((-1, 1))
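Since each row is now a unit vector, the dot product of any two rows is exactly their cosine similarity. For example, using playlist indices 67 and 580 (the "chill" and "rock" playlists queried below):

# Cosine similarity between two playlists after row-normalization
print(np.dot(playlist_weights[67], playlist_weights[580]))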
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight')
plt.rcParams['font.size'] = 15
def find_similar(idx, weights, index_name='playlist', n=10, least=False, return_dist=False, plot=False):
    """Find the n most (or least) similar items to the item at index idx based on embeddings.
    Option to also plot the results."""
    # Select the index mapping
    if index_name == 'playlist':
        index = idx_playlist
    elif index_name == 'track':
        index = idx_track
    # Check to make sure idx is in the index
    try:
        dists = np.dot(weights, weights[idx])
    except KeyError:
        print(f'{idx} Not Found.')
        return
    # Sort distance indexes from smallest to largest
    sorted_dists = np.argsort(dists)
    # Plot results if specified
    if plot:
        # Find furthest and closest items
        furthest = sorted_dists[:(n // 2)]
        closest = sorted_dists[-n - 1: len(dists) - 1]
        if index_name == "track":
            items = [track_name[index[c]] for c in furthest]
            items.extend(track_name[index[c]] for c in closest)
        else:
            items = [index[c] for c in furthest]
            items.extend(index[c] for c in closest)
        # Find furthest and closest distances
        distances = [dists[c] for c in furthest]
        distances.extend(dists[c] for c in closest)
        colors = ['r' for _ in range(n // 2)]
        colors.extend('g' for _ in range(n))
        data = pd.DataFrame({'distance': distances}, index=items)
        if index_name == "track":
            # Horizontal bar chart
            data['distance'].plot.barh(color=colors, figsize=(10, 8),
                                       edgecolor='k', linewidth=2)
            plt.xlabel('Cosine Similarity')
            plt.axvline(x=0, color='k')
            # Formatting for italicized title
            name_str = f'{index_name.capitalize()}s Most and Least Similar to'
            for word in track_name[index[idx]].split():
                # Title uses LaTeX to italicize
                name_str += ' $\it{' + word + '}$'
            plt.title(name_str, x=0.2, size=28, y=1.05)
            return None
        elif index_name == "playlist":
            data['distance'].plot.barh(color=colors, figsize=(10, 8),
                                       edgecolor='k', linewidth=2)
            plt.xlabel('Cosine Similarity')
            plt.axvline(x=0, color='k')
            name_str = f'{index_name.capitalize()}s Most and Least Similar to'
            for word in index[idx].split():
                name_str += ' $\it{' + word + '}$'
            plt.title(name_str, x=0.2, size=28, y=1.05)
    if least:
        # Take the first n from sorted distances
        closest = sorted_dists[:n]
        print(f'{index_name.capitalize()}s furthest from {index[idx]}.\n')
    # Otherwise find the most similar
    else:
        # Take the last n sorted distances
        closest = sorted_dists[-n:]
        # Need distances later on
        if return_dist:
            return dists, closest
        if index_name == "playlist":
            print(f'{index_name.capitalize()}s closest to {index[idx]}.\n')
        else:
            print(f'{index_name.capitalize()}s closest to {track_name[index[idx]]}.\n')
    # Need distances later on
    if return_dist:
        return dists, closest
    # Print formatting
    max_width = max([len(index[c]) for c in closest])
    # Print the most similar items and their distances
    for c in reversed(closest):
        if index_name == "playlist":
            print(f'{index_name.capitalize()}: {index[c]:{max_width + 2}}({c}) Similarity: {dists[c]:.{2}}')
        else:
            print(f'{index_name.capitalize()}: {track_name[index[c]]:{max_width + 2}}({c}) Similarity: {dists[c]:.{2}}')
Now we would like to see how this method of item-to-item and user-to-user recommendation works by exploring the Million Playlist Dataset with our search function find_similar().
First, we do playlist-to-playlist recommendation:
# Top 10 most common playlist names
list(count_items(idx_playlist.values()))[:10]
-----------------------------------------------------------------------------------------------------------------------------
['Country',
'Chill',
'country',
'chill',
'Christmas',
'Rap',
'Workout',
'Rock',
'Oldies',
'workout']
Good. Let's check what we get if we look for the playlists most similar to the first "chill" playlist (with playlist_id = 67):
find_similar(67, playlist_weights, "playlist")
-----------------------------------------------------------------------------------------------------------------------------
Playlists closest to chill.
Playlist: chill (67) Similarity: 1.0
Playlist: Sleep (60872) Similarity: 0.73
Playlist: chilllllll (3155) Similarity: 0.72
Playlist: love (263) Similarity: 0.72
Playlist: my style (79028) Similarity: 0.72
Playlist: Vibes (83919) Similarity: 0.72
Playlist: deep (10822) Similarity: 0.71
Playlist: slow (95088) Similarity: 0.71
Playlist: cameron. (79791) Similarity: 0.71
Playlist: Ocean (95034) Similarity: 0.71
Although playlist names are user-defined and a name does not always reflect the playlist's actual content, we can still see a pattern here: the playlists most similar to the "chill" playlist tend to have "softer" names, such as "Sleep", "chilllllll", "deep", and "slow".
What about playlist "rock" (playlist_id = 580)?
find_similar(580, playlist_weights, "playlist")
-----------------------------------------------------------------------------------------------------------------------------
Playlists closest to rock.
Playlist: rock (580) Similarity: 1.0
Playlist: Classic Rock (51536) Similarity: 0.91
Playlist: Classic Rock!!!! (89066) Similarity: 0.9
Playlist: Oldy (31379) Similarity: 0.9
Playlist: Oldies (54852) Similarity: 0.9
Playlist: rock (40893) Similarity: 0.9
Playlist: Rock Out (44510) Similarity: 0.89
Playlist: classics (98893) Similarity: 0.89
Playlist: Party (80299) Similarity: 0.89
Playlist: classic rock (38917) Similarity: 0.89
Comparing the distribution of similarities between the first "rock" playlist and other "rock" playlists with its similarities to "chill" playlists, we can clearly see that the "rock" playlist is more similar to other "rock" playlists than to "chill" playlists.
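That comparison is not reproduced in the output above; a sketch of how the two similarity distributions could be computed from the objects defined earlier (our own illustration, not code from the original run):

# Similarities of the first "rock" playlist (index 580) to playlists named "rock" vs. "chill"
rock_sims = [np.dot(playlist_weights[580], playlist_weights[p])
             for p, name in idx_playlist.items() if name.lower() == 'rock' and p != 580]
chill_sims = [np.dot(playlist_weights[580], playlist_weights[p])
              for p, name in idx_playlist.items() if name.lower() == 'chill']
print(np.mean(rock_sims), np.mean(chill_sims))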
Example: We want to recommend songs to users who love Halo by Beyoncé:
find_similar(track_idx["spotify:track:2CvOqDpQIMw69cCzWqr5yr"],track_weights, "track",20)
-----------------------------------------------------------------------------------------------------------------------------
Tracks closest to Halo.
Track: Halo (222585) Similarity: 1.0
Track: Irreplaceable (457932) Similarity: 0.93
Track: No One (70271) Similarity: 0.92
Track: Best Thing I Never Had (221610) Similarity: 0.92
Track: If I Were a Boy (608553) Similarity: 0.92
Track: Take A Bow - Main (117995) Similarity: 0.91
Track: Bleeding Love (222284) Similarity: 0.91
Track: Love On Top (261244) Similarity: 0.91
Track: No Air (673503) Similarity: 0.9
Track: Girl On Fire (255886) Similarity: 0.9
Track: Stay - Album Version (Edited) (86096) Similarity: 0.9
Track: Rolling in the Deep (124625) Similarity: 0.9
Track: Unwritten (279580) Similarity: 0.9
Track: Big Girls Don't Cry (Personal) (151650) Similarity: 0.9
Track: Too Little, Too Late - Radio Version (307074) Similarity: 0.89
Track: Just Give Me a Reason (529274) Similarity: 0.89
Track: No Scrubs (115642) Similarity: 0.89
Track: When I Was Your Man (143176) Similarity: 0.89
Track: If I Ain't Got You (433781) Similarity: 0.89
Track: Crazy In Love (645225) Similarity: 0.89
We can also plot the most and least similar songs for a specific track, say 24K Magic by Bruno Mars:
find_similar(track_idx["spotify:track:6b8Be6ljOzmkOmFslEb23P"],track_weights, "track", 10, plot = True)
By searching for similar playlists and songs based on the cosine similarity derived from the embedding model, we get a general idea of how well this crude recommendation strategy works. As a collaborative filtering approach, the embedding neural network relates songs that are distributed similarly across playlists, and relates playlists that have similar song compositions. This approach works well for certain populations of users, especially those who mostly listen to trendy music, since similar trendy songs are easy to find. However, it has a major drawback: less popular music, including much excellent independent music, appears in fewer playlists and therefore has a lower chance of being recommended. If we rely solely on collaborative filtering, the system drifts into a filter bubble in which popular songs become ever more popular while less popular music becomes even less visible.

One way to counter this is to introduce randomness so that we do not always recommend the most relevant songs, as sketched below. To further improve the model, we would also incorporate content-based information such as the audio features and genres of the songs, so that song similarity depends not only on co-occurrence across playlists but also on shared audio features.
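A minimal sketch of that idea (our own illustration, not part of the original pipeline; the function name and parameters are hypothetical): instead of always returning the top-n most similar tracks, sample n tracks from a larger top-k pool with probabilities weighted by similarity.

def sample_recommendations(idx, weights, n=10, pool_size=100, rng=None):
    """Sample n recommendations from the pool_size most similar items, weighted by similarity."""
    rng = np.random.default_rng() if rng is None else rng
    dists = np.dot(weights, weights[idx])           # cosine similarities (weights are row-normalized)
    pool = np.argsort(dists)[-(pool_size + 1):-1]   # top pool_size candidates, excluding the query itself
    sims = dists[pool] - dists[pool].min() + 1e-12  # shift so all sampling weights are positive
    probs = sims / sims.sum()
    return rng.choice(pool, size=n, replace=False, p=probs)

# Example: a randomized alternative to find_similar for the Halo track queried above
sample_recommendations(track_idx["spotify:track:2CvOqDpQIMw69cCzWqr5yr"], track_weights)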