Latent Dirichlet Allocation:
Latent Dirichlet allocation (LDA) is a topic model that discovers topics based on word frequencies in a set of documents. LDA is particularly useful for finding reasonably accurate mixtures of topics within a given document collection.
Motivation for topic modelling:
In the previous analysis, we performed genre classification using audio features, and machine learning worked well on our dataset. As a next step, we want to use more of the available information to improve our models. Sound features, or more precisely the numerical features of song tracks, are only part of the information we have. Text data attached to tracks, such as song names, album names, or lyrics, may also contribute to building a better classification system. Since the text elements of songs are powerful descriptions of their themes, we expect them to provide additional features.
Experiment settings:
Next, we use LDA to explore the topics prevalent in song names. Since individual names are extremely short texts, it is reasonable to concatenate all song names within each category. This gives us 29 name documents; the topic modelling results are shown below. We set the number of topics to 5 and show the top 20 words for each topic.
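The output format below matches gensim's `LdaModel.show_topics`, so we presume a gensim-style pipeline with Porter stemming (which explains stems such as "remast" and "singl"). The document-building step can be sketched in plain Python; the track data here is made up for illustration, and a real stemmer would be applied before fitting LDA:

```python
import re
from collections import defaultdict

def tokenize(name):
    """Lowercase a song name and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", name.lower())

def build_category_documents(tracks):
    """Concatenate all song-name tokens per category, yielding one
    'document' per category (29 documents in our dataset)."""
    docs = defaultdict(list)
    for category, song_name in tracks:
        docs[category].extend(tokenize(song_name))
    return dict(docs)

# Toy example with made-up tracks:
tracks = [
    ("Romance", "Love Me Tonight"),
    ("Romance", "Blue Girl (Remastered)"),
    ("Party", "Dance All Night - Radio Edit"),
]
docs = build_category_documents(tracks)
print(docs["Romance"])  # ['love', 'me', 'tonight', 'blue', 'girl', 'remastered']
```

The resulting per-category token lists would then be turned into a dictionary and bag-of-words corpus and fed to the LDA trainer.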
Results:
1. Most prevalent topics:
The 5 topics are shown below. Although the result is not ideal for topic modelling, it more or less gives a sense of the frequent words in each topic. For example, in topic 1, the words love, singl, girl, man, and blue (stemmed forms) suggest that this topic is related to romance.
Topic 0:
[(0, '0.016*"love" + 0.012*"s" + 0.012*"t" + 0.010*"remast" + 0.010*"version" + 0.005*"feat" + 0.005*"o" + 0.005*"live" + 0.005*"don" + 0.005*"blue" + 0.004*"go" + 0.004*"m" + 0.004*"de" + 0.004*"time" + 0.004*"like" + 0.004*"one" + 0.004*"night" + 0.004*"la" + 0.004*"come" + 0.004*"man"'),
Topic 1:
(1, '0.017*"love" + 0.017*"feat" + 0.014*"t" + 0.012*"s" + 0.010*"version" + 0.007*"live" + 0.006*"don" + 0.005*"singl" + 0.005*"girl" + 0.005*"m" + 0.004*"like" + 0.004*"get" + 0.004*"man" + 0.004*"go" + 0.004*"got" + 0.004*"can" + 0.004*"time" + 0.004*"blue" + 0.004*"one" + 0.003*"good"'),
Topic 2:
(2, '0.026*"feat" + 0.019*"remix" + 0.018*"radio" + 0.018*"love" + 0.017*"edit" + 0.015*"t" + 0.010*"s" + 0.010*"version" + 0.009*"don" + 0.007*"mix" + 0.007*"like" + 0.006*"girl" + 0.006*"remast" + 0.005*"danc" + 0.005*"feel" + 0.005*"m" + 0.005*"get" + 0.004*"night" + 0.004*"one" + 0.004*"let"'),
Topic 3:
(3, '0.036*"feat" + 0.030*"remix" + 0.017*"edit" + 0.014*"radio" + 0.013*"mix" + 0.013*"t" + 0.010*"love" + 0.008*"version" + 0.007*"s" + 0.007*"origin" + 0.006*"don" + 0.005*"like" + 0.004*"go" + 0.004*"m" + 0.004*"la" + 0.004*"get" + 0.004*"one" + 0.004*"can" + 0.004*"feel" + 0.004*"back"'),
Topic 4:
(4, '0.010*"teil" + 0.008*"soundtrack" + 0.005*"rock" + 0.004*"001" + 0.004*"002" + 0.004*"panik" + 0.004*"paradi" + 0.004*"im" + 0.004*"beach" + 0.004*"littl" + 0.003*"happi" + 0.003*"song" + 0.003*"pictur" + 0.003*"like" + 0.003*"life" + 0.003*"motion" + 0.003*"let" + 0.002*"sing" + 0.002*"sleep" + 0.002*"babi"')]
2. Distributions of topics in each document:
Some documents cover several topics, while others are dominated by a single topic. For example, the Travel document is 25.56% topic 0, 43.61% topic 1, 2.19% topic 2, and 28.64% topic 3.
Topic distributions for the 29 documents:
Travel:[(0, 0.255575), (1, 0.436148), (2, 0.021875), (3, 0.286381)]
Trending:[(1, 0.450851), (2, 0.198454), (3, 0.3506406)]
Sleep:[(0, 0.999869)]
Blues:[(0, 0.550395), (1, 0.449533)]
pop:[(0, 0.474297), (2, 0.015768), (3, 0.508847)]
workout:[(3, 0.999641)]
electronic:[(0, 0.135474), (3, 0.864448)]
chill:[(0, 0.641179), (1, 0.068993), (3, 0.285374)]
Punk:[(0, 0.437154), (1, 0.560902)]
Party: [(2, 0.778874), (3, 0.220852)]
Party: [(0, 0.999903)]
Jazz:[(0, 0.2915656), (1, 0.402198), (2, 0.1034030), (3, 0.202805)]
Mood: [(1, 0.842564), (2, 0.051694), (3, 0.105687)]
R&B:[(0, 0.675436), (1, 0.31859)]
Country:[(0, 0.999761)]
Classical:[(1, 0.723858), (3, 0.276060)]
hip-hop:[(0, 0.96168), (3, 0.038175)]
Indie:[(0, 0.267252), (1, 0.730265)]
Soul:[(0, 0.98843), (3, 0.011493)]
Romance: [(0, 0.999901)]
Gaming:[(0, 0.029477), (3, 0.97045)]
Dinner:[(0, 0.991775)]
Folk & Americana:[(0, 0.999849)]
Latin:[(3, 0.999923)]
Funk:[(0, 0.999915)]
Comedy:[(1, 0.999921)]
Metal:[(0, 0.99798)]
Kids:[(0, 0.371079), (2, 0.1228861), (3, 0.072841), (4, 0.432084)]
Reggae:[(0, 0.984527), (3, 0.015409)]
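The distributions above are easiest to read by their dominant topic. A minimal sketch, reproducing a few of the distributions above as plain Python data in gensim's sparse (topic_id, weight) format:

```python
# A few of the per-document topic distributions from above.
distributions = {
    "Sleep":  [(0, 0.999869)],
    "Travel": [(0, 0.255575), (1, 0.436148), (2, 0.021875), (3, 0.286381)],
    "Gaming": [(0, 0.029477), (3, 0.97045)],
}

def dominant_topic(dist):
    """Return the (topic_id, weight) pair with the highest weight."""
    return max(dist, key=lambda tw: tw[1])

for name, dist in distributions.items():
    print(name, dominant_topic(dist))
# Sleep is dominated by topic 0; Travel's strongest topic is 1;
# Gaming is dominated by topic 3.
```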
3. Text Similarity and Application to Classification:
From the topic distributions of the documents, some name documents are similar to each other. For example, topic 0 accounts for 99.99% of Funk, 99.80% of Metal, 99.99% of Sleep, 99.99% of Party, 99.98% of Country, 98.84% of Soul, 99.99% of Romance, 99.18% of Dinner, and 99.98% of Folk & Americana. Thus, the categories above share similarity in their song names.
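One way to quantify this similarity is a distance between the topic-distribution vectors. A sketch using the Hellinger distance, which is a common choice for comparing LDA topic distributions (the values are taken from the table above):

```python
import math

def dense(dist, num_topics=5):
    """Expand a sparse gensim-style [(topic, weight), ...] list
    into a dense probability vector."""
    v = [0.0] * num_topics
    for topic, weight in dist:
        v[topic] = weight
    return v

def hellinger(p, q):
    """Hellinger distance between two probability vectors: 0 means
    identical, 1 means completely disjoint support."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2
                               for a, b in zip(p, q)))

funk  = dense([(0, 0.999915)])
metal = dense([(0, 0.99798)])
latin = dense([(3, 0.999923)])
print(hellinger(funk, metal))  # near 0: very similar name documents
print(hellinger(funk, latin))  # near 1: dissimilar name documents
```

Thresholding this distance gives a simple way to group the topic-0-dominated categories listed above.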
This allows us to improve classification efficiency. Since we perform binary classification for every genre, this information can speed up the process. For example, if a song is classified into the Party category, we need only run the binary classifiers for the other similar categories above, rather than for all categories.
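The pruning idea above can be sketched as follows; the group membership here is simply the set of topic-0-dominated categories identified earlier, and in practice the groups would be derived from a similarity threshold:

```python
# Hypothetical similarity groups precomputed from topic distributions.
similar_groups = [
    frozenset({"Funk", "Metal", "Sleep", "Party", "Country",
               "Soul", "Romance", "Dinner", "Folk & Americana"}),
]

def candidate_categories(matched, groups):
    """After a positive match for `matched`, return the other
    categories in its similarity group, i.e. the only binary
    classifiers still worth running."""
    for group in groups:
        if matched in group:
            return sorted(group - {matched})
    return []  # no group found: fall back to testing all categories

print(candidate_categories("Party", similar_groups))
```

A song matched to Party would then be tested against the eight remaining categories in its group instead of all 28 others.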