To accomplish the main goal of the project, we connected songs (nodes) according to the similarity of their lyrics. A directed edge was added whenever two songs shared a given number of common words according to their TF-IDF scores, which is why so much effort went into text processing and lyrics cleaning. We used directed edges because we wanted to take the direction of the links into account: each edge points from the earlier-released song to the later one. Our hypothesis was that this would let us observe how earlier songs inspired subsequent releases.
Nevertheless, to establish the final links of our Beatles Network, we decided to go a step further and look for a good linking criterion. This criterion was based not only on the most common words in each song, but also on the words with the highest TF-IDF scores in each song.
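As an illustration of how the highest-scoring words of each song could be extracted, here is a minimal sketch using only the Python standard library. The function name `top_tfidf_words`, the input format (a dict of already cleaned token lists), the toy lyrics and the plain `tf * log(N / df)` weighting are all assumptions made for this example; the weighting actually used in the project may differ.

```python
import math
from collections import Counter

def top_tfidf_words(lyrics, k):
    """Return, for each song title, the set of its k highest TF-IDF words.

    lyrics: dict mapping title -> list of cleaned tokens (hypothetical format).
    """
    n_songs = len(lyrics)
    # Document frequency: number of songs in which each word appears.
    doc_freq = Counter()
    for tokens in lyrics.values():
        doc_freq.update(set(tokens))

    top_words = {}
    for title, tokens in lyrics.items():
        counts = Counter(tokens)
        total = len(tokens)
        # Term frequency times inverse document frequency (assumed weighting).
        scores = {
            word: (count / total) * math.log(n_songs / doc_freq[word])
            for word, count in counts.items()
        }
        # Sort by descending score; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        top_words[title] = set(ranked[:k])
    return top_words

# Toy lyrics, invented for the example.
toy_lyrics = {
    "Song A": ["love", "love", "me", "do"],
    "Song B": ["help", "me", "help", "somebody"],
    "Song C": ["love", "is", "all", "you", "need"],
}
top = top_tfidf_words(toy_lyrics, k=2)
```

Words that appear in many songs (such as "love" or "me" in the toy data) receive a low score, so the top-k sets favour words that are distinctive for a song.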
With the network in place, we can further analyze its intrinsic characteristics, such as the degree distributions.
Degree Distributions: These basic statistics of the network have been computed. The total degree, in-degree and out-degree distributions of all nodes are obtained and plotted below.
The first plot shows the total degree distribution, independent of edge direction. In general, most nodes have a degree between 3 and 15.
Secondly, the in-degree distribution shows by how many songs a given song has been influenced. Many songs have been influenced by no other song or by just a few, while a small proportion have been influenced by several, as expected.
When considering the out-degree distribution, we observed a large number of songs with a low out-degree. This is because, under our linking criterion, the most recently released songs cannot influence other songs. However, all songs with an out-degree above 8 are among the earliest releases, which are likely to have influenced the later ones.
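The three distributions described above can be sketched as follows. This is a minimal standard-library illustration, not the code used for the plots; the function name `degree_distributions`, the node labels and the toy edge list are hypothetical. Each edge is assumed to be a `(source, target)` pair pointing from the earlier song to the later one.

```python
from collections import Counter

def degree_distributions(nodes, edges):
    """Return histograms (degree value -> number of nodes) for a directed graph."""
    in_deg = dict.fromkeys(nodes, 0)
    out_deg = dict.fromkeys(nodes, 0)
    for src, dst in edges:
        out_deg[src] += 1   # src influences dst
        in_deg[dst] += 1    # dst is influenced by src
    total = {n: in_deg[n] + out_deg[n] for n in nodes}
    return {
        "in": Counter(in_deg.values()),
        "out": Counter(out_deg.values()),
        "total": Counter(total.values()),
    }

# Toy graph: labels encode a made-up release year.
toy_nodes = ["s1963", "s1965", "s1967", "s1969"]
toy_edges = [
    ("s1963", "s1965"),
    ("s1963", "s1967"),
    ("s1965", "s1967"),
    ("s1963", "s1969"),
]
dist = degree_distributions(toy_nodes, toy_edges)
```

In the toy graph the earliest song has the highest out-degree and an in-degree of zero, which mirrors the pattern discussed above for the oldest releases.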
To determine how many words should be selected by TF-IDF score when establishing the links between nodes, we prioritized the threshold that gave a higher modularity when finding communities of songs in our network, while retaining enough nodes and links to make the analysis of the network meaningful. This would correspond to the optimal point.
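The text does not state which modularity definition was used. Assuming the standard Newman modularity computed on the undirected version of the graph, the quantity being compared across thresholds would be:

```latex
Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)
```

Here $A_{ij}$ is the adjacency matrix, $k_i$ is the degree of node $i$, $m$ is the total number of edges, $c_i$ is the community assigned to node $i$, and $\delta(c_i, c_j)$ equals 1 when both nodes belong to the same community and 0 otherwise. A higher $Q$ means that more edges fall inside communities than would be expected if edges were placed at random with the same degrees.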
With this aim in mind, we created a loop to see how the size and shape of the network's GCC (giant connected component) evolve as the number of highest TF-IDF-scored words considered to link two songs increases. We started with a threshold of just the two highest TF-IDF-scored words of each song and increased it up to 9.
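A loop of this kind can be sketched as below. This is a simplified standard-library illustration under several assumptions: `ranked_words` (words per song already sorted by descending TF-IDF score), `release_order`, and the helper names are hypothetical, the GCC is taken as the largest weakly connected component, and the toy sweep runs over thresholds 1 to 3 rather than 2 to 9. The modularity computation is left out, since it requires a community detection step that is not shown here.

```python
from collections import defaultdict, deque
from itertools import combinations

def build_edges(ranked_words, release_order, k):
    """Link two songs if their top-k word lists share at least one word."""
    edges = []
    for a, b in combinations(ranked_words, 2):
        if set(ranked_words[a][:k]) & set(ranked_words[b][:k]):
            # Point the edge from the earlier release to the later one.
            src, dst = sorted((a, b), key=release_order.get)
            edges.append((src, dst))
    return edges

def giant_component(nodes, edges):
    """Return the node set of the largest weakly connected component."""
    neighbours = defaultdict(set)
    for src, dst in edges:
        neighbours[src].add(dst)
        neighbours[dst].add(src)
    seen, best = set(), set()
    for start in nodes:
        if start in seen:
            continue
        # Breadth-first search, ignoring edge direction.
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbours[node]:
                if nxt not in component:
                    component.add(nxt)
                    queue.append(nxt)
        seen |= component
        if len(component) > len(best):
            best = component
    return best

def sweep(ranked_words, release_order, thresholds):
    """Return (k, GCC nodes, GCC edges) for each threshold k."""
    rows = []
    for k in thresholds:
        edges = build_edges(ranked_words, release_order, k)
        gcc = giant_component(ranked_words, edges)
        gcc_edges = [e for e in edges if e[0] in gcc and e[1] in gcc]
        rows.append((k, len(gcc), len(gcc_edges)))
    return rows

# Toy data, invented for the example.
ranked_words = {
    "A": ["love", "girl", "night"],
    "B": ["help", "love", "day"],
    "C": ["sun", "day", "girl"],
    "D": ["rain", "moon", "sea"],
}
release_order = {"A": 1, "B": 2, "C": 3, "D": 4}
rows = sweep(ranked_words, release_order, range(1, 4))
```

On the toy data the GCC grows from a single node to three nodes and three edges as the threshold increases, which is the kind of trend the loop is meant to expose before weighing it against modularity.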
We observed that the largest modularity was found when considering just the three highest-scored words when linking songs (finding 9 communities). However, the GCC of this network had only 70 nodes and 117 links, losing more than half of the total number of songs. We found a major shift in the number of nodes and edges when considering 5 words, without losing too much in modularity score or number of communities. Therefore, based on the graphs and data plotted above, we set the threshold at 5 words to connect songs, achieving both a high modularity and a significant number of links and nodes in the GCC. Thus, a link was added whenever two songs shared at least one common word among their 5 highest TF-IDF-scored words.