The purpose of this project is to apply network science and basic natural language processing (NLP) methods to analyze the structure of South Park. The data is sourced from a fandom wiki and consists of episode transcripts and synopses. The goal is to see whether the structure obtained from this data can reveal interesting insights into the show, and whether that structure correlates with the most critically acclaimed seasons. In other words, has South Park changed, and has it been for the better?
This site is one of three parts to this project. The YouTube video below describes the overall idea - though it doesn't explain anything that isn't explained on this site as well.
The last part is a Jupyter notebook containing the code that generated all these results.
Find it here: Where My South Park Gone
The data is sourced from the South Park Wiki using its MediaWiki API. The data consists of episode synopses and transcripts (dialog and scene descriptions). The data is fan-made and fan-maintained, which can cause some data impurities, for instance misspelled, ambiguous or inconsistent names.
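As a rough illustration of what such a request could look like, the sketch below builds a query URL for the wikitext of a single page. The endpoint address and the page title are assumptions made for this example; the query parameters are the standard MediaWiki ones, but the notebook may well fetch the data differently.

```python
from urllib.parse import urlencode

# Assumption: the wiki exposes the standard MediaWiki endpoint at this address.
API_URL = "https://southpark.fandom.com/api.php"


def build_page_query(title):
    """Return a URL asking the MediaWiki API for the wikitext of one page."""
    params = {
        "action": "query",      # read-only query module
        "prop": "revisions",    # ask for page revisions ...
        "rvprop": "content",    # ... including their text content
        "rvslots": "main",      # the main content slot of the revision
        "titles": title,        # hypothetical page title
        "format": "json",
    }
    return f"{API_URL}?{urlencode(params)}"


# The title is only an example; the real transcript pages may be named differently.
print(build_page_query("Reverse Cowgirl"))
```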
There is no way around this when working with real data.
Left: season 1 episode list and synopses. Right: Script from the episode Reverse Cowgirl (season 16 episode 1)
From the transcripts of every episode from the first 21 seasons, there is a total of 4387 unique character names! Obviously this is a LOT more than the number of main and supporting characters on the show. About 82 % of all those characters have fewer than 10 dialog lines across all 21 seasons! Many of them are one-time characters, misspellings (like Cartma instead of Cartman - see last dialog line in image above) and generic names (like Man and Man 1). These script characters are considered outliers that are unnecessary for understanding the interactions on the show - so they will be removed.
Keeping the top 100 characters, in terms of dialog lines, still accounts for 70 % of all dialog spoken. This seems like a reasonable cut-off point.
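A cut-off like this can be checked with a few lines of code. The sketch below is a minimal illustration, assuming the transcripts have already been reduced to a list with one speaker name per dialog line; the function name and the toy data are made up for the example.

```python
from collections import Counter


def top_speakers(speakers, n=100):
    """Return the n most frequent speakers and the share of all lines they cover."""
    counts = Counter(speakers)
    top = counts.most_common(n)
    kept = [name for name, _ in top]
    total = sum(counts.values())
    coverage = sum(count for _, count in top) / total if total else 0.0
    return kept, coverage


# Toy data: one entry per dialog line (hypothetical, not the real counts).
lines = ["Cartman"] * 5 + ["Stan"] * 3 + ["Man"] + ["Cartma"]
kept, coverage = top_speakers(lines, n=2)
print(kept, coverage)
```

With the real data, `n=100` would give the characters that are kept, and `coverage` would be the share of dialog they account for.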
Even among the top 100 characters there are generic character names
One character stands out from the rest: Cartman. He has as many dialog lines as the 4th to 10th characters combined. He is followed by Stan and Kyle, who both have at least twice as much dialog as any character outside the top 3.
I define an interaction between two characters as occurring whenever they both have at least one line of dialog in the same scene. This definition is based on the data from the transcripts, which consist of two types of text: dialog and scene descriptions.
Even if characters have multiple lines of dialog in a scene together, it still only counts as one interaction. Multiple interactions between characters happen if they interact in many different scenes together.
Lines of dialog are separated by lines of scene descriptions (green). All characters speaking between two consecutive scene descriptions are said to have interacted with each other.
This is by no means a perfect definition. The purpose of the scene descriptions in the transcripts is to convey visual information that is lost without the episode video. Sometimes these descriptions describe actual scene changes, but other times they don't. They can describe characters moving around, changes in the environment (but still the same scene), characters' emotional states and expressions, etc.
But there is no reason to believe that these different kinds of scene descriptions aren't evenly distributed amongst characters. So I take this to be a fair approximation of the actual scenes occurring in the episodes.
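The interaction definition can be sketched in code as follows. This is only an illustration under assumptions: the transcript is taken to be a list of `(kind, speaker)` tuples, where `kind` is either `"scene"` or `"dialog"`. That representation and the function name are hypothetical, not the notebook's actual data format.

```python
from collections import Counter
from itertools import chain, combinations


def count_interactions(transcript, keep=None):
    """Count, per pair of characters, the number of scenes in which both speak.

    transcript: iterable of (kind, speaker) tuples, kind is "scene" or "dialog".
    keep: optional set of character names to restrict the count to.
    """
    counts = Counter()
    speakers = set()
    # A trailing scene marker flushes the last group of dialog lines.
    for kind, speaker in chain(transcript, [("scene", None)]):
        if kind == "scene":
            # Every pair that spoke since the previous description interacts once,
            # no matter how many lines each of them had.
            for pair in combinations(sorted(speakers), 2):
                counts[pair] += 1
            speakers = set()
        elif keep is None or speaker in keep:
            speakers.add(speaker)
    return counts


transcript = [
    ("scene", None),
    ("dialog", "Cartman"), ("dialog", "Kyle"), ("dialog", "Cartman"),
    ("scene", None),
    ("dialog", "Kyle"), ("dialog", "Stan"),
    ("scene", None),
    ("dialog", "Cartman"), ("dialog", "Kyle"),
]
print(count_interactions(transcript))
```

Pairs are stored as alphabetically sorted tuples, so each pair of characters has exactly one key regardless of who speaks first.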
Cartman, Stan and Kyle are still the top 3 characters, with the most interactions by far. But this time they are more similar to each other than was the case for dialog volume, where Cartman ruled supreme.
The interactions between characters can be represented as a network, with each character a node and links between nodes as interactions. The links have a weight corresponding to the number of interactions between those two nodes/characters.
The network consists of 100 nodes, 1972 edges and a total of 52720 interactions. Node size is proportional to number of interactions.
The interaction strength of a character is the total number of interactions that character has with all other characters.
The interaction reach of a character is the number of characters that character has at least one interaction with.
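In network terms, strength is the weighted degree of a node and reach is its plain degree. A minimal sketch, assuming the network is stored as a dictionary that maps each pair of characters (a tuple, listed once) to their number of interactions; the names and numbers are illustrative only:

```python
from collections import defaultdict


def strength_and_reach(weights):
    """Compute interaction strength and reach for every character.

    weights: dict mapping (character_a, character_b) -> number of interactions,
             with each undirected pair listed once.
    """
    strength = defaultdict(int)  # sum of link weights per character
    reach = defaultdict(int)     # number of distinct partners per character
    for (a, b), weight in weights.items():
        strength[a] += weight
        strength[b] += weight
        reach[a] += 1
        reach[b] += 1
    return dict(strength), dict(reach)


# Toy network, not the real interaction counts.
weights = {("Cartman", "Kyle"): 2, ("Kyle", "Stan"): 1}
strength, reach = strength_and_reach(weights)
print(strength)
print(reach)
```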
Network with nodes scaled according to interaction reach.
The network is fully connected and more even in terms of interaction reach than in terms of interaction strength. Cartman, Stan and Kyle still make up the top 3, with all of them connected to almost every other character. But they don't dominate this measure as much as they dominate interaction strength.
There are many ways to group nodes in a network into communities of similar nodes. But these methods can be split into two categories, based on whether they allow a single node to appear in only one community or in several. In this work I have looked at the former case, with nodes only belonging to one community. This is a simpler, and still useful approach, but obviously has some limitations. For example, Stan is part of his family (mother, father and sister), part of the gang of main characters (Cartman, Stan and Kyle), and part of many other groups on the show.
With this limitation the network achieves a modularity score of 0.156. The modularity score indicates how easy it is to split the network into self-contained communities. A score of 0.156 is rather low: values close to 1 indicate strong community structure, while values at or below 0 indicate that the communities are no more tightly knit than would be expected by chance.
The members of the dark blue community are shown on the right. This group contains Cartman, Stan and Kyle and the majority of the other children in their class. So despite the limitations of this community detection method, it has found a somewhat meaningful main community.
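To make the score more concrete, the sketch below computes the standard (Newman) modularity of a weighted, undirected network for a given partition: for each community, the share of link weight that falls inside it, minus the share expected if links were placed at random while keeping each node's strength. It assumes the same pair-to-weight dictionary as a network representation and a hypothetical node-to-community mapping; it does not show how the communities themselves were found.

```python
from collections import defaultdict


def modularity(weights, community):
    """Modularity of a partition of a weighted, undirected network.

    weights: dict mapping (node_a, node_b) -> weight, each pair listed once,
             no self-loops.
    community: dict mapping node -> community label.
    """
    total = sum(weights.values())          # total link weight m
    intra = defaultdict(float)             # weight of links inside each community
    strength = defaultdict(float)          # summed node strength per community
    for (a, b), w in weights.items():
        strength[community[a]] += w
        strength[community[b]] += w
        if community[a] == community[b]:
            intra[community[a]] += w
    # Q = sum over communities of (observed inside share - expected inside share)
    return sum(intra[c] / total - (strength[c] / (2 * total)) ** 2 for c in strength)


# Toy example: two triangles joined by a single link.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]
weights = {edge: 1 for edge in edges}
partition = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
print(modularity(weights, partition))
```

Putting all nodes in a single community gives a score of 0, which is one way to see why a score close to 0 indicates weak community structure.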
Sentiment analysis is primarily a way to determine whether a text is generally positive, neutral or negative. One way to achieve this is to assess the average happiness of each word in a text. A simple approach is to use a curated list of words that have been scored for average happiness (between 1 and 10, with 5 as neutral), and take the average over all the words in the text.
In this case, all dialog lines of each character are cleaned of English stop words and punctuation symbols and then concatenated. Then the entire dialog of a character is analyzed for sentiment.
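The cleaning and averaging steps could look roughly like this. The stop word list and the happiness scores below are tiny made-up stand-ins for the real lists, and words missing from the scored list are simply skipped, which is an assumption about how unknown words are handled.

```python
import string

# Illustrative stand-ins only: a real stop word list and scored word list are much larger.
STOP_WORDS = {"the", "a", "is", "and", "i", "you", "to"}
HAPPINESS = {"love": 8.0, "friend": 7.0, "hate": 2.0, "school": 5.0}


def mean_happiness(lines, lexicon=HAPPINESS, stop_words=STOP_WORDS):
    """Average happiness of all scored words in a character's dialog lines."""
    # Concatenate the lines, lowercase them and strip punctuation symbols.
    text = " ".join(lines).lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Drop stop words, then keep only words that have a happiness score.
    words = [word for word in text.split() if word not in stop_words]
    scores = [lexicon[word] for word in words if word in lexicon]
    return sum(scores) / len(scores) if scores else None


print(mean_happiness(["I love you, friend!", "I hate school."]))
```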
Sentiments of all dialog from Cartman, Stan and Kyle.
Cartman, Stan and Kyle all have very similar distributions with means slightly above neutral. One might expect Cartman, known for his foul language, to differ from the other two, but he doesn't. When looking at the average sentiment of all characters per season, there is little difference from season to season. It seems fair to conclude that the mean sentiment score of dialog isn't a useful metric for analyzing changes in seasons.
This is perhaps not that surprising given that most dialog has a utilitarian and informative nature. A character that predominantly uttered dialog with either positive or negative sentiment would be rather one-dimensional and most likely not a prominent character on a show like South Park.
Maybe a list of profanities (cuss/swear words) or a LIX count might have revealed something more interesting.
Up until season 14 there is a clear trend with Cartman mostly in the lead and Kyle and Stan fighting for 2nd place. Different supporting characters take up the last two spots. But from season 14, and especially in seasons 19-21, the pattern changes. The first spot is taken by a non-main character, Gerald, and Kyle and Stan leave the top 3.
Clearly the show wanted to change it up, with a lot of stories focusing on the supporting cast, but Cartman remains the voice of South Park.
This plot shows the characters with the greatest interaction strength - the most interactions with other characters - per season.
Cartman, Stan and Kyle fight over the top 3 spots throughout all seasons, with several supporting characters vying for the remaining top spots. The only exception is season 20, where Stan is replaced by Butters in the top 3.
This plot shows the characters with the greatest interaction reach - those who interacted with the most different characters - per season.
Cartman, Stan and Kyle are still in the top 3 for the majority of the seasons. But from season 19 and on, some supporting characters begin to move into the top 3 - namely Randy and Butters.
Centrality is a measure of how important a node is in a network - how central it is in connecting nodes with each other. There are many centrality measures, one being the eigenvector centrality. With this measure, a node's centrality is based on the centrality of the nodes it is linked to.
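Because each node's score depends on its neighbors' scores, eigenvector centrality is usually computed iteratively. The sketch below uses plain power iteration on the weighted network, again assuming a pair-to-weight dictionary as the representation. It is an illustration of the measure, not necessarily the routine used in the notebook; adding a node's own score in each step is a common trick to keep the iteration stable.

```python
import math


def eigenvector_centrality(weights, max_iter=1000, tol=1e-10):
    """Eigenvector centrality of a weighted, undirected, connected network.

    weights: dict mapping (node_a, node_b) -> weight, each pair listed once.
    """
    nodes = sorted({node for pair in weights for node in pair})
    neighbors = {node: {} for node in nodes}
    for (a, b), w in weights.items():
        neighbors[a][b] = w
        neighbors[b][a] = w

    scores = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(max_iter):
        # Each node keeps its own score and adds the weighted scores of its neighbors.
        new = dict(scores)
        for node in nodes:
            for other, w in neighbors[node].items():
                new[other] += w * scores[node]
        # Rescale so the scores do not grow without bound.
        norm = math.sqrt(sum(value * value for value in new.values()))
        new = {node: value / norm for node, value in new.items()}
        if sum(abs(new[node] - scores[node]) for node in nodes) < tol:
            return new
        scores = new
    raise RuntimeError("power iteration did not converge")


# Toy network: a triangle (a, b, c) with one extra node d attached to c.
toy = {("a", "b"): 1, ("b", "c"): 1, ("a", "c"): 1, ("c", "d"): 1}
centrality = eigenvector_centrality(toy)
print(sorted(centrality, key=centrality.get, reverse=True))
```

In the toy network, `c` comes out as the most central node because it is linked to every other node, while `d` is the least central because its only link is to `c`.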
This plot shows the most central characters per season.
The trend from the two previous plots remains: Cartman, Stan and Kyle are in the top 3 up until season 19, where other characters become more central. It is Randy and Butters again, joined by PC Principal in season 19, where he makes his first appearance on the show.
The trends have been clear, whether based on dialog volume, interaction strength and reach, or centrality:
There is a somewhat clear trend - with the modularity scores increasing through the seasons, especially from season 17. This indicates that the characters, based on interactions, are becoming easier to partition into distinct communities.
This change, as a possible explanation, is best exemplified by Gerald (the father of Kyle), who features heavily in season 20 and is the character with the most dialog in that season. He comes in 5th in interaction strength, but is in the top 5 for neither interaction reach nor centrality. I.e. he has a lot of dialog in general, quite a bit of it together with other characters, but not with a wide range of the 100 most significant characters.
This would support the increasing trend in modularity, with a greater focus on supporting characters in self-contained side stories.
Each episode has a synopsis consisting of a few lines of text, describing the overall plot and themes of the episode. Combining all episode synopses for a season results in a text that can be used to describe the themes of the entire season.
Word frequency is simply a count of the occurrences of each word in that text. Instead of just displaying the raw numbers, a wordcloud is a more visual way of showing the same information. The size of each word in the cloud represents its number of occurrences.
From season 1 to 19 the most frequent words are very generic ones: south, park, cartman, stan, kyle, boys, town. This is not surprising, since those terms are the most central to the show - with the first two being the name of the show and the town that the show is set in.
But seasons 20 and 21 clearly stand out. Gerald is the most frequent word in season 20, and "white" and "middle" are the most frequent in season 21. It is not possible to say too much from this, given that the synopses are very short. Additionally, they are authored by fans (most likely many different ones) and perhaps after 19 seasons, they felt it less necessary to repeat the name of the town/show frequently. But they stand out nonetheless.
The tf-idf score measures the distinctiveness of a word in a text, compared with other texts. In this case, a word is distinctive for a season if it occurs often in that season and infrequently in other seasons.
Creating wordclouds from the highest tf-idf scores results in season summaries. It is difficult to see any trends in the kinds of themes throughout the seasons - other than that they are very varied.
But seasons 20 and 21 stand out, as some of their words with the highest tf-idf scores are also their most frequent words (unlike the other seasons).
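There are several variants of tf-idf. The sketch below uses one common form, assumed here for illustration: term frequency is a word's share of the season's text, and inverse document frequency is the logarithm of the number of seasons divided by the number of seasons containing the word. The toy synopses are invented and the tokenization is deliberately naive.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Compute tf-idf scores per document.

    documents: dict mapping a document name (e.g. a season) -> its text.
    Returns a dict mapping name -> {word: score}.
    """
    tokens = {name: text.lower().split() for name, text in documents.items()}
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter()
    for words in tokens.values():
        doc_freq.update(set(words))
    n_docs = len(tokens)

    scores = {}
    for name, words in tokens.items():
        counts = Counter(words)
        scores[name] = {
            word: (count / len(words)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
    return scores


# Invented toy synopses, one text per season.
seasons = {"s1": "south park cartman", "s2": "south park gerald gerald"}
scores = tf_idf(seasons)
print(scores["s2"])
```

Words that occur in every season, like the name of the show in the toy data, get a score of 0 with this variant, which is why the generic words drop out of the tf-idf wordclouds.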
A lot of different factors affect the rating of an episode and ultimately the rating of a season. It is very unlikely that the factors looked at in this project are enough to explain the rating trend in great detail.
However, seasons 20 and 21 are the lowest rated seasons of South Park. And season 19, although a local maximum, is part of the lower half of seasons overall. And these last three seasons are the ones that have stood out in this project. This is where South Park has experimented with new formats for the show - with a greater focus on side stories featuring supporting characters (like Gerald and Randy), recurrent themes throughout a season and multi-part episodes.
These are the changes that have been picked up here, using basic network science and NLP methods.
The creators of South Park have stated that they want to return to the standalone episode format that South Park has predominantly used. That might just be what South Park needs.