This project began with a simple question: Why do people love Harry Potter so much? As fans ourselves, we wanted to look at what made these books into a phenomenon. To that end, we turned to Goodreads, an Amazon service that allows users to track what they're reading and to post reviews of books. Using the Goodreads API, we extracted the top 100 comments for every Harry Potter book and began to look at the data. We hoped to find some trend in these reviews that showed us which aspects of the books drew readers' praise and admiration.
We began by trying to determine if there were any themes being highlighted in the comments. We thought perhaps we would be able to identify, across the series, a pattern in the elements that drew readers to the books. In particular, based on our familiarity with the texts, we thought the novelty of the magical world J.K. Rowling invented might be one of the major draws.
Using the Voyant terms widget, a high-level tool for viewing the most used tokens in a corpus, we started looking for words that were used with high frequency across the comments for the entire series.
In analyzing these high-frequency terms, we identified a handful that seemed to suggest different aspects of the books. For instance, "characters" was the fifteenth most used token after removing stopwords and "world" was nineteenth. These suggested that perhaps readers were most drawn to the characters and the world-building of the series, which aligned with our initial hypothesis. We also identified a handful of words that seemed to refer to themes in the novels themselves, perhaps suggesting what about the world drew readers in, namely: "life" [38], "school" [49], "magic" [58], "friends" [60], and "dark" [72].
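The ranking step can be sketched in a few lines of Python. This is only an illustration of counting tokens after stopword removal, not a reproduction of Voyant's internals: the stopword list, the tokenizing regular expression, and the sample comments are all assumptions made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; Voyant uses its own, longer list).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it",
             "i", "this", "that", "was", "are"}

def rank_terms(comments, stopwords=STOPWORDS):
    """Return (term, count, rank) tuples, most frequent first, stopwords removed."""
    counts = Counter()
    for comment in comments:
        tokens = re.findall(r"[a-z']+", comment.lower())
        counts.update(t for t in tokens if t not in stopwords)
    return [(term, count, rank)
            for rank, (term, count) in enumerate(counts.most_common(), start=1)]

# Hypothetical comments standing in for the Goodreads data.
sample_comments = [
    "I love the characters and the world of this book.",
    "The characters are wonderful, and the magic is fun.",
    "Great characters, great world.",
]
ranking = rank_terms(sample_comments)
for term, count, rank in ranking[:3]:
    print(rank, term, count)
```

Ties in the ranking fall back to the order in which terms were first seen, so ranks among equally frequent terms should not be over-interpreted.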
Having identified these five terms, we wanted to see if they were unique to the Harry Potter series, or whether their association with the books was part of a larger trend in publishing at the time the Harry Potter books came out. To determine whether these themes were part of the novelty of Rowling's universe, we sought to compare them with the themes of other popular books that came out before, during, and after the Harry Potter series. Was Harry Potter perhaps among the vanguard of dark books about life, friendships, magic, and schools? Turning once more to the Goodreads API, we extracted the book descriptions of the 100 most popular books on the website for every publication year between 1987 and 2016. This gave us about ten years of data on either side of the series, which was published between 1997 and 2007. While imperfect, we hoped to use the descriptions as a proxy for the content of the books, which were unavailable for analysis due to copyright law. We then analyzed the usage trends of the five words we had identified from the Harry Potter comments to see if there was an increase in their usage in the descriptions during and after the Harry Potter books. If we were correct in assuming that Harry Potter was unique at the time it was released, we anticipated seeing a low usage of at least some of these terms before 1997 and relatively high usage after, given the popularity and success of Harry Potter. (Given the size of this corpus, we have only embedded a still image of the widget showing the trends for these words.)
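For readers who want to reproduce a trend line like this outside of Voyant, the following is a minimal sketch. It assumes the descriptions have already been grouped by publication year in a dictionary; the function name, the tokenizer, and the sample descriptions are hypothetical, and relative frequency (occurrences per token) is one reasonable normalization rather than necessarily the one the widget uses.

```python
import re
from collections import Counter

TERMS = ["life", "school", "magic", "friends", "dark"]

def yearly_term_frequencies(descriptions_by_year, terms=TERMS):
    """Map each year to {term: occurrences per token} over that year's descriptions."""
    trends = {}
    for year, descriptions in sorted(descriptions_by_year.items()):
        tokens = []
        for text in descriptions:
            tokens.extend(re.findall(r"[a-z']+", text.lower()))
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero for an empty year
        trends[year] = {term: counts[term] / total for term in terms}
    return trends

# Hypothetical descriptions standing in for the Goodreads data.
sample = {
    1996: ["A dark tale of school life.", "A detective story."],
    1998: ["Magic and friends at a school of magic."],
}
trends = yearly_term_frequencies(sample)
for year, row in trends.items():
    print(year, {term: round(freq, 3) for term, freq in row.items()})
```

Dividing by the number of tokens keeps years with longer descriptions from looking artificially heavy in any one term.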
From this plot, we can see that prior to 1997 (11 on the x-axis) there tended to be a great deal of variation in the usage of these terms, with some years having huge peaks and others no use at all. Following 1997, however, the terms began to be used more consistently, a trend we see continue after 2007 (21 on the x-axis). While this might support the idea that the success of Harry Potter led to these themes becoming a staple of popular books in the twenty-first century, it does refute the idea that they were a novel introduction in 1997. In fact, both "school" and "magic," the two terms we anticipated being linked most strongly to Harry Potter's debut, were used more frequently in some of the years before the release of the first Harry Potter book than in the years immediately after.
Since looking at the combined corpus had not revealed any powerful insights into why Harry Potter fans loved the series so much, we decided to slice the corpus by book. We wanted to see if differences in each book's popularity with the fan base, and in the themes commented on for that book, would reveal what aspects fans liked the most. Using the scikit-learn Python package, we vectorized the comments for each book and then plotted a dendrogram that clustered the books based on the cosine distance between each vector.
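A sketch of this kind of pipeline is shown here. Several details are assumptions rather than a record of what we ran: TF-IDF weighting, average linkage, and SciPy for building the tree are choices made for the example, and the book titles and comments are placeholders.

```python
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_books(comments_by_book):
    """Cluster books by cosine distance between their vectorized comments."""
    titles = list(comments_by_book)
    # One document per book: all of that book's comments joined together.
    documents = [" ".join(comments_by_book[t]) for t in titles]
    vectors = TfidfVectorizer(stop_words="english").fit_transform(documents)
    # Condensed pairwise cosine distances, then average-linkage clustering
    # (the linkage method is an assumption for this sketch).
    distances = pdist(vectors.toarray(), metric="cosine")
    tree = linkage(distances, method="average")
    # no_plot=True returns the leaf layout without requiring matplotlib.
    layout = dendrogram(tree, labels=titles, no_plot=True)
    return tree, layout

# Placeholder comments standing in for the Goodreads data.
sample = {
    "Book 1": ["magical start wonderful school", "wonderful magical school"],
    "Book 2": ["magical school wonderful adventure"],
    "Book 3": ["dark tournament dragons"],
    "Book 4": ["dark dragons tournament danger"],
}
tree, layout = cluster_books(sample)
print(layout["ivl"])  # leaf labels in dendrogram order
```

Dropping `no_plot=True` (with matplotlib installed) draws the dendrogram itself. Each row of `tree` records one merge, so the earliest rows show which books the algorithm considered most alike.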
Through this unsupervised learning process, the algorithm ultimately grouped the comments by the order in which the books were published. Readers seem to speak most similarly about the first two, last two, and middle three books. Interestingly, the last two books were grouped more closely with the first two than with the middle three, suggesting a similarity between the beginning and ending of the series. At first, we hypothesized this might be because comment writers were more likely to review the entire series in a comment on either the first or last book, but this would not explain the clustering of Harry Potter and the Chamber of Secrets (book two) and Harry Potter and the Half-Blood Prince (book six) with books one and seven. Rather, it seems more probable that commenters simply felt differently about the books in the middle, especially Harry Potter and the Goblet of Fire.
Having clustered the books, we then needed to see if these clusters corresponded to the popularity of the books. To do that, we turned to the Voyant StreamGraph widget, which allowed us to compare the cumulative frequencies of words. Specifically, we compared "like," "love," "favorite," "best," "wonderful," "awesome," and "incredible" -- words whose use could serve as a proxy for the popularity of the book in the comments.
The StreamGraph does seem to reflect a slight contraction in the middle that corresponds to the groupings created by the clustering algorithm. However, the lower popularity of these middle books based on the comments does not correspond to the actual ratings of the books on Goodreads shown below.
Given the ambiguity of identifying the relative popularity of each of the books, along with the fact that all seven books had high ratings, we decided it would be more fruitful to instead look at the themes of the clusters of books to see if different books were preferred for different reasons. To this end, we decided to employ topic modeling for the comments in each of the two clusters, using the package gensim.
For the first cluster, which contained the first and last two books of the series, many topics are fairly vague. Most contain words related to reading and books generally. Many have positive words associated with them, again telling us that a major theme in the comments is approval of the series, but most interestingly the topics often heavily feature characters and their attributes.
For the second cluster, the topics are fairly similar to the first, though the characters mentioned change. For example, rather than Lockhart (in topic 17 of cluster 1) we see a topic (10) that appears to revolve around Sirius Black, a character introduced in The Prisoner of Azkaban.
From our topic analysis, it became clear that the characters in the books were a primary focus of the comments. This result aligned with the high prevalence of the word "characters" in our surface-level analysis of the comments using the terms widget. From this investigation, we developed a new hypothesis: that connections to the characters themselves are what drive the high level of engagement with the Harry Potter books.
Given the centrality of the characters to our new understanding of fans' love for the Harry Potter series, we decided to look at the connections between characters in Harry Potter and individual books. Given that we earlier defined our corpus of Goodreads comments as a proxy for people's opinion of the books, we could now be said to have begun investigating people's opinions of particular characters with respect to each individual book. Specifically, based on investigations using the Voyant widget Mandala, we originally hypothesized that characters would be most positively portrayed in the book to which they were closest in the Mandala.
The Mandala is a high-level Voyant widget in which given search terms act as magnets and pull documents closer to them based on the relative frequency of the search term in the document. We interpreted this closeness as a proxy for how important fans of the Harry Potter series thought a particular character was to a particular book. Since all five of these characters are also very popular, we guessed that when people were talking about them most (and therefore thought them most important to a particular book), they would be talking about them most positively.
Harry (Potter)*, Ron (Weasley), and Hermione (Granger) were chosen as search terms because readers tended to identify most strongly with them. (Albus) Dumbledore was chosen because he is both a mentor to Harry and a character who drifted from being entirely 'good' to a more morally ambiguous position, which generated a great deal of discussion; similarly, (Severus) Snape begins the series as an antagonistic figure who is later portrayed as morally ambiguous.
*We used the names that, based on our knowledge of the Harry Potter fandom, would have been most likely to be used in reviews as the 'magnets' -- i.e. 'Dumbledore' instead of 'Albus' and 'Harry' rather than 'Potter'.
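The quantity said to drive the Mandala, the relative frequency of a search term in a document, can be computed directly. The sketch below is not Voyant's layout algorithm; it only tabulates relative frequencies and reads off, for each character, the book where that frequency is highest, and for each book, the character with the highest frequency. The function names and the sample text are hypothetical.

```python
import re

def relative_frequencies(text_by_book, terms):
    """Return {book: {term: occurrences per token}}; matching is case-insensitive."""
    table = {}
    for book, text in text_by_book.items():
        tokens = re.findall(r"[a-z]+", text.lower())
        total = len(tokens) or 1  # avoid dividing by zero for an empty document
        table[book] = {t: tokens.count(t.lower()) / total for t in terms}
    return table

def strongest_book(table, term):
    """Book in which the term has its highest relative frequency."""
    return max(table, key=lambda book: table[book][term])

def strongest_term(table, book):
    """Term with the highest relative frequency within one book."""
    return max(table[book], key=table[book].get)

# Hypothetical review text standing in for each book's combined comments.
sample = {
    "Book 4": "Hermione shines. Hermione and Harry attend the ball.",
    "Book 7": "Harry and Ron hunt. Harry wins.",
}
table = relative_frequencies(sample, ["Harry", "Ron", "Hermione"])
print(strongest_book(table, "Ron"))     # the book where Ron is mentioned most
print(strongest_term(table, "Book 7"))  # the most-mentioned character in Book 7
```

The two lookups answer different questions, so a book can be the strongest one for a character without that character being the strongest one for the book.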
Before performing sentiment analysis to test our hypothesis, we needed to extract a suitable sub-corpus of sentences from our larger corpus of Goodreads comments. We were only interested in sentences that mentioned our characters of interest: Harry Potter, Ron Weasley, Hermione Granger, Severus Snape, and Albus Dumbledore. We therefore created a program which, after tokenizing a document by comment and each comment by sentence, extracted any sentence containing a given character's name and appended it to a list containing all sentences mentioning that character in that document.
Ultimately, each character was associated with seven lists, where each list was associated with one of the seven Harry Potter books and contained the sentences mentioning the given character from the book's Goodreads comments. With this corpus assembled, it was now possible to test our hypothesis.
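The extraction step described above can be sketched as follows. The regular-expression sentence splitter is a simplified stand-in for a proper sentence tokenizer, whole-word matching on the name is a design choice made for the example, and the sample comments are invented.

```python
import re

CHARACTERS = ["Harry", "Ron", "Hermione", "Snape", "Dumbledore"]

def split_sentences(comment):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", comment) if s.strip()]

def sentences_by_character(comments_by_book, characters=CHARACTERS):
    """Return {character: {book: [sentences mentioning that character]}}."""
    result = {name: {book: [] for book in comments_by_book} for name in characters}
    for book, comments in comments_by_book.items():
        for comment in comments:
            for sentence in split_sentences(comment):
                for name in characters:
                    # Whole-word match, so "Ron" matches "Ron's" but not "strong".
                    if re.search(rf"\b{re.escape(name)}\b", sentence):
                        result[name][book].append(sentence)
    return result

# Invented comments standing in for the Goodreads data.
sample = {
    "Book 1": ["Harry is brave. I adore Hermione! The plot drags."],
    "Book 4": ["Ron was rude to Hermione at the ball."],
}
corpus = sentences_by_character(sample)
print(corpus["Hermione"])
```

A sentence that names two characters lands in both characters' lists, which matters later because its sentiment score then counts toward each of them.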
To test our hypothesis, we first needed the average sentiment of the sentences in each book's Goodreads comments that referenced a particular character. We used the 'Vader' sentiment analysis package specifically for its 'compound' score, which we thought best encapsulated the overall sentiment of a sentence on a scale running from negative through neutral to positive. We computed this score for each sentence containing a particular character's name and averaged the scores together.
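The averaging step is sketched below with the scoring function passed in as a parameter, so the example runs without any sentiment package installed. The `toy_score` function is a deliberately crude stand-in invented for the example; with the vaderSentiment package, the scorer would instead be something like `lambda s: analyzer.polarity_scores(s)["compound"]` for a `SentimentIntensityAnalyzer` instance named `analyzer`. The nested input layout and the sample sentences are assumptions.

```python
from statistics import mean

def average_sentiment(sentences, score):
    """Mean of score(sentence) over the sentences, or None if there are none."""
    if not sentences:
        return None
    return mean(score(s) for s in sentences)

def sentiment_table(sentences_by_character, score):
    """Return {book: {character: average score}} from {character: {book: sentences}}."""
    table = {}
    for character, by_book in sentences_by_character.items():
        for book, sentences in by_book.items():
            table.setdefault(book, {})[character] = average_sentiment(sentences, score)
    return table

def toy_score(sentence):
    """Crude stand-in for a real sentiment scorer; returns a value in [-1, 1]."""
    words = [w.strip(".,!?") for w in sentence.lower().split()]
    positive = sum(w in {"love", "brave", "wonderful"} for w in words)
    negative = sum(w in {"rude", "hate", "boring"} for w in words)
    total = positive + negative
    return 0.0 if total == 0 else (positive - negative) / total

# Invented sentences standing in for the extracted sub-corpus.
sample = {
    "Harry": {"Book 1": ["Harry is brave.", "I love Harry!"], "Book 4": []},
    "Ron": {"Book 1": [], "Book 4": ["Ron was rude."]},
}
table = sentiment_table(sample, toy_score)
print(table)
```

Rows of the resulting table are books and columns are characters; a cell is `None` when no sentence in that book's comments mentions the character, which keeps missing data distinct from a neutral score of zero.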
The end results of this sentiment analysis are the dataframe shown below, where the rows represent the book from which the comments came and the columns represent our five characters of interest, and the graph to its right.
Our original hypothesis was that characters would be most positively portrayed in the book they were closest to in the Mandala widget, because we took "closeness" to be a proxy for the fans' opinion of how important a character was to a particular book. The following table shows the significant discrepancies between this hypothesis and the data we collected:
This table, and careful consideration of the graph, shows that our hypothesis was disproven: characters are not portrayed most positively in the comments of the book they were closest to in the Mandala. However, these figures also show that early books (in particular, Harry Potter and the Sorcerer's Stone) were uniformly viewed more positively than later books. This could be due to readers' nostalgia for the beginning of an era, or simply because Harry Potter and the Sorcerer's Stone is less morally complicated than later books in the series, meaning that characters are more likely to be thought of as unambiguously 'good'.
In any case, it seems as if numerical analysis across books might be invalid without some kind of normalization process, perhaps by computing the average sentiment within one book and then charting how many standard deviations each character's score lies from that book's average.
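One way such a normalization could look is sketched here. It is a proposal, not something we ran: the scores are invented for the example, and using the population standard deviation of the characters' averages within each book is one choice among several.

```python
from statistics import mean, pstdev

def normalize_within_book(scores_by_book):
    """Convert each character's score to standard deviations from its book's mean."""
    normalized = {}
    for book, scores in scores_by_book.items():
        values = list(scores.values())
        center = mean(values)
        spread = pstdev(values)  # population standard deviation within the book
        normalized[book] = {
            character: 0.0 if spread == 0 else (value - center) / spread
            for character, value in scores.items()
        }
    return normalized

# Invented average sentiment scores; the second book is uniformly lower.
sample_scores = {
    "Book 1": {"Harry": 0.5, "Ron": 0.3, "Hermione": 0.4},
    "Book 7": {"Harry": 0.2, "Ron": 0.0, "Hermione": 0.1},
}
z_scores = normalize_within_book(sample_scores)
for book, row in z_scores.items():
    print(book, {name: round(z, 2) for name, z in row.items()})
```

In this invented case the raw scores differ between the two books, but the normalized scores are identical, because each character sits in the same position relative to the book's average. That is the effect the normalization is meant to have: it removes a book-wide shift in tone before characters are compared across books.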
One of the most interesting observations to come out of our analysis of the 'Mandala' widget, however, was not at all related to our original hypothesis. Rather, we found that, for the most part, the pairing of each book with its closest character (the character that reviewers thought was most important to the plot of the book) overlapped significantly with the pairing of each character with the book most important to that character. Even theoretically, this needn't have been the case, and the exception is illustrated in our dataset by Ron Weasley. Using closeness on the 'Mandala' as a proxy for reviewers' impressions of importance or relevance, the book that reviewers seem to think is most relevant to Ron's character is Harry Potter and the Deathly Hallows. However, by that same metric, he's certainly not the character within Harry Potter and the Deathly Hallows whom fans thought was most important to the plot: that role goes to Harry Potter, the titular protagonist.
In fact, this discovery in and of itself, although interesting, was not particularly surprising, given that Harry Potter is the series' protagonist. It was surprising, however, that this pattern of Harry Potter being more important to the plot than any other character didn't hold true for all seven books. Specifically, it does not hold true for Harry Potter and the Goblet of Fire, Harry Potter and the Order of the Phoenix, and Harry Potter and the Half-Blood Prince, where Hermione Granger, Severus Snape, and Albus Dumbledore, respectively, are shown to be most important to each book's plot by the metric of being closest to it on the 'Mandala.' But why?
In the next section, we will examine why reviewers of Harry Potter and the Goblet of Fire were particularly interested in Hermione Granger by using the Voyant widget 'Context' to examine the events, settings, and people which reviewers associated with her.
Apart from a general appreciation for Hermione as both a relatable female character and as a steadfast friend to Harry Potter and several very positive reactions to her creation of S.P.E.W., the Society for the Protection of Elvish Welfare, reviewers mentioning Hermione seem to concentrate on the Yule Ball subplot.
Generally speaking, the Yule Ball is important to Hermione's character because, for the first time in the series, it gives her a chance to shine outside of her role as the smart, sensible, responsible one of the Golden Trio. When Hermione dresses up and attends the Yule Ball as the date of Viktor Krum, an international Quidditch star of whom she seems genuinely fond, she asserts her individuality, her right to be feminine as well as intelligent, and the fact that Ron and Harry are not her entire world. Ron and Harry, predictably, are shocked. This scene, then, represents a cathartic moment for fans who had disliked the way in which Ron and Harry had, in previous books, seemed to take Hermione's willingness to help them for granted and reduced her to her role as The Researcher and The Smart One.
It also serves as a turning point in her relationship with Ron Weasley, which is what most reviewers commented on. Specifically, the Yule Ball serves as the catalyst for Ron Weasley's realization that he has romantic feelings for Hermione. From this point on, romantic tension becomes an integral part of Hermione and Ron's relationship, and given that Ron and Hermione are an incredibly popular 'ship', or romantic pairing, it makes a great deal of sense that the inciting event of their romance would receive attention from reviewers. However, Ron's realization of his budding romantic feelings manifests itself as intense jealousy of Hermione's date and rudeness to Hermione about her choice to attend the Ball with another Triwizard Champion. As such, despite acknowledging the Yule Ball as the source of Ron and Hermione's romance, reviewers generally view Ron's behavior negatively and have a great deal of sympathy for Hermione.