How much can we learn about a field through text analysis?
In my previous response blog post, I cross-analyzed two corpora: the Harry Potter book series and a compilation of fan-fiction based on the original work. In this blog post, I will summarize the analysis that my partner, Juan Piñeros, and I have come up with while text mining a corpus of journal articles around a specialized knowledge domain. Having engaged with both fiction and nonfiction bodies of work, I conclude by reflecting on both experiences.
Our plan was to choose a field in the sweet spot of being unfamiliar but not extremely technical. Psychology seemed to match what we were looking for, and within the field, psycholinguistics stood out to us as particularly compelling. Using AntCorGen, a tool that extracts articles from the open-access journal PLOS One, we created a corpus of 135 files and loaded it into AntConc. Apart from the high frequency of words such as “psychology” and “language”, the results did not entice us to continue working with this corpus. After carefully considering the other psychology subtopics, we settled on the field of relaxation studies. Although the topic was relatively familiar, we had never explored academic articles written about it and therefore decided to give it a go.
AntCorGen generated a 111-file corpus for us. We used a stop-word list to filter out common function words, leaving the more informative content words. From the Word List and Concordance Plot tools, we can see that the two most frequent words, “participants” and “study”, are almost evenly dispersed throughout the corpus, hinting at the type of research methodology many of the articles might have used to establish their conclusions. An intriguing result was the high rank of the word “music” in the word list, above even that of the word “relaxation”. At first, we speculated that this might be one of those cases in which a word’s frequency is localized in a few files within the corpus. However, the word’s concordance plots suggested otherwise: “music” is mentioned in 33 files, 15 of which have a relatively high number of hits.
The words "participants" and "study" are at the top of the most frequent words after filtering out function words. "Music" surprisingly ranks 8th in frequency, a position higher than that of the corpus's topic name, "relaxation".
Concordance plots of "study".
Concordance plots of "participants".
Concordance plots of "music".
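For readers who want to reproduce this step outside AntConc, here is a minimal sketch in Python of the two checks described above: counting words after removing stop words, and checking how a word's hits are spread across files. The stop-word set, the toy documents, and the function names are all assumptions made for illustration; they are not the list or the corpus we actually used.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the list used in our analysis is not reproduced here.
STOP_WORDS = {"the", "a", "of", "and", "in", "to", "was", "were", "is"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def word_frequencies(docs, stop_words=STOP_WORDS):
    """Count every token that is not a stop word, across all documents."""
    counts = Counter()
    for text in docs.values():
        counts.update(t for t in tokenize(text) if t not in stop_words)
    return counts

def hits_per_file(docs, word):
    """Return {filename: hits} for each file that contains the word at least once."""
    hits = {}
    for name, text in docs.items():
        n = tokenize(text).count(word)
        if n:
            hits[name] = n
    return hits

# Toy stand-in for the corpus (invented sentences, not real article text).
docs = {
    "a.txt": "The participants listened to music. Music was relaxing.",
    "b.txt": "Participants in the study reported relaxation.",
    "c.txt": "The study measured anxiety in participants.",
}

freq = word_frequencies(docs)
print(freq.most_common(3))
print(hits_per_file(docs, "music"))
```

A word that shows up in many files with a steady number of hits, like “participants” in this toy example, is spread across the corpus; a word whose hits all come from one file is localized.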
To contextualize this frequency, we used the Clusters/N-Grams tool to reveal the most common two-word clusters containing “music”. Among the most prominent were “music therapy”, “music perception”, and “music cognition”. The Collocates tool showed interesting words occurring in proximity to “music”, including “psychomusicology”, “psychophysical”, “neurochemistry”, “group”, “tempo”, “genres”, “anxiety”, and “hormonal”. This led us to hypothesize that a good portion of the corpus investigates, most likely through controlled experiments and group conditioning, the effects of different genres of music on “psychophysical” health.
Using the Clusters/N-Grams tool on the word "music".
Interesting collocates for "music".
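The cluster and collocate searches above can be approximated with a short Python sketch. It assumes the search word sits at the left edge of each cluster and that collocates are counted in a fixed window on both sides; the settings we used in AntConc may have differed, and the sample sentence is invented for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def clusters(tokens, node, size=2):
    """Count n-grams of the given size that start with the node word."""
    counts = Counter()
    for i in range(len(tokens) - size + 1):
        if tokens[i] == node:
            counts[" ".join(tokens[i:i + size])] += 1
    return counts

def collocates(tokens, node, span=5):
    """Count words found within `span` tokens to the left or right of the node word."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - span):i]
            right = tokens[i + 1:i + 1 + span]
            counts.update(left + right)
    return counts

# Invented sample text; sentence boundaries are ignored in this sketch.
text = ("Music therapy reduced anxiety. "
        "Group music therapy and music perception were studied.")
tokens = tokenize(text)

print(clusters(tokens, "music").most_common())
print(collocates(tokens, "music", span=2).most_common(3))
```

Raw counts like these favor common words, so a collocate list is usually sorted by an association statistic instead; that refinement is left out here to keep the sketch short.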
One journal article focuses so heavily on ASMR that it drives the word to the top of the word list, another indication of the corpus’s possible interest in the relationship between sound and relaxation.
Frequency of "asmr".
Concordance plot of "asmr" reveals its frequency distribution is localized.
Having used the same analysis approach and software tools on two different corpora, fiction and nonfiction, it is interesting to think about some of the glaring differences that emerged. One of the main distinctive properties of nonfiction text is its impersonal tone, which can be detected here with AntConc's Keyword List tool. This tool shows the words that are unusually frequent in a target corpus relative to a reference corpus. Using the original Harry Potter books as our target corpus and the relaxation studies corpus as our reference, we see that personal pronouns such as “he”, “she”, “you”, “I”, and “his” are distinctive of the fiction texts.
Keyword List tool with HP books as the target corpus and the relaxation studies corpus as the reference corpus.
Reversing the target and reference corpora, personal pronouns disappear from the keyword list, and the only word used in reference to people is “participants”, affirming the genre's impersonality. A possible difficulty with corpus generation tools such as AntCorGen, which draw on journals that classify research articles by keywords, is that they may extract articles that deviate greatly from the specialized field one intends to explore through text mining. A good example is a similar analysis by my colleagues, Mia Landeck and Ayarush Paudel, whose astronomy corpus yielded results that diverge greatly from the corpus topic itself. With fiction texts, this situation is probably less likely, because a corpus tends to be held together by a single author, as with J. K. Rowling's Harry Potter books, or by a unified theme, as with the canon’s fan-fiction texts.
Keyword List tool with HP books as the reference corpus and the relaxation studies corpus as the target corpus.
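The idea behind the keyword comparison can be sketched in Python using log-likelihood, a statistic commonly used to score keyness. Whether it matches the exact measure and thresholds set in our copy of AntConc is an assumption, and the word counts below are invented toy numbers, not figures from either corpus.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """Log-likelihood keyness score for one word.

    a, b: hits in the target and reference corpora
    c, d: total token counts of the target and reference corpora
    """
    e1 = c * (a + b) / (c + d)  # expected hits in the target
    e2 = d * (a + b) / (c + d)  # expected hits in the reference
    ll = 0.0
    if a:
        ll += a * math.log(a / e1)
    if b:
        ll += b * math.log(b / e2)
    return 2 * ll

def keywords(target, reference):
    """Rank words that are relatively more frequent in the target than in the reference."""
    c, d = sum(target.values()), sum(reference.values())
    scored = []
    for word, a in target.items():
        b = reference.get(word, 0)
        if a / c > b / d:  # keep only words overused in the target
            scored.append((word, log_likelihood(a, b, c, d)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Invented toy counts standing in for the two corpora.
fiction = Counter({"he": 50, "she": 30, "said": 20, "the": 100})
papers = Counter({"participants": 40, "study": 30, "music": 28, "he": 2, "the": 100})

print([w for w, _ in keywords(fiction, papers)])
print([w for w, _ in keywords(papers, fiction)])
```

Swapping the two arguments mirrors reversing the target and reference corpora: pronouns surface in one direction and words like “participants” in the other, while a word used at the same rate in both, such as “the” in this toy data, appears in neither list.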
In our exploration of the academic literature on the psychological study of relaxation, our text-mining tools gave us a decent idea of the focus of a large number of journal articles written on the topic. If this approach yields accurate results, it could be an appropriate starting point for people who do not have the bandwidth to closely examine hundreds of studies but want to break into an area of specialized knowledge, gauge the general trend of the current literature in that area, or develop a research question based on what has already been studied. However, since journal classification methods can be imprecise, one should proceed with caution, as the conclusions drawn from such an analysis could be skewed by how the corpus of interest was initially compiled.
Date: 4th October 2021