Text analysis – whether done using qualitative or quantitative methods – is a way to systematically review written material as part of your research. In this video, I will discuss what we mean by “texts” in political science, how we can use qualitative coding to break a text down into relevant parts, and how we can use machine learning to speed up that process for large numbers of texts. I’ll conclude with a reminder that this type of analysis is not always the best way to understand texts and documents.
Text analysis is a way to categorize and extract additional meaning from a large amount of written information by systematically breaking documents down into their parts.
For the purposes of this type of research, a text can be just about anything. It could be your own meeting notes, a historical figure’s journal entries, social media posts, or written descriptions of images. And you can find this type of material in many different places – archives, the web (via scraping), existing datasets that include written documents – or you can generate it yourself.
Sample
To be systematic in your approach to analyzing these documents, you need to be systematic in how you obtain them. We look at this the same way we decide how to conduct a survey – identifying the population, the frame, and the sample. After identifying the overall type of documents you will study, you need to systematically search for them. For example, if you want to analyze news articles, you need to decide which databases you will search and what date range you want to cover. Finally, you need to decide how to take a sample. Often in text analysis, you want to do a census: you want to read all the journal entries or meeting notes. But if you are studying social media posts, you may want to start with a random sample of messages. You can conduct qualitative coding on a small sample in order to train your computer to analyze them all.
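As a concrete illustration, here is a minimal Python sketch of drawing a reproducible random sample of posts for hand-coding. The list of posts and the sample size are hypothetical, not from any real project.

```python
import random

# Hypothetical collection of social media posts (illustrative only).
posts = [f"post {i}" for i in range(10_000)]

random.seed(42)  # fixed seed so the same sample can be drawn again
pilot_sample = random.sample(posts, k=200)  # sampling without replacement

print(len(pilot_sample))
```

Fixing the random seed matters for replication: another researcher running the same script gets the same 200 posts to code.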
Coding Text
Text analysis often comes down to coding text. What this means is identifying words or phrases that match the variables you are interested in studying and then seeing how often they appear in the text. There are two ways to do this – a top-down and a bottom-up approach. Both require that you know the variables you are studying, but they differ in how you define them. The top-down approach involves thinking through all the relevant concepts that might relate to your variables. This works well for large volumes of short texts. For example, in its study of the history of the Black Lives Matter hashtag, Pew Research Center could easily assume a significant number of tweets were related to police violence and search for words related to that topic.
But you aren’t always sure what the concepts in a document might be. If you are transcribing notes or journal entries, you might want to read a few first to get a sense of the content and how it is expressed. You can identify general trends in what is being discussed and use those categories to inform the variables and “code words” you then look for. This is the “bottom-up approach.”
The last thing you want to be attentive to in coding is what information you actually want to gather about these concepts. Are you counting up the number of references? Examining the context in which they were mentioned? Looking for tone (positive or negative, for example)? Looking for steps in a process? There are lots of types of meaning that you could look for – you just need to be clear as to what you are analyzing.
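To make the idea of “code words” concrete, here is a minimal Python sketch that counts how often each code word appears in a set of short texts. The code words and example texts are invented for illustration; a real project would draw them from the research design.

```python
from collections import Counter
import re

# Hypothetical code words for a "police violence" variable.
CODE_WORDS = {"police", "violence", "protest"}

texts = [
    "Protest downtown over police conduct.",
    "New park opens this weekend.",
    "Police respond to protest; no violence reported.",
]

counts = Counter()
for text in texts:
    tokens = re.findall(r"[a-z']+", text.lower())  # simple word tokenizer
    for token in tokens:
        if token in CODE_WORDS:
            counts[token] += 1

print(dict(counts))
```

This is the top-down approach in miniature: the variable’s code words are fixed in advance, and the program simply tallies their appearances across documents.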
Intercoder Reliability
The process I just described is qualitative – it is done by people reading documents and making decisions. One of the weaknesses of qualitative research is reliability – it is difficult to replicate others’ work. To address this concern, we typically have more than one person code the documents we are interested in. By cross-checking how different individuals coded the same texts, we can determine how consistent our coding is. This is called checking inter-coder reliability.
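The simplest check of inter-coder reliability is percent agreement: the share of documents two coders labeled the same way. Here is a minimal Python sketch with hypothetical codes from two coders.

```python
# Hypothetical codes assigned by two coders to the same five documents.
coder_a = ["violence", "policy", "violence", "other", "policy"]
coder_b = ["violence", "policy", "other", "other", "policy"]

# Percent agreement: share of documents where the two coders match.
matches = sum(a == b for a, b in zip(coder_a, coder_b))
agreement = matches / len(coder_a)

print(f"Percent agreement: {agreement:.0%}")  # 4 of 5 codes match -> 80%
```

Note that raw agreement does not correct for the matches two coders would produce by chance; measures like Cohen’s kappa exist for that, but the intuition is the same.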
Quantitative Approaches
Quantitative approaches to text analysis essentially scale up these same strategies to larger numbers of documents. This can mean anything from using R or Python to count incidences of words or patterns of words to what is called “natural language processing” – training a model to identify the tone or context of phrases. I will mention two common techniques here. First is to mine data from social media posts. I’m not sure what the future of this method will be, but Facebook and Twitter used to make posts available to researchers for projects through an API – an application programming interface that allowed scholars to read the content of social media posts without obtaining personally identifiable information on the person associated with it. This allowed for research projects on the frequency and content of social media posts during the Arab Spring, for example.
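As a flavor of what “identifying tone” can look like at its simplest, here is a dictionary-based Python sketch: count positive words minus negative words. The word lists here are invented and tiny; real sentiment lexicons and NLP toolkits are far larger and validated.

```python
import re

# Hypothetical tone dictionaries (illustrative only; real lexicons
# used in research are much larger and carefully validated).
POSITIVE = {"progress", "support", "hope"}
NEGATIVE = {"violence", "fear", "crisis"}

def tone_score(text: str) -> int:
    """Positive minus negative word count; above zero suggests positive tone."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)

print(tone_score("Signs of progress and hope after the crisis."))  # 2 - 1 = 1
```

The same principle underlies more sophisticated natural language processing: the model’s judgments are still grounded in patterns of words associated with the variable of interest.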
Another common phrase you will hear is “scraping the web” – this just means using a computer program (like Python) to collect raw data from websites. It allows you to analyze blogs or Reddit entries – pretty much anything online that is not password protected.
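To show what scraping involves once a page is downloaded, here is a sketch using Python’s built-in HTML parser to pull paragraph text out of a page. The HTML is embedded as a string so the example runs on its own; in practice you would fetch pages with a library such as `urllib` or `requests`, and you should respect a site’s terms of service.

```python
from html.parser import HTMLParser

# Embedded stand-in for a downloaded web page (hypothetical content).
HTML = """
<html><body>
  <p>First blog entry.</p>
  <p>Second blog entry.</p>
</body></html>
"""

class ParagraphExtractor(HTMLParser):
    """Collects the text inside every <p> tag."""

    def __init__(self):
        super().__init__()
        self.in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p and data.strip():
            self.paragraphs.append(data.strip())

parser = ParagraphExtractor()
parser.feed(HTML)
print(parser.paragraphs)  # ['First blog entry.', 'Second blog entry.']
```

Once the text is extracted, it feeds into exactly the coding and counting steps described above.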
The principle behind quantitative text analysis, though, is the same as for qualitative coding – you need to identify the variables you are interested in and what words, phrases, or patterns are associated with them.
Conclusion
In conclusion, text analysis – whether qualitative or quantitative – is a powerful tool you can use to compare and contrast documents. I want to end with a note of caution, though. Coding texts is not the same thing as reading them. Coding and this type of analysis can help you quantify tone or compare documents to each other, but it doesn’t tell you the overall meaning of a text. Text analysis should only be used when it makes sense to break a text down into its parts, not when you need overall comprehension.