In today’s digital world, the internet has made text, from emails and social media posts to news articles, easier to access than ever before. For researchers, this opens up exciting opportunities to understand human behavior, cultural trends, and societal changes. Imagine analyzing thousands of news articles or millions of tweets to uncover patterns in public opinion or explore how language reflects social values. However, there’s a catch: computers don’t understand text the way humans do. That’s where my research comes in.
I studied the automated methods researchers use to analyze large amounts of text. These methods are incredibly powerful but come with challenges: a model can produce misleading results if it hasn’t been properly “validated” (a fancy way of saying “checked to make sure it’s doing what it’s supposed to do”). My dissertation focused on how to make these methods more reliable and trustworthy, particularly in the social sciences.
Looking at Word Relationships
One of the tools I studied is called “word embeddings.” These are models that learn the meanings of words from the company they keep: which words tend to appear near one another in large collections of text. For example, a model might learn that “dog” and “puppy” are similar, or that “Paris” is to “France” as “Rome” is to “Italy.” I explored how adjusting different settings in these models changes their results and what researchers can do to ensure they’re accurate.
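For readers curious about what this looks like in practice, here is a minimal sketch in Python using the gensim library and one of its small pretrained embedding sets. The model name and the word examples are my own illustration, not taken from any particular study:

```python
# A small sketch of probing word relationships with pretrained embeddings.
# Assumes the gensim library; "glove-wiki-gigaword-50" is one of the small
# pretrained vector sets that gensim's downloader can fetch.
import gensim.downloader as api

# Load 50-dimensional GloVe vectors trained on Wikipedia and Gigaword text.
vectors = api.load("glove-wiki-gigaword-50")

# Words with related meanings sit close together in the vector space.
print(vectors.similarity("dog", "puppy"))  # cosine similarity; higher = more similar

# Analogy arithmetic: vector("paris") - vector("france") + vector("italy")
# should land near vector("rome").
print(vectors.most_similar(positive=["paris", "italy"], negative=["france"], topn=3))
```

Even in this tiny example, the choices hiding in the background (how many dimensions the vectors have, what text they were trained on) are exactly the kind of settings whose effects I examined.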
Digging Into Topic Models
Another tool I looked at is “topic modeling,” which works like a virtual librarian, sorting a pile of documents into topics based on their content. For example, it might group articles into themes like “health,” “politics,” or “technology.” I reviewed hundreds of studies that used this tool and found that researchers often lacked consistent ways to check whether their models were working correctly. I also experimented with different validation techniques to show how they influence results.
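As a concrete illustration, here is a minimal topic-modeling sketch using scikit-learn’s LDA implementation. The four toy documents and the choice of two topics are my own illustrative assumptions:

```python
# A minimal topic-modeling sketch with scikit-learn's LatentDirichletAllocation.
# The toy documents and the choice of two topics are illustrative assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the new vaccine reduces hospital admissions",
    "doctors report fewer flu cases this winter",
    "parliament debates the new election law",
    "the senate vote on the budget was close",
]

# Turn raw text into word counts, then fit a two-topic model.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Print the words the model associates most strongly with each topic.
words = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [words[j] for j in topic.argsort()[::-1][:4]]
    print(f"Topic {i}: {', '.join(top)}")
```

Notice the settings hiding in plain sight: the number of topics, the stop-word list, even the random seed. Each is a choice that can shift the results, which is exactly why validation matters.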
Making Validation Easier
Across all my studies, I emphasized the importance of being transparent about how these models are built and validated. I offered practical recommendations to help researchers make their models more trustworthy, such as documenting their choices, checking their models in more than one way, and incorporating expert feedback into the process.
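One concrete example of the kind of check I have in mind: fit the same topic model twice with different random seeds and see whether the topics hold up. This sketch reuses the toy corpus from above and is, again, an illustration rather than a recipe:

```python
# A sketch of one simple validation check: topic stability across random seeds.
# If the same top-word sets reappear no matter the seed, that is one small piece
# of evidence the model is reliable; if they change drastically, the results
# deserve scrutiny before anyone interprets them.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the new vaccine reduces hospital admissions",
    "doctors report fewer flu cases this winter",
    "parliament debates the new election law",
    "the senate vote on the budget was close",
]
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)
words = vectorizer.get_feature_names_out()

def top_word_sets(seed, n_top=4):
    """Fit a two-topic LDA with the given seed; return each topic's top words."""
    lda = LatentDirichletAllocation(n_components=2, random_state=seed).fit(counts)
    return [frozenset(words[j] for j in topic.argsort()[::-1][:n_top])
            for topic in lda.components_]

# Compare two runs: for each topic in run A, find its best match in run B
# and report the Jaccard overlap of their top words (1.0 = identical).
run_a, run_b = top_word_sets(seed=0), top_word_sets(seed=1)
for a in run_a:
    best = max(len(a & b) / len(a | b) for b in run_b)
    print(sorted(a), f"best overlap with other run: {best:.2f}")
```

This is just one check among many; my point is that combining several such checks, and documenting them, is what makes the results worth trusting.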
Without proper validation, automated text analysis can lead to unreliable or even misleading conclusions. This is a big deal for the social sciences, where researchers study complex human behaviors and societal issues. By improving how we validate these tools, we can ensure that the insights they provide are meaningful and accurate.
This research isn’t just academic—it has real-world implications. For example, it can help policymakers better understand public opinion, enable businesses to track customer sentiment more effectively, and support journalists in uncovering trends in global events. Ultimately, it’s about building tools that help us make sense of the vast amount of information in today’s digital age in ways we can trust.