On the Importance of Being Valid:
Thoughts on the meaning of QE in the upcoming QE-COVID Data Challenge
(DRAFT VERSION ONLY)
David Williamson Shaffer and Friends
As we embark on this first-of-its-kind QE-COVID Data Challenge, I have been thinking about what makes this a QE-COVID Data Challenge as opposed to any other COVID Data Challenge. After all, there are already quite a number of studies online that look at COVID data, including studies using text mining, natural language processing, and other approaches to the kinds of data that quantitative ethnographers use (e.g., [1–3]).
So, what distinguishes QE research from these other studies and approaches?
At a very surface level, there are some tools in the QE kit that researchers in other areas of the digital humanities and social sciences do not commonly use. But it is also true that researchers can and do implement QE analyses without using tools such as nCoder or ENA (e.g., [4, 5]). Any analytical tool can be used for QE research if it is used in a “QE way.” And, more to the point: it is definitely possible to use QE tools and not be doing good quantitative ethnography. Jim Folkestad once referred to this as “doing quantitative ethnography without the ethnography.”
What distinguishes QE from other approaches to big data is not the tools but rather the focus on validity. Validity is both a complex and contentious term. Scholars debate the merits of various kinds of validity—face validity, construct validity, external validity, and so on [6–9]. But validity, in all of its many faces, is ultimately about something much more important. Validity is about meaning.
The problem of meaning is fundamental to any human endeavor, but it is particularly salient in the many fields that see themselves under the broad umbrella of data science. A great deal of work in this loose collection of disciplines develops and uses statistical techniques to find patterns in large datasets. So far, so good. However, some of that work (and, in fact, too much of that work) is done assuming that the interpretations of the data being analyzed—the words being used, the questions being asked, the demographic information, or even the criteria for success—are unambiguous.
The problem with such approaches is that the meaning of any single data point is difficult to pin down. Understanding the meaning of a piece of text (or a GIF, or a video, or a graph, or a diagram) analytically means understanding how those who make and consume it interpret what is written (or recorded, or drawn). This is the challenge of casting our etic analyses in terms of emic perspectives, and it is no small feat. As the philosopher Gilbert Ryle made clear, distinguishing something as simple as a wink from an involuntary twitch requires an understanding of more than just the immediately observable contraction of an eyelid [10].
But we do not need to look back to a philosopher writing in the last century to understand the problem here. Just ask any adult who has made the mistake of trying to comment on a meme or TikTok dance to a teenager. The resulting sigh and roll of the eyes are the markers that the adult does not understand the culture they are addressing.
The importance of this is clear. If we are not sure that the underlying data mean what we think they mean, then the patterns that we find do not have much practical use—a problem that computer scientists refer to as GIGO, or garbage in, garbage out. The question, then, is how to make sure that the linkages between data and analysis are valid, in the sense that our interpretation of the analysis and our interpretation of the data are both sound and consistent with one another.
In quantitative ethnography, this question is addressed in two broad ways.
The first is in the process of coding the data, which is, of course, just a formal way of talking about attributing meaning to data. Every code is an interpretation, in the sense that it makes a claim about how we should make sense of some piece of data. There are many processes by which researchers code data, including coding by hand or, especially with larger datasets, using tools like nCoder, topic models, and deep neural nets. Each of those tools has its advantages and disadvantages, and any of them can be used to code data in a QE way—but only if the resulting coding is validated, in the sense that the codes produced, at a minimum, align with the interpretations of at least two human coders.
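To make the idea of automated coding concrete, here is a minimal sketch (in Python) of the kind of regular-expression classifier that a tool like nCoder helps researchers develop and then validate. The code name, patterns, and excerpts below are hypothetical examples, not drawn from any actual codebook.

```python
# A minimal, hypothetical sketch of automated coding with regular expressions:
# the kind of classifier a tool like nCoder helps researchers build and validate.
# The code name, patterns, and excerpts are illustrative only.
import re

# Hypothetical binary code "Testing": present when an excerpt mentions COVID testing.
TESTING_PATTERNS = [r"\btest(s|ed|ing)?\b", r"\bswab(s|bed|bing)?\b", r"\bpcr\b"]

def apply_code(excerpt: str, patterns=TESTING_PATTERNS) -> int:
    """Return 1 if any pattern matches the excerpt, 0 otherwise."""
    return int(any(re.search(p, excerpt, flags=re.IGNORECASE) for p in patterns))

excerpts = [
    "They said the PCR results would take three days.",
    "School starts again next week.",
    "We got swabbed at the drive-through site.",
]
print([apply_code(e) for e in excerpts])  # [1, 0, 1]
```

On its own, such a classifier only encodes one researcher's hunches about what matters in the data; the question of how its output gets validated is taken up next.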
The reason for using two human coders is that agreement between one human coder and an automated coding process (such as a topic model, set of regular expressions, or a Bayesian neural net) shows only that the automated process is modeling one person’s interpretation of the data. But meaning is a cultural—and therefore public—phenomenon. For example, many languages have multiple forms of the word “you” (tú and usted in Spanish; hajoor, tapai, timi, and ta in Nepali; and similar distinctions in many other languages). However well-intentioned, using the wrong “you”—that is, misaligning one’s own understanding of a situation with the surrounding cultural meanings—can cause confusion, laughter, or offense. Getting agreement between two human coders and the automated process (sometimes called 3-way or triangulated agreement) ensures that the codes applied to the data are aligned with some shared understanding of its meaning.[Footnote 1]
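As a concrete illustration of triangulated agreement, the sketch below checks every pairwise combination of two human coders and an automated classifier on the same excerpts, using Cohen's kappa as the agreement statistic. The ratings are invented for the example; in practice they would come from a coded test set.

```python
# A minimal sketch of checking three-way (triangulated) agreement on a single
# binary code. The rater labels below are hypothetical; in practice they would
# come from two human coders and an automated classifier applied to the same excerpts.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

ratings = {
    "human_1":   [1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0],
    "human_2":   [1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0],
    "automated": [1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0],
}

# Agreement must hold for every pair: human vs. human establishes that the
# interpretation is shared, and each human vs. the automated coder establishes
# that the classifier captures that shared interpretation.
for (name_a, labels_a), (name_b, labels_b) in combinations(ratings.items(), 2):
    kappa = cohen_kappa_score(labels_a, labels_b)
    print(f"{name_a} vs. {name_b}: kappa = {kappa:.2f}")
```

If any pair falls below an acceptable level of agreement, the shared interpretation (or the classifier) needs to be revisited before the code can be treated as validated.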
The usual process for doing this is computing some inter-rater reliability (IRR) statistic, and there are many to choose from. But IRR statistics are notoriously fickle, in the sense that they provide a measurement of agreement but no indication of how accurate that measurement is: they produce a number but no confidence interval.[Footnote 2] This is why Shaffer’s ρ (rho) is so important to quantitative ethnography. It provides a statistical warrant that a rate of agreement is above some threshold, and thus makes a measurement of IRR statistically valid.[Footnote 3]
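To give a feel for what such a statistical warrant looks like, here is a minimal Monte Carlo sketch in the spirit of ρ: it asks how often agreement as high as the observed kappa would arise if the coders' true agreement were only at the threshold. All numbers are hypothetical, and this is an illustration of the rejective logic only, not the published ρ procedure, which handles base rates, sample sizes, and test-set construction with far more care.

```python
# A minimal sketch, in the spirit of Shaffer's rho, of an empirical (Monte Carlo)
# rejective test: how often would agreement this high arise if the coders' true
# agreement were only at the threshold? All numbers here are hypothetical.
import numpy as np

def cohen_kappa(a, b):
    """Cohen's kappa for two binary label vectors."""
    p_o = np.mean(a == b)                                                # observed agreement
    p_e = np.mean(a) * np.mean(b) + (1 - np.mean(a)) * (1 - np.mean(b))  # chance agreement
    return (p_o - p_e) / (1 - p_e)

def simulate_null_kappa(n_items, base_rate, threshold, rng):
    """Simulate one rater pair whose expected kappa equals the threshold.

    Rater 2 copies rater 1 with probability `threshold` and otherwise codes
    independently at the same base rate; under this simple model the expected
    kappa equals the copy probability.
    """
    rater1 = rng.random(n_items) < base_rate
    copies = rng.random(n_items) < threshold
    independent = rng.random(n_items) < base_rate
    rater2 = np.where(copies, rater1, independent)
    return cohen_kappa(rater1, rater2)

rng = np.random.default_rng(0)
observed_kappa = 0.82            # hypothetical kappa on an 80-item test set
n_items, base_rate, threshold = 80, 0.3, 0.65

null_kappas = np.array([simulate_null_kappa(n_items, base_rate, threshold, rng)
                        for _ in range(10_000)])
p_value = np.mean(null_kappas >= observed_kappa)
print(f"Probability of kappa >= {observed_kappa} if true agreement "
      f"were only {threshold}: p = {p_value:.3f}")
```

If that probability is below a conventional significance level, there is a statistical warrant that the true agreement exceeds the threshold, rather than merely appearing to on this particular sample.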
Once codes are (correctly) validated, there is a second sense in which quantitative ethnographers check that their interpretation of the analysis and their interpretation of the data are consistent with one another. One technique for confirming this alignment is closing the interpretive loop: taking the results of a quantitative model and using them as the starting point for a qualitative re-analysis of the data. Put more simply, closing the interpretive loop means returning to the original data to confirm that the quantitative findings give a fair representation of (are a fair sample of [11, 12]) the data.
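One simple way to begin that return to the data, sketched below with placeholder scores and excerpts, is to pull the units the quantitative model places at its extremes and re-read them qualitatively, asking whether they actually differ in the way the model claims. The column names and values are hypothetical, not a prescribed QE workflow.

```python
# A minimal sketch of one way to begin closing the interpretive loop: select
# the units a quantitative model places at the extremes and return to their
# original excerpts for qualitative re-reading. Column names, scores, and
# excerpts are placeholders, not real data.
import pandas as pd

units = pd.DataFrame({
    "unit_id": ["A", "B", "C", "D", "E", "F"],
    "model_score": [2.1, -0.4, 1.7, -1.9, 0.2, -2.3],  # e.g., position on a model dimension
    "excerpt": [f"placeholder excerpt {i}" for i in range(1, 7)],
})

# Re-read the extremes: do the excerpts at opposite ends of the model differ
# in the way the model claims they do?
for label, sample in [("highest", units.nlargest(2, "model_score")),
                      ("lowest", units.nsmallest(2, "model_score"))]:
    print(f"--- {label}-scoring units ---")
    for row in sample.itertuples():
        print(f"{row.unit_id} ({row.model_score:+.1f}): {row.excerpt}")
```

Re-reading those excerpts against the model's claims is only the first step; as the next paragraph notes, the model itself needs the same scrutiny.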
Closing the interpretive loop, however, is not merely checking the data to see if the quantitative results make sense. It is also examining the model itself to check that the parameters and assumptions of the model are aligned with the mechanisms described by a qualitative analysis.
This is the distinguishing characteristic of quantitative ethnography: quantitative models need to be unified with qualitative analyses—and it is why quantitative ethnographers are so particular about their choice of a modeling framework. Finding a statistically significant result is not particularly useful if the mathematical model has a different structure than the theoretical mechanisms claimed to be at work in the data. So-called black box models make it difficult to confirm that the model represents the phenomena in question: either their parameters are not open to inspection, or the structure of the model is too complex to interpret. Neural nets, for example, are notorious in this regard. In QE, this problem is described as GOGI, or garbage operationalization, garbage interpretation, and it is why model transparency is so important to quantitative ethnographers.
Closing the interpretive loop should not be confused with predictive validity, which is widely used in data science. Predictive validity is concerned with the accuracy of a model—that is, how reliably it predicts some outcome. Quantitative ethnographers care about model accuracy, of course. But they also care about the validity of the interpretations on which a model is based and the meanings it constructs. Testing predictive validity alone is the modeling equivalent of computing IRR between a model and a single human rater: it does not provide a warrant for the cultural (that is, public) interpretations being made. It shows that two things are mathematically related without saying why. As a result, such approaches are particularly vulnerable to problems of systematic bias and the obvious issues that creates.
Where does that leave us as we embark on a QE-COVID Data Challenge?
This line of thinking suggests that what distinguishes a QE analysis from any other exploration in data science is the attention we pay to questions of validity: validity of coding and validity of the model itself. There are a number of tools in the QE kit that help quantitative ethnographers accomplish this. But perhaps the most important is that quantitative ethnographic analyses unify a qualitative and quantitative model. QE analyses need to present both a mathematical and a grounded account of events in the world, both of which are sound on their own, and both of which are attending to the same mechanisms at work in the same set of data.
To be clear, this is not to suggest that there is no room in the QE universe for exploratory analyses, where concerns of validity are relaxed. Quantitative ethnographers can, do, and should at times conduct analyses that use unvalidated codes, or put data into a model as a way of gaining insight into its meaningful structure. There is a time and a place for using latent semantic analysis to get insight into possible categories in the data, or for putting a large set of features into a model to get a mechanical grip on a complex dataset. And of course, it is possible to start with such approaches and validate them later—to use, for example, topic modeling to generate initial codes that are ultimately validated in a final analysis.
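As one illustration of this kind of exploratory move, the sketch below fits a small topic model to placeholder documents and prints the top terms of each topic as candidate categories. Everything shown (the documents, the number of topics, the parameters) is hypothetical, and any codes suggested this way would still need to be validated before a final analysis.

```python
# A minimal sketch of exploratory topic modeling used to surface candidate
# categories that would still need to be turned into codes and validated.
# The documents are placeholders; parameters are illustrative, not recommendations.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "placeholder excerpt about testing and contact tracing",
    "placeholder excerpt about school closures and remote learning",
    "placeholder excerpt about masks and social distancing",
    "placeholder excerpt about vaccine trials and approval",
    "placeholder excerpt about testing delays and lab capacity",
    "placeholder excerpt about remote learning and screen time",
]

vectorizer = CountVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(documents)

lda = LatentDirichletAllocation(n_components=3, random_state=0)
lda.fit(doc_term)

# Print the top terms for each topic as a starting point for candidate codes.
terms = vectorizer.get_feature_names_out()
for i, weights in enumerate(lda.components_):
    top_terms = [terms[j] for j in weights.argsort()[::-1][:5]]
    print(f"candidate category {i}: {', '.join(top_terms)}")
```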
But that does not mean that exploratory analyses should be presented as conclusive. For example, in learning analytics, researchers need to be careful not to make assessments about students or recommendations to teachers without holding models to the very highest standards of validity. Otherwise the inferences parents or teachers or administrators or universities make about students may not be meaningful or fair. This is even more important when making public statements about something as serious as a global pandemic. Our analyses have to withstand the challenge of public scrutiny and provide the level of accuracy needed when lives are at stake.
It is, in my opinion, this attention to questions of validity—and specifically the approach of addressing them through the unification of qualitative and quantitative analyses—that distinguishes the field of QE, and that I hope will be the hallmark of the QE-COVID Data Challenge.
References
1. Joshi, B., Bakarola, V., Shah, P., & Krishnamurthy, R. (2020). deepMINE - Natural Language Processing based Automatic Literature Mining and Research Summarization for Early-Stage Comprehension in Pandemic Situations specifically for COVID-19. bioRxiv, 2020.03.30.014555.
2. Li, L., Zhang, Q., Wang, X., Zhang, J., Wang, T., Gao, T.-L., … Wang, F.-Y. (2020). Characterizing the Propagation of Situational Information in Social Media During COVID-19 Epidemic: A Case Study on Weibo. IEEE Transactions on Computational Social Systems, 7(2), 556–562.
3. Lopez, C. E., Vasu, M., & Gallemore, C. (2020). Understanding the perception of COVID-19 policies by mining a multilanguage Twitter dataset. arXiv, 2003.10359 [cs.SI].
4. Zörgo, S., & Peters, G.-J. Y. (2019). Epistemic Network Analysis for Semi-structured Interviews and Other Continuous Narratives: Challenges and Insights. In Advances in Quantitative Ethnography: First International Conference, ICQE 2019, Madison, WI, USA, October 20–22, 2019, Proceedings (pp. 267–277). Springer.
5. Shum, S. B., Echeverria, V., & Martinez-Maldonado, R. (2019). The Multimodal Matrix as a Quantitative Ethnography Methodology. In Advances in Quantitative Ethnography: First International Conference, ICQE 2019, Madison, WI, USA, October 20–22, 2019, Proceedings (pp. 26–40). Springer.
6. Messick, S. (1987). Validity (ETS Research Reports). Educational Testing Service.
7. Creswell, J. W., & Miller, D. L. (2000). Determining validity in qualitative inquiry. Theory into Practice, 39(3), 124–130.
8. Whittemore, R., Chase, S. K., & Mandle, C. L. (2001). Validity in qualitative research. Qualitative Health Research, 11(4), 522–537.
9. Borsboom, D., Mellenbergh, G. J., & Van Heerden, J. (2004). The concept of validity. Psychological Review, 111(4).
10. Ryle, G. (1968). The thinking of thoughts: What is “Le Penseur” doing? In University lectures. University of Saskatchewan.
11. Goodman, N. (1978). Ways of worldmaking. Indianapolis, IN: Hackett.
12. Shaffer, D. W. (2017). Quantitative ethnography. Madison, WI: Cathcart Press.
[1] There is, of course, nothing magical about the number two. Getting agreement among more people provides a stronger warrant that meaning is publicly shared. That having been said, the most important difference is between coding based on a single person versus two or more, because that is the line between individual, idiosyncratic interpretation and public, shared understanding.
[2] Historically, this lack of attention to accuracy stems neither from ignorance of the problem nor from neglect. IRR statistics in general do not have normal distributions, so in most cases it is impossible to derive appropriate confidence intervals analytically. ρ solves this problem using an empirical (rather than analytical) rejective method. (See [12] for more on this issue.)
[3] The problem of IRR statistics goes far beyond quantitative ethnography. Fields from anthropology to education, to medicine, to sociology—that is, fields in which coding and IRR statistics are used—need to establish the accuracy of their measures of agreement, and no field other than quantitative ethnography currently does this. Reporting IRR on a code without establishing its accuracy (using ρ or any other technique) commits the same statistical fallacy as reporting that the means of two groups are different without reporting whether the difference is significant. It is, quite simply, not statistically sound, and therefore bad research practice.