Types of Validity

Post date: Aug 26, 2012 5:03:04 PM

The concept of validity applies to both whole studies (often called inference validity) and the measurement of individual variables (often called construct validity).

Inference Validity

Inference validity refers to the validity of a research design as a whole. It refers to whether you can trust the conclusions of a study. Generally the two key issues are causality and generalizability. Statistical measures show relationships, but it is the theory and the study design that determine what kinds of claims to causality you can reasonably make and what you can make them about.

1. Internal validity (largely about interpretability)

Refers to whether claimed conclusions, especially relating to causality, are consistent with research results (e.g., statistical results) and research design (e.g., presence of appropriate control variables, use of appropriate methodology).

An obvious example of failing internal validity is when a researcher misinterprets a statistical result. For example, in ordinary regression, you want a significant r-square, as this implies that knowing the scores on the independent variables helps predict the dependent variable. But in log-linear models, you want the chi-square test to be non-significant, because that means the model fits -- the predicted values are not significantly different from the observed values. Going from one kind of model to another, it is easy to make a mistake and misinterpret the meaning of a significant chi-square.
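A quick sketch of the contrast, using made-up data (the variable names, simulated values, and cell counts are illustrative only): in regression, a small p-value is good news, while in a goodness-of-fit test a small p-value means the model fails.

```python
# Illustrative only: simulated regression data, hypothetical cell counts.
import numpy as np
from scipy import stats

# Ordinary regression: a SIGNIFICANT test (small p) is good news --
# the predictor helps explain the dependent variable.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(size=200)
slope, intercept, r, p_reg, se = stats.linregress(x, y)
# p_reg is tiny here: the model "works"

# Goodness-of-fit (as in log-linear modeling): a NON-significant
# chi-square is good news -- predicted counts are close to observed counts.
observed = np.array([52, 48, 55, 45])
expected = np.array([50, 50, 50, 50])  # counts predicted by the model
chi2, p_fit = stats.chisquare(observed, expected)
# p_fit is large here: the model fits
```

Note the flip: the "good" outcome is p < .05 in the first test and p > .05 in the second.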

Another example is when a researcher claims that the reason X affects Y is that X-->M and M-->Y. A pair of results consistent with this causal chain is as follows. When we regress Y on X, the coefficient for X is significant. But when we regress Y on both X and M, the coefficient for X is no longer significant, because we are controlling for the intermediary variable. The problem is, mediation isn't the only thing that could lead to that result. It could be that M causes both X and Y, in which case controlling for M would also yield a non-significant coefficient for X. Same empirical results, but a different causal model.

Internal validity can sometimes be checked via simulation, which can tell you whether a given theorized process could in fact yield the outcomes that you claim it does.
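The mediation-versus-common-cause ambiguity above is easy to demonstrate by simulation. A minimal sketch (all data are simulated; coefficients are estimated by ordinary least squares):

```python
# Simulated data only: both causal models below reproduce the same
# regression pattern (X related to Y alone, but not controlling for M).
import numpy as np

rng = np.random.default_rng(42)
n = 5000

def coef_on_x(X, M, Y):
    """OLS coefficient on X when regressing Y on an intercept, X, and M."""
    Z = np.column_stack([np.ones(len(X)), X, M])
    return np.linalg.lstsq(Z, Y, rcond=None)[0][1]

# Model A: mediation, X --> M --> Y
X = rng.normal(size=n)
M = X + rng.normal(scale=0.5, size=n)
Y = M + rng.normal(size=n)
r_a = np.corrcoef(X, Y)[0, 1]     # sizable: X predicts Y on its own
b_med = coef_on_x(X, M, Y)        # near zero once M is controlled

# Model B: common cause, M --> X and M --> Y (no X --> Y path at all)
M2 = rng.normal(size=n)
X2 = M2 + rng.normal(scale=0.5, size=n)
Y2 = M2 + rng.normal(size=n)
r_b = np.corrcoef(X2, Y2)[0, 1]   # also sizable
b_conf = coef_on_x(X2, M2, Y2)    # also near zero once M2 is controlled
```

The empirical fingerprints are identical, so the data alone cannot distinguish mediation from confounding; that is a matter of theory and research design.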

2. External validity (generalizability)

This refers to the generalizability of results. Does the study say anything outside of the particular case? For example, in your study of 150 workers in a consulting company's IT department, you find that the more central they are in the friendship network, the better they do their jobs. To what extent can you say this is true of other workers?

A carpenter, a school teacher, and a scientist were traveling by train through Scotland when they saw a black sheep through the window of the train. "Aha," said the carpenter with a smile, "I see that Scottish sheep are black." "Hmm," said the school teacher, "You mean that some Scottish sheep are black." "No," said the scientist, "All we know is that there is at least one sheep in Scotland, and that at least one side of that one sheep is black."

Three strategies for strengthening external validity:

  • Sampling. Select cases from a known population via a probability sample (e.g., a simple random sample). This provides a very strong basis for claiming the results apply to the population as a whole.

  • Representativeness. Show the similarities between the cases you studied and a population you wish your results to apply to, and argue that the relationships you found in your study will also hold in the other setting.

  • Replication. Repeat the study in multiple settings. Use meta-analysis to evaluate the results across studies. Although journal reviewers don't always agree, consistent results across many small-sample settings are more powerful evidence than a large sample from a single setting -- no matter how large the sample is in that one setting, the setting can be very different from other settings.
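As a sketch of the meta-analytic idea, here is a fixed-effect, inverse-variance pooling of correlations from several hypothetical small studies (all numbers are invented), using the Fisher z transformation:

```python
# Hypothetical (r, n) pairs from five small replications of the same study.
import math

studies = [(0.30, 40), (0.25, 55), (0.35, 30), (0.28, 60), (0.22, 45)]

num = den = 0.0
for r, n in studies:
    z = 0.5 * math.log((1 + r) / (1 - r))  # Fisher z transform of r
    w = n - 3                              # inverse-variance weight: var(z) = 1/(n-3)
    num += w * z
    den += w

z_bar = num / den                          # weighted mean effect in z units
r_bar = math.tanh(z_bar)                   # back-transform to a correlation
se = 1 / math.sqrt(den)                    # standard error of the pooled z
```

The pooled estimate draws strength from the total weight across settings, which is what makes the consistency of many small studies persuasive.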

Construct Validity

Construct validity refers to the validity of a variable that is being measured. Many subtypes have been defined. One should not get too hung up on the exact terminology, because there is a lot of variation in usage. The breakdown below is Trochim's version.

1. Translation Validity

Subjective evaluation of whether a measure matches the construct it is meant to measure. Do the questions in the survey make sense "on the face of it" for measuring what you are trying to measure?

Face validity is often used to mean 'does it pass the test of common sense?' Does the measure mean the same thing as the concept? E.g., if you want to know if someone is a liberal, asking "Are you a liberal?" has a lot of face validity. Asking "Do you have a precocious child?" has low face validity (but might predict well).

Content validity. Do all of the elements of the measure seem connected in the right direction to the concept? E.g., in determining whether there is fire, asking: Is there smoke? Destruction? Heat? Ash? Burnt stuff? For a different kind of example, suppose you create a measure of emotional warmth in a batch of emails, and you do this by counting up the number of "warm" words, like "happy", "satisfied", etc. To check the content validity, you would make sure that each of the words used did in fact have warm connotations for most (relevant) people. Note that a word highly correlated with happiness, such as "money", isn't necessarily valid from a content validity point of view. It might be a cause of happiness (doubtful!), but it is not synonymous with happiness.
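The email-warmth measure might look like the following sketch. The lexicon here is hypothetical, and vetting each word in it is exactly the content-validity check described above:

```python
# Hypothetical word-count measure of emotional warmth in an email.
import re

# Hypothetical lexicon -- content validity means vetting every entry:
# does each word actually carry warm connotations for (relevant) readers?
WARM_WORDS = {"happy", "satisfied", "glad", "delighted", "grateful"}

def warmth_score(email: str) -> int:
    """Number of warm-word tokens in the email (case-insensitive)."""
    tokens = re.findall(r"[a-z']+", email.lower())
    return sum(1 for t in tokens if t in WARM_WORDS)

warmth_score("I'm so happy and grateful for your help!")  # counts 2 warm words
```

A word like "money" would boost the count's correlation with happiness but would fail the content-validity check, since it is not itself a warm word.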

2. Criterion Validity

How well the measure relates to other measures and characteristics.

2.1 Predictive validity. Ability to predict future events. E.g., a divorce scale (an attitudinal battery of questions that measures risk of divorce) should actually predict future divorces; similarly, intent-to-buy attitude scales should actually predict future purchases.

2.2 Concurrent validity. Ability to discriminate between relevant groups. For example, if you are testing math, engineers should do better on the test than poets.

2.3 Convergent validity. Does the measure correlate positively with other measures of the same construct, or measures of very similar constructs? e.g., a new, easier-to-administer scale of organizational commitment should correlate strongly with the old, longer scale that it is intended to replace.

2.4 Discriminant validity. The measure should correlate poorly with measures of different constructs. E.g., we don’t want our emotional intelligence measure to correlate too well with self-monitoring. It should be measuring something different.
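Convergent and discriminant validity are typically checked together with a correlation matrix. A sketch with simulated data (the scales and their relationships are invented for illustration):

```python
# Simulated respondents: old and new commitment scales share a latent
# trait; self-monitoring is an unrelated construct.
import numpy as np

rng = np.random.default_rng(7)
n = 1000
commitment = rng.normal(size=n)                       # latent trait
old_scale = commitment + rng.normal(scale=0.4, size=n)
new_scale = commitment + rng.normal(scale=0.4, size=n)
self_monitoring = rng.normal(size=n)                  # different construct

conv = np.corrcoef(new_scale, old_scale)[0, 1]        # convergent: should be high
disc = np.corrcoef(new_scale, self_monitoring)[0, 1]  # discriminant: should be near zero
```

If `conv` were low, or `disc` were as high as `conv`, we would doubt that the new scale measures organizational commitment and nothing else.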