An instrument is valid only to the extent that its scores permit appropriate inferences to be made about (1) a specific group of people for (2) specific purposes.
An instrument that is a valid measure of third graders' math skills is probably not a valid measure of high school calculus students' math skills. An instrument that is a valid predictor of how well students might do in school may not be a valid measure of how well they will do once they complete school. So we never say that an instrument is valid or not valid; we say it is valid for a specific purpose with a specific group of people. Validity is specific to the appropriateness of the interpretations we wish to make with the scores.
In the reliability section, we discussed a scale that consistently reported a weight of 15 pounds for someone. While it may be a reliable instrument, it is not a valid instrument for determining that person's weight (assuming the individual does not actually weigh 15 pounds). Instruments can be valid for one purpose but not another: a measuring tape is a valid instrument for determining people's height, but it is not a valid instrument for determining their weight.
There are three general categories of instrument validity.
Content-Related Evidence
Content validity refers to the extent to which a test measures all aspects of the construct it aims to assess. It involves a systematic examination of the test content to ensure that it covers the entire range of the construct's components. This type of validity is not just about what the test looks like on the surface (face validity); it involves a thorough check that the test items representatively sample the domain of interest. Content validity is usually established through expert judgment: subject matter experts evaluate whether the test items adequately cover the construct domain. Do they cover the breadth of the content area (does the instrument contain a representative sample of the content being assessed)? Are they in a format that is appropriate for those using the instrument? A test that is intended to measure the quality of science instruction in fifth grade should cover material from the fifth-grade science course in a manner appropriate for fifth graders. A national science test might not be a valid measure of local science instruction, although it might be a valid measure of national science standards.

Face validity, which some consider an aspect of content validity, is the most superficial measure of validity. It refers to whether a test appears to measure what it is supposed to measure, based on a non-expert's perspective. Face validity is about first impressions and is not a rigorous assessment of the test's validity. Instead, it considers whether the test seems valid to those taking it or to other laypeople. While not a scientific measure of validity, face validity can be important for participant engagement and motivation; a test that lacks face validity might be met with skepticism by participants.
Criterion-Related Evidence
Criterion-related evidence is collected by comparing the instrument with some future or current criterion, hence the name criterion-related. The purpose of an instrument dictates whether predictive or concurrent validity is warranted.
· Predictive Validity
If an instrument is purported to measure some future performance, predictive validity should be investigated. A comparison must be made between the instrument and the later behavior it is meant to predict. Suppose a screening test for 5-year-olds is purported to predict success in kindergarten. To investigate predictive validity, one would give the screening instrument to 5-year-olds prior to their entry into kindergarten. The children's kindergarten performance would be assessed at the end of kindergarten, and a correlation would be calculated between the screening instrument scores and the kindergarten performance scores.
· Concurrent Validity
Concurrent validity compares scores on an instrument with current performance on some other measure. Unlike predictive validity, where the second measurement occurs later, concurrent validity requires a second measure at about the same time. Concurrent validity for a science test could be investigated by correlating scores on the test with scores from another established science test taken at about the same time.
Construct-Related Evidence
Construct validity is an ongoing process. It encompasses a broad evaluation of a test's effectiveness in measuring a theoretical construct, integrating various types of evidence to provide a comprehensive assessment of the test's validity; this makes it a central focus in the development and evaluation of psychological measures.
· Convergent Validity
Convergent validity tests whether measures of constructs that are expected to be related are actually related. It's about demonstrating that a test correlates well with other measures of the same construct or closely related constructs. The key point here is that if a test has high convergent validity, then it is strongly correlated with other tests measuring the same thing. For instance, if you have developed a new measure of anxiety, it should correlate highly with other well-established measures of anxiety.
· Discriminant (or Divergent) Validity
Discriminant validity, on the other hand, assesses whether concepts or measurements that are supposed to be unrelated are, in fact, unrelated. This form of validity is demonstrated when a test does not correlate strongly with measures of different constructs. Essentially, it is the opposite of convergent validity. Using the previous example, your new measure of anxiety should not correlate highly with measures of unrelated constructs, such as physical health or intelligence; a low correlation demonstrates discriminant validity.