Reliability is a crucial aspect of any instrument that measures something. An instrument is reliable if it consistently measures whatever it is designed to measure. For instance, if a person stands on a scale and the scale reads 15 pounds, they would expect to get the same reading if they were to step off and stand on the scale again. This consistency indicates that the scale is producing reliable results. However, the validity of these results is a separate matter, which I discuss later. It is important to remember that an instrument cannot be valid if it is not reliable.
There are three primary categories of reliability for most instruments: test-retest, equivalent form, and internal consistency. Each category measures consistency in a unique way, and not all instruments need to meet the requirements of each category. Test-retest reliability measures consistency over time, equivalent-form reliability measures consistency between different versions of an instrument, and internal consistency measures consistency within the same instrument (i.e., consistency among the questions). A fourth category, scorer agreement, is often used in performance and product assessments to measure consistency in rating a performance or product among different judges.
Generally speaking, longer tests tend to be more reliable, up to a certain point. For research purposes, instruments used to measure attitudes are generally expected to have a reliability of at least .70, although some researchers argue for higher values. A reliability of .70 indicates that roughly 70% of the variance in the scores reflects true differences among subjects rather than measurement error. Many tests, such as achievement tests, strive for reliabilities of .90 or higher.
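To illustrate the effect of test length, the Spearman-Brown prophecy formula (which appears again under the split-half method below) predicts how reliability changes when a test is lengthened with comparable items. The short Python sketch below uses made-up numbers purely for illustration.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when a test is lengthened by `length_factor`
    (e.g., 2.0 means doubling the number of comparable items)."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# Starting from an illustrative reliability of .70:
print(spearman_brown(0.70, 2.0))  # doubling the test  -> about .82
print(spearman_brown(0.70, 4.0))  # quadrupling it     -> about .90
```

Notice that each additional block of items buys less improvement, which is why lengthening a test helps only up to a point.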
Test-Retest Method
The test-retest method involves administering the same instrument twice to the same group of people and measuring the correlation between the scores on the two administrations. An instrument that measures consistently over time should produce highly correlated scores. However, determining the appropriate time gap between the two administrations can be tricky. The gap should be long enough that the subjects do not remember their initial responses, but not so long that their knowledge or skills have changed. Generally, a few weeks to a few months is appropriate, depending on the subject matter being measured. For example, it would not be wise to wait two months to investigate the reliability of a mathematics skills test, as the subjects may have gained additional skills during that time, leading to different scores.
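As a minimal sketch of the calculation, suppose the same ten subjects' scores from the two administrations are stored in two parallel lists (the numbers below are invented). The test-retest coefficient is simply the Pearson correlation between the lists; the equivalent-form coefficient described in the next section is computed the same way, with the lists holding scores from the two versions of the instrument.

```python
import numpy as np

# Hypothetical scores for the same ten subjects on two administrations.
first_administration  = [12, 15, 9, 20, 17, 11, 14, 18, 10, 16]
second_administration = [13, 14, 10, 19, 18, 11, 15, 17, 9, 16]

# Test-retest reliability is the Pearson correlation between the two score sets.
r = np.corrcoef(first_administration, second_administration)[0, 1]
print(f"Test-retest reliability: {r:.2f}")
```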
Equivalent Form Method
The equivalent-form method involves creating two different versions of the instrument, both of which measure the same thing. The same subjects then complete both versions during the same time period, and the scores on the two versions are correlated to determine the consistency between the two forms of the instrument.
Internal-Consistency Method
The internal-consistency method involves measuring consistency within a single instrument. There are several different approaches.
· Internal-Consistency Split-Half
One common approach is the split-half method, which involves correlating the total score on the odd-numbered questions with the total score on the even-numbered questions (or the first half with the second half). This method is often used with dichotomous items that are scored as 0 for incorrect and 1 for correct. Because the resulting correlation is based on two half-length tests, the Spearman-Brown prophecy formula is applied to it to estimate the reliability of the full-length instrument.
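A minimal sketch of the split-half calculation, assuming dichotomous responses arranged as a subjects-by-items matrix (the data below are invented for illustration):

```python
import numpy as np

# Hypothetical 0/1 responses: rows are subjects, columns are items.
items = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 0, 1],
    [1, 0, 1, 1, 1, 0, 1, 1],
    [1, 1, 0, 1, 0, 1, 1, 0],
])

odd_half  = items[:, 0::2].sum(axis=1)   # total on items 1, 3, 5, 7
even_half = items[:, 1::2].sum(axis=1)   # total on items 2, 4, 6, 8

r_half = np.corrcoef(odd_half, even_half)[0, 1]

# Spearman-Brown correction: the correlation between two half-length tests
# underestimates the full-length reliability, so it is stepped up.
reliability = (2 * r_half) / (1 + r_half)
print(f"Split-half r = {r_half:.2f}, corrected reliability = {reliability:.2f}")
```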
· Internal Consistency Kuder-Richardson Formula 20 (K-R 20) and Kuder-Richardson Formula 21 (K-R 21)
Kuder-Richardson Formula 20 (K-R 20) and Kuder-Richardson Formula 21 (K-R 21) are two alternative formulas used to calculate the consistency of subject responses among the questions on an instrument. These formulas require that items on the instrument be dichotomously scored (0 for incorrect and 1 for correct), and all items are compared with each other. Mathematically, the Kuder-Richardson reliability coefficient is actually the mean of all split-half coefficients resulting from different splittings of a test, provided the Rulon formula is used. K-R 21 assumes that all questions are equally difficult, whereas K-R 20 does not make this assumption.
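A minimal sketch of both formulas, again assuming a subjects-by-items matrix of 0/1 scores. Note that texts differ on whether the total-score variance is computed as a population or sample variance; the population variance is used here, so different programs may report slightly different values.

```python
import numpy as np

def kr20(items):
    """K-R 20 for a subjects-by-items matrix of 0/1 scores."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                     # number of items
    p = items.mean(axis=0)                 # proportion answering each item correctly
    q = 1 - p
    total_var = items.sum(axis=1).var()    # population variance of the total scores
    return (k / (k - 1)) * (1 - (p * q).sum() / total_var)

def kr21(items):
    """K-R 21: like K-R 20 but assumes all items are equally difficult."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    totals = items.sum(axis=1)
    mean, total_var = totals.mean(), totals.var()
    return (k / (k - 1)) * (1 - (mean * (k - mean)) / (k * total_var))

# Hypothetical 0/1 responses (rows = subjects, columns = items).
data = [[1, 1, 0, 1, 1],
        [0, 1, 0, 0, 1],
        [1, 1, 1, 1, 1],
        [0, 0, 0, 1, 0],
        [1, 0, 1, 1, 1]]
print(f"K-R 20: {kr20(data):.2f}   K-R 21: {kr21(data):.2f}")
```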
· Internal Consistency Cronbach’s Alpha
Cronbach's alpha is another measure of internal consistency that is often used when items on an instrument are not scored as right versus wrong. This is commonly the case with attitude instruments that use a Likert scale. Cronbach's alpha is typically calculated with statistical software. Although it is usually used for scores that fall along a continuum, it produces the same result as K-R 20 with dichotomous data (0 or 1).
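A minimal sketch of the calculation, assuming Likert-type responses arranged as a subjects-by-items matrix (the responses are invented; applied to 0/1 data, this function gives the same value as the K-R 20 sketch above):

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for a subjects-by-items matrix of scores."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of the total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical responses from six subjects to four 5-point Likert items.
responses = [[4, 5, 4, 4],
             [2, 3, 2, 3],
             [5, 5, 4, 5],
             [3, 2, 3, 3],
             [4, 4, 5, 4],
             [1, 2, 2, 1]]
print(f"Cronbach's alpha: {cronbach_alpha(responses):.2f}")
```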
I created an Excel spreadsheet that will calculate Spearman-Brown, K-R 20, K-R 21, and Cronbach's alpha. The spreadsheet will handle data for a maximum of 1000 subjects with up to 100 responses each.
Scoring Agreement
Performance and product assessments are often based on scores assigned by individuals who are trained to evaluate the performance or product. The consistency among their ratings can be calculated in a variety of ways.
· Inter-Rater Reliability
At its simplest, two judges can evaluate a group of student products, and the correlation between their ratings can be calculated (r = .90 is a common cutoff). More rigorously, inter-rater reliability assesses the degree of agreement among raters while taking into account the possibility that agreement could occur by chance. There are several statistical methods for calculating inter-rater reliability, with Cohen's kappa being one of the most common for two raters and Fleiss' kappa or the intraclass correlation coefficient (ICC) for more than two raters. These statistics not only consider the percentage of agreement but also correct for the agreement that would be expected purely by chance. This is particularly important in situations where there is a high likelihood of chance agreement due to the distribution of categories or the prevalence of certain outcomes.
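A minimal sketch of Cohen's kappa for two raters, using invented pass/fail ratings purely for illustration:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters' categorical ratings of the same items."""
    n = len(rater_a)
    # Observed agreement: proportion of items the raters rated identically.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement, based on each rater's category frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical pass/fail ratings of ten student products by two judges.
judge_1 = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
judge_2 = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass", "pass", "pass"]
print(f"Cohen's kappa: {cohens_kappa(judge_1, judge_2):.2f}")
```

On these invented ratings the judges agree on 8 of the 10 products, yet kappa comes out near .47, because "pass" is the dominant category and a fair amount of that agreement would be expected by chance.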
· Percentage Agreement
Percentage agreement is a simpler and more intuitive measure than chance-corrected statistics such as kappa. It is calculated by dividing the number of instances where the raters agree by the total number of instances, then multiplying by 100 to get a percentage. For example, two judges can evaluate a group of products and the percentage of times they agree is calculated (80% is a common cutoff). This method does not account for the possibility of chance agreement; it simply reflects the proportion of times the raters gave the same rating or made the same decision. Because of its simplicity, percentage agreement is easy to calculate and interpret, but it may not provide a complete picture of reliability.
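A minimal sketch of the arithmetic, reusing the invented judges' ratings from the kappa example above:

```python
def percentage_agreement(rater_a, rater_b):
    """Percentage of items on which two raters gave the same rating."""
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return 100 * matches / len(rater_a)

judge_1 = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
judge_2 = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass", "pass", "pass"]
print(f"Percentage agreement: {percentage_agreement(judge_1, judge_2):.0f}%")  # 80%
```

Here the judges agree 80% of the time, which clears the common cutoff, even though the chance-corrected kappa for the same ratings is only about .47.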
---------
All scores contain error. The error is what lowers an instrument's reliability.
Obtained Score = True Score + Error Score
----------
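If the true and error components are assumed to be uncorrelated, as in classical test theory, the variances of the two components add in the same way, and reliability can be expressed as the proportion of obtained-score variance that is true-score variance. This is why a reliability of .70 is read as roughly 70% of the score variance reflecting true differences rather than error:

Obtained Score Variance = True Score Variance + Error Score Variance

Reliability = True Score Variance / Obtained Score Variance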
There could be a number of reasons why the reliability estimate for a measure is low. Four common sources of inconsistency in test scores are listed below:
Test Taker -- perhaps the subject is having a bad day
Test Itself -- the questions on the instrument may be unclear
Testing Conditions -- there may be distractions during the testing that divert the subject's attention
Test Scoring -- scorers may be applying different standards when evaluating the subjects' responses