This module is designed to help your RSSP improve the consistency of team member ratings as they collect implementation data via classroom walkthroughs. Much of the content presented in this session is adapted from the U.S. Department of Education-sponsored report Measuring and Promoting Inter-Rater Agreement of Teacher and Principal Performance Ratings, with additional content taken from A Guideline of Selecting and Reporting Intraclass Correlation Coefficients for Reliability Research, Intraclass correlation – A discussion and demonstration of basic features, and Interrater Reliability: The Kappa Statistic.
The Need for Inter-Rater Agreement
Definition
Inter-rater agreement is the degree to which different raters provide equivalent ratings to the same set of behaviors or conditions. In our RSSP work, inter-rater agreement most frequently refers to the consistency of scores raters assign to teachers during classroom walkthroughs.
Implications
If raters do not score teacher implementation of RSSP focus strategies consistently, the corresponding implementation data will be unreliable and can lead our teams to make faulty conclusions. To demonstrate this phenomenon, consider the following situation:
An RSSP team selects three team members to conduct walkthroughs to assess the implementation fidelity of HQIM in 20 classrooms. The three team members visit each of the 20 classrooms at the same time. After the walkthroughs have been completed and MOY assessment data becomes available, the team creates a chart that plots the average walkthrough score of all team members against the average MOY student score per class.
The scatterplot displays a neutral relationship between HQIM implementation scores and average student test scores per classroom. Given these results, the RSSP team concludes that HQIM implementation is not connected to student test scores in any meaningful way. However, had the RSSP team created scatterplots that displayed each rater's scores against average student test scores, they would have seen the following:
These scatterplots reveal that the relationship between each team member's ratings and average student test scores varies wildly. Rater A's scores were positively associated with average student test scores, rater B's scores had a roughly neutral association with average test scores, and rater C's scores were negatively associated with average student test scores. When the RSSP team averaged the scores of the three raters, they effectively cancelled out rater A's and rater C's scores, leaving rater B's scores as the primary determinant of the correlation between average walkthrough scores and average student test scores.
If ratings of the same classrooms vary widely across raters in your RSSP data, you should be skeptical of the processes used to collect the implementation data. When there is little consistency among raters, it is difficult to discern which rater's scores most accurately captured the degree to which teachers implemented HQIM. When there is consistency among raters, you can have greater confidence that their unified ratings accurately captured HQIM implementation. A lack of consistency among raters places real limitations on our ability to make sense of the data and draw meaningful conclusions.
How to Calculate Inter-Rater Agreement
Calculating Inter-Rater Agreement is the first step to designing implementation data collection processes that produce consistent measures.
There are three primary ways to calculate inter-rater agreement:
Percent Absolute Agreement
Cohen's Kappa
Intra-Class Correlation
Percent Absolute Agreement
Definition
The percentage of times scores matched across raters.
Formula
Observed Rater Agreement/Total Possible Agreement
Formula Explanation
Count the number of times rater scores perfectly matched and divide by the total number of times rater scores could have matched. Multiplying the result by 100 gives the percentage of times rater scores matched.
Interpretation
The Percent Absolute Agreement ranges from 0-100%. A value of 75% is considered the minimum acceptable value, while a score of 90% or above is considered excellent.
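To make the calculation concrete, here is a minimal Python sketch of Percent Absolute Agreement for two raters. The rater names and scores are illustrative, not taken from any RSSP data set; each list is assumed to hold one rater's walkthrough scores in the same classroom order.

```python
# A minimal sketch of Percent Absolute Agreement for two raters.
# The scores below are illustrative only; each list holds one rater's
# walkthrough scores, in the same classroom order.

def percent_absolute_agreement(rater_a, rater_b):
    """Return the percentage of classrooms where both raters gave the same score."""
    matches = sum(1 for a, b in zip(rater_a, rater_b) if a == b)
    return 100 * matches / len(rater_a)

rater_a_scores = [3, 2, 4, 3, 1, 2, 4, 3]
rater_b_scores = [3, 2, 3, 3, 1, 2, 4, 2]

print(percent_absolute_agreement(rater_a_scores, rater_b_scores))
# The raters match on 6 of 8 classrooms, so this prints 75.0 -- the minimum
# acceptable value.
```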
Cohen's Kappa
Definition
Cohen's Kappa is a statistic that measures non-random inter-rater agreement for qualitative categories. Non-random refers to agreement between raters that occurs beyond what we would expect based purely on random chance. Qualitative categories refer to classifications defined by descriptive labels rather than by numbers on a continuous scale.
Formula
(Observed Agreement - Probability of Random Agreement)/(1 - Probability of Random Agreement)
Formula Explanation
When we subtract the probability of random agreement from the observed agreement, we are creating a number that represents the actual amount of agreement beyond random chance. This number becomes the numerator (the top number in the fraction).
When we subtract the probability of random agreement from 1, we are creating a number that represents the total amount of possible agreement beyond random chance. This number becomes the denominator (the bottom number in the fraction).
When we place the amount of observed non-random agreement in the numerator and the amount of total possible non-random agreement in the denominator, we create a proportion that compares how much non-random agreement the raters achieved to how much non-random agreement they could have achieved.
For example, if the observed agreement among RSSP raters is .54 and the probability of random agreement is .33, we would complete the equation for Cohen's Kappa using the following steps:
Step 1: (.54 - .33) / (1 - .33)
Step 2: .21/.67
Step 3: .31
In this example, the observed agreement among RSSP raters beyond random chance is .21 and the total possible agreement among RSSP raters beyond random chance is .67. Because non-random observed agreement is only approximately 1/3 of total possible non-random agreement, Cohen's Kappa is an underwhelming .31.
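The same arithmetic can be expressed in a short Python sketch. The observed-agreement and random-agreement values below are simply the numbers from the worked example; in practice they would be calculated from the raters' scores first.

```python
# A minimal sketch of the Cohen's Kappa formula, using the values from the
# worked example above (observed agreement .54, random agreement .33).

def cohens_kappa(observed_agreement, random_agreement):
    """Agreement beyond chance divided by total possible agreement beyond chance."""
    return (observed_agreement - random_agreement) / (1 - random_agreement)

kappa = cohens_kappa(observed_agreement=0.54, random_agreement=0.33)
print(round(kappa, 2))  # prints 0.31
```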
Interpretation
Cohen's Kappa ranges from less than zero to one (<0 - 1), where
<0 = no agreement beyond random chance
.01 - .2 = very slight agreement beyond random chance
.21 - .4 = fair agreement beyond random chance
.41 - .6 = moderate agreement beyond random chance
.61 - .8 = substantial agreement beyond random chance
.81 - 1.0 = almost complete agreement beyond random chance
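If it is helpful, the scale above can also be written as a small Python helper that translates a Kappa value into its label; the function name is illustrative.

```python
# A small helper that maps a Kappa value to the interpretation scale above.

def interpret_kappa(kappa):
    if kappa <= 0:
        return "no agreement beyond random chance"
    if kappa <= 0.20:
        return "very slight agreement beyond random chance"
    if kappa <= 0.40:
        return "fair agreement beyond random chance"
    if kappa <= 0.60:
        return "moderate agreement beyond random chance"
    if kappa <= 0.80:
        return "substantial agreement beyond random chance"
    return "almost complete agreement beyond random chance"

print(interpret_kappa(0.31))  # fair agreement beyond random chance
```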
Intra-Class Correlation (ICC)
Definition
The intra-class correlation is a statistic that measures the degree to which raters' scores correlate. The ICC answers the question "As one rater's scores go up or down, how closely do the scores of other raters also go up or down?"
The intra-class correlation can be compared to the inter-class correlation. The inter-class correlation conveys the degree to which measurements in different classes of data co-vary (as student attendance rates go up or down, how closely do student test scores also go up or down?). The intra-class correlation conveys the degree to which measurements within the same class of data co-vary. In the description above, the scores of raters belong to the same class.
Formula
Variance among teachers / total variance, OR
Variance among teachers / (variance among teachers + variance among raters)
Formula Explanation
Variance measures how far a set of values spreads around its mean, based on the squared differences between each value and the mean. For example, if a rater gave three teachers walkthrough scores of 2, 3, and 4 respectively, the variance of those scores would be calculated as follows:
Step 1 (find the mean): (2+3+4)/3 = 3
Step 2 (find the difference between each value and the mean): 2 - 3 = -1, 3-3 = 0, 4-3 = 1
Step 3 (square the differences): (-1)^2 = 1, 0^2 = 0, 1^2 = 1
Step 4 (add the squared differences): 1 + 0 + 1 = 2
Step 5 (divide the sum by the number of scores minus one): 2 / 2 = 1 is the variance among teachers.
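The same steps can be written as a short Python sketch; the scores of 2, 3, and 4 are the ones used above.

```python
# A minimal sketch of the variance calculation above for one rater's scores.

scores = [2, 3, 4]

mean = sum(scores) / len(scores)                      # Step 1: mean = 3
differences = [s - mean for s in scores]              # Step 2: -1, 0, 1
squared_differences = [d ** 2 for d in differences]   # Step 3: 1, 0, 1
sum_of_squares = sum(squared_differences)             # Step 4: 2
variance = sum_of_squares / (len(scores) - 1)         # Step 5: 2 / 2 = 1

print(variance)  # prints 1.0
```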
When we conduct RSSP walkthroughs, there are two types of variance we can calculate. We can calculate the variance of scores given to different teachers by the same rater, and we can calculate the variance of scores given to the same teacher by different raters. In an ideal world, we would want the variance of scores given to the same teacher by different raters to be zero. If the variance among raters was zero, that would mean that there were no differences in the scores they assigned to teachers, indicating perfect inter-rater agreement. In reality however, we will almost always have some variance among our raters.
When we divide the variance among teachers by the variance among teachers + the variance among raters, we are creating a proportion that tells us how much of our total variance can be explained by variance among teacher scores. If the variance among teacher scores was 25 and the variance among raters was 0, then we would calculate the ICC as follows:
25/(25 + 0) = 25/25 = 1
Because the variance among raters is zero, the formula instructed us to simply divide the variance among teachers by itself. The value of 1 indicates a perfect correlation among raters.
However, if the variance among teachers was 25 and the variance among raters was 75, then we would calculate the ICC as follows:
25/(25+75) = 25/100 = .25
Because the variance among raters is much larger than the variance among teachers, the formula produced the value .25, indicating a very weak correlation among raters.
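Here is the same simplified formula as a Python sketch, applied to the two scenarios above.

```python
# A minimal sketch of the simplified ICC formula used in the two scenarios above.

def simple_icc(variance_among_teachers, variance_among_raters):
    """Proportion of total variance explained by differences among teachers."""
    return variance_among_teachers / (variance_among_teachers + variance_among_raters)

print(simple_icc(25, 0))   # prints 1.0  -> perfect agreement among raters
print(simple_icc(25, 75))  # prints 0.25 -> very weak agreement among raters
```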
Interpretation
The ICC ranges from 0 - 1, where 0 indicates no correlation (or inter-rater agreement) and 1 indicates perfect correlation (or inter-rater agreement).
Activity
Use the tab titled Intra-Class Correlation Exercise on the Inter-Rater Agreement Spreadsheet to calculate the ICC using the embedded sample data set. Reference the tab titled Intra-Class Correlation Answer Key to check your answers.
*Note: the formula described above is the simplified theoretical ICC formula. There are many ICC formulas, and the choice among them is determined by the specific context and purpose of the data. In this exercise, we will use the formula designed for multiple raters issuing scores for multiple subjects (or in our case, teachers).
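The exact formula used in the spreadsheet is not reproduced here, but one common version of an ICC for multiple raters scoring multiple subjects is the two-way random-effects, single-rater form (ICC(2,1) in Shrout and Fleiss's notation). The Python sketch below shows how it can be computed from a teacher-by-rater matrix of scores; the scores themselves are illustrative, not the spreadsheet data.

```python
import numpy as np

# A sketch of a two-way random-effects, single-rater ICC (ICC(2,1)),
# computed from a teacher-by-rater score matrix. The scores below are
# illustrative only -- they are not the data from the exercise spreadsheet.

def icc_2_1(scores):
    """scores: 2-D array with one row per teacher and one column per rater."""
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    grand_mean = scores.mean()

    # Mean squares from the two-way ANOVA decomposition.
    ms_teachers = k * ((scores.mean(axis=1) - grand_mean) ** 2).sum() / (n - 1)
    ms_raters = n * ((scores.mean(axis=0) - grand_mean) ** 2).sum() / (k - 1)
    ss_total = ((scores - grand_mean) ** 2).sum()
    ms_error = (ss_total
                - ms_teachers * (n - 1)
                - ms_raters * (k - 1)) / ((n - 1) * (k - 1))

    return (ms_teachers - ms_error) / (
        ms_teachers + (k - 1) * ms_error + k * (ms_raters - ms_error) / n)

# Three raters (columns) scoring five teachers (rows) on a 1-4 scale.
sample_scores = [[3, 3, 2],
                 [4, 4, 4],
                 [2, 3, 2],
                 [1, 1, 2],
                 [3, 4, 3]]
print(round(icc_2_1(sample_scores), 2))  # prints 0.78 for this sample data
```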
How to Determine which Inter-Rater Agreement Statistic to Use
When determining which Inter-Rater Agreement statistic should be used, the Center for Educator Compensation Reform provides the following guidance:
Because no one method is best under all circumstances, it is often appropriate to calculate more than one measure. Typically, if there are four or fewer discrete rating levels, Cohen's kappa and the percentage of absolute agreement should both be calculated. If there are a moderate number of performance levels (e.g., 5-9), one could use the ICC as well as the percentage of absolute agreement. If scores are on a continuous scale (decimals/fractions are possible values), then one should always use the ICC to calculate inter-rater agreement.
Additionally, Koo and Li (2016) note that "As a rule of thumb, researchers should try to obtain at least 30 heterogeneous samples and involve at least 3 raters whenever possible when conducting a reliability study." In other words, inter-rater agreement statistics are more accurate when we use at least three raters to observe at least thirty classrooms.
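If it helps to operationalize this guidance, the Python sketch below encodes the decision rule as a small helper. The function name and inputs are illustrative rather than part of any standard library.

```python
# A small helper that encodes the guidance above on choosing a statistic.
# The function name and inputs are illustrative; adapt them to your rubric.

def recommended_statistics(num_rating_levels=None, is_continuous=False):
    """Suggest inter-rater agreement statistics based on the rating scale."""
    if is_continuous:
        return ["Intra-Class Correlation"]
    if num_rating_levels <= 4:
        return ["Cohen's Kappa", "Percent Absolute Agreement"]
    return ["Intra-Class Correlation", "Percent Absolute Agreement"]

print(recommended_statistics(num_rating_levels=4))   # four or fewer levels
print(recommended_statistics(num_rating_levels=6))   # moderate number (5-9)
print(recommended_statistics(is_continuous=True))    # continuous scale
```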
Using Inter-Rater Agreement in Practice
Being able to calculate Inter-Rater Agreement helps us assess the consistency of scores across our raters. However, creating the circumstances within our RSSP districts where we can calculate inter-rater agreement can be challenging. In practice, we can only calculate inter-rater agreement if multiple raters observe the same teachers at the same time. If multiple raters observe the same teachers at the same time, we can be confident that any differences in ratings were caused by inconsistencies among raters. If raters observe the same teachers at different times, we cannot be sure how much of the difference in ratings was caused by inconsistencies among raters and how much was caused by inconsistencies in teacher behavior. For these reasons, the best practice is to have all raters observe the same teachers at the same time.
However, the personnel and time constraints that exist at the school level make coordinating schedules difficult. In situations where having multiple raters observe live classrooms at the same time is impractical, the Center for Educator Compensation Reform recommends the following strategies:
Training Raters Prior to Observations
Training raters prior to observations is an essential practice. Raters need opportunities to develop a unified understanding of the rating criteria, practice observing teachers using the criteria, and receive feedback on the ratings they give.
Raters can be trained while observing live classrooms, with clear expectations established among raters and observed teachers that these preliminary observations are meant to help raters develop consistency in their ratings. After each observation, raters and the rater-trainer (i.e., the Data Fellow in most cases) can compare ratings, discuss discrepancies, and make changes in how raters issue scores to improve inter-rater agreement. The Data Fellow can also calculate the relevant inter-rater agreement statistics after each observation to assess consistency among raters. Once raters are capable of regularly producing sufficiently high inter-rater agreement statistics, they can be sent to conduct observations of different teachers at different times.
Raters can also be trained through the use of classroom recordings. Using this method enables the RSSP team to simply work with teachers to record their classes and schedule time for raters to simultaneously watch and rate the recordings. After each recording, raters and trainers should compare ratings, discuss discrepancies, and make changes to improve consistency. The Data Fellow can use the relevant inter-rater agreement statistics to determine when raters are ready to conduct observations in classrooms.
Importantly, the Center for Educator Compensation Reform indicates that training needs to last longer than an hour or two to be effective: researchers have found short training sessions to be ineffective at calibration and unlikely to produce consistent results.
Use Recordings in Lieu of In-Person Observations
In addition to using recordings to train raters, RSSP teams can use classroom recordings as the primary means of conducting observations. This simplifies the scheduling process and enables RSSP team members to evaluate classroom recordings synchronously or asynchronously. Because raters observe the same recordings for each teacher, the Data Fellow can calculate inter-rater agreement statistics that accurately reflect the discrepancy among reviewers.
Test the Observation Rubric Prior to Formal Implementation
The evaluation rubric should have clear criteria that are relevant to the practice being observed, easy for raters to understand, and simple for raters to apply. The more difficult the criteria are to understand and utilize, the less consistent raters will be.
An effective way to create a rubric that is relevant, clear, and usable is to get rapid feedback from raters, teachers, and school leaders. Ask teachers and leaders to review the rubric and provide specific feedback on the relevance of the criteria in evaluating instruction. Have raters conduct observations with the sole purpose of testing the rubric to ensure they understand how to apply the criteria to the situations they observe in the classroom. Implement the feedback received from teachers, leaders, and raters and then re-test the rubric by re-using the same processes for soliciting feedback. Continue the cycle until all problems with the rubric have been addressed. Undergoing this process of rapid iteration will dramatically lower the chances of having to adjust the rubric after formal observations have begun.
Final Thoughts
This module was designed to help improve the consistency of classroom walkthrough data. Creating observation rubrics, training raters, creating observation schedules, and calculating inter-rater agreement statistics is difficult work. However, doing so increases your confidence in the reliability of your RSSP implementation data. Collecting reliable data is foundational. There is little point in analyzing data you cannot trust.
We wish you the best of luck as you work to implement these practices within your district!
Congratulations on completing the module. Please complete the Exit Ticket form by clicking on the link above. We will use the information you submit to track your completion.