Inter-rater Agreement

Inter-rater Agreement studies are based on Kappa-type statistics. 

There are several Kappa statistics, each designed for a particular situation. Inter-rater Agreement studies are most often based on Fleiss' Kappa. For a study with only two raters, Cohen's Kappa is available.

What is the advantage of Fleiss' Kappa?

Fleiss' Kappa has two advantages. Unlike the standard Cohen's Kappa, it works for more than two raters. Unlike a simple percentage-agreement calculation, it takes into account the amount of agreement that can be expected by chance.
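
As an illustration of how the chance correction works, the following is a minimal sketch of the usual Fleiss' Kappa calculation, assuming every specimen is rated by the same number of raters. The function name `fleiss_kappa` and the small data set are hypothetical and are not taken from the references on this page.

```python
def fleiss_kappa(counts):
    """Fleiss' Kappa from a count table.

    counts[i][j] = number of raters who assigned specimen i to category j.
    Assumes every specimen received the same number of ratings.
    """
    n_specimens = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every specimen needs the same number of ratings")
    n_categories = len(counts[0])
    total = n_specimens * n_raters

    # Share of all ratings that fell into each category.
    p_cat = [sum(row[j] for row in counts) / total for j in range(n_categories)]

    # Observed agreement: proportion of agreeing rater pairs, per specimen.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_i) / n_specimens

    # Agreement expected by chance alone.
    p_chance = sum(p * p for p in p_cat)
    if p_chance == 1:
        raise ValueError("Kappa is undefined when all ratings are in one category")

    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical example: 5 specimens, 3 raters, categories (good, poor).
example = [[3, 0], [2, 1], [0, 3], [1, 2], [3, 0]]
kappa = fleiss_kappa(example)
print(round(kappa, 3))
```

In this made-up example the raw agreement is about 73%, but roughly 52% agreement would be expected by chance, so Kappa is noticeably lower than the percentage agreement suggests.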

Scales available for use with Fleiss' Kappa

Several scales are available for subjectively interpreting the strength of agreement indicated by a Fleiss' Kappa value.

For a six-level scale, see:
  • "(citizendium) Fleiss kappa.pdf" (page 5)
  • "Landis Koch Observer Agreement.pdf" (page 165)

For a three-level scale, see:
  • "Fleiss Text (Interrater).pdf" (page 604)
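
For quick reference, the sketch below encodes the cutoffs as they are commonly quoted for the Landis and Koch six-level scale and the Fleiss three-level scale. The attachments themselves should be treated as the authority. The function names are hypothetical, and the handling of values that fall exactly on a boundary is an assumption.

```python
def landis_koch_label(kappa):
    """Six-level scale as commonly quoted from Landis and Koch."""
    if kappa < 0.0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


def fleiss_label(kappa):
    """Three-level scale as commonly quoted from Fleiss."""
    if kappa < 0.40:
        return "poor"
    if kappa <= 0.75:
        return "fair to good"
    return "excellent"


for k in (0.15, 0.44, 0.85):
    print(k, landis_koch_label(k), "/", fleiss_label(k))
```

These labels are subjective benchmarks, not statistical tests, so the same Kappa value can read differently depending on which scale is chosen.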

Tests of Hypotheses

When interpreting the meaning of the p-value that is reported for the Kappa statistic, the null hypothesis is that the value of the Kappa statistic is zero.
  • A p-value that is less than the chosen alpha (generally 0.05) leads us to reject the null hypothesis that the value of Kappa is zero.
  • When the null hypothesis is rejected and Kappa is greater than zero, the conclusion is that the amount of agreement is more than can be explained by chance alone.

Several situations are possible:
  • Small value of Kappa that is not statistically significant.  (Cannot conclude that Kappa is not equal to zero.)
  • Small value of Kappa that is statistically significant.  (There is more agreement than can be explained only by chance, but not much more.)
  • Large value of Kappa that is not statistically significant.  (This probably means that the standard error for the Kappa statistic is large.)
  • Large value of Kappa that is statistically significant.  (Very good agreement that is much more than can be explained by chance alone.)
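
The four situations above follow from the fact that significance depends on Kappa relative to its standard error, not on Kappa alone. The sketch below assumes the common large-sample approach of dividing Kappa by its standard error under the null hypothesis and referring the result to a standard normal distribution, with a one-sided alternative (Kappa greater than zero). The function name and the input values are hypothetical, and the standard error is taken as given by the analysis software.

```python
import math


def kappa_z_test(kappa, std_error):
    """Approximate one-sided z-test of H0: Kappa = 0 vs H1: Kappa > 0.

    std_error is the standard error of Kappa under H0, as reported
    by the analysis software (assumed, not computed here).
    """
    z = kappa / std_error
    # Upper-tail probability of the standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


alpha = 0.05

# Hypothetical large Kappa with a large standard error (small study).
z_large, p_large = kappa_z_test(0.70, 0.45)

# Hypothetical small Kappa with a small standard error (large study).
z_small, p_small = kappa_z_test(0.15, 0.05)

print(f"Kappa=0.70: z={z_large:.2f}, significant={p_large < alpha}")
print(f"Kappa=0.15: z={z_small:.2f}, significant={p_small < alpha}")
```

With these made-up inputs, the large Kappa is not significant while the small Kappa is, which is why the size of Kappa and its p-value should be read together.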

Fleiss' Kappa values that can be designed into the study include:
  • Agreement with self and other raters
      • Inter-rater agreement of the evaluators as a group
      • Intra-rater agreement of each individual evaluator (the ability of a specific rater to agree with him/herself)
  • Agreement with a standard
      • Agreement of all evaluators (as a group) with the standard
      • Agreement of each individual evaluator with the standard
  • If several types of nonconformities are present in the specimens, agreement can be assessed for each type of nonconformity.

Relationship to Logistic Regression

Fleiss' Kappa and Logistic Regression are not directly related, but Logistic Regression can sometimes be used as a supplementary method when performing an Attribute MSA study, such as comparing results between sites.

Logistic regression may also be applied for purposes such as:
  • Comparison of results between raters/evaluators
  • Comparison of results between facilities
  • Comparison of results between test equipment
  • Comparison of results between other grouping variables or strata


Fleiss' Kappa and Bias

Poor agreement with a known standard (known 'good' and 'poor' specimens, or another type of standard) is an indicator of potential bias.

Having all raters as a group in agreement with a standard is good.
Having one or more individual raters with poor agreement with a standard can be an informative diagnostic tool to improve the test method.
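
As a rough sketch of this diagnostic, each rater can be compared with the standard one at a time. Because a rater and the standard form a two-way comparison, the example below uses Cohen's Kappa rather than Fleiss' Kappa. That choice, the function name, and the data are assumptions made for illustration, with one rating per specimen per rater.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's Kappa for two equal-length sequences of category labels."""
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    categories = set(ratings_a) | set(ratings_b)
    # Chance agreement from each side's marginal category frequencies.
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical known standard for 8 specimens.
standard = ["good", "good", "poor", "poor", "good", "poor", "good", "poor"]

# Hypothetical ratings; rater_B calls two 'poor' specimens 'good'.
ratings = {
    "rater_A": ["good", "good", "poor", "poor", "good", "poor", "good", "poor"],
    "rater_B": ["good", "good", "good", "poor", "good", "good", "good", "poor"],
}

kappas = {name: cohen_kappa(r, standard) for name, r in ratings.items()}
for name, value in kappas.items():
    print(name, round(value, 2))
```

In this made-up data, rater_B's disagreements all run in one direction (accepting 'poor' specimens), which is the kind of pattern that points to bias rather than random error.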


The value of the Fleiss' Kappa statistic for all raters as a group is a useful measure of Reproducibility.

It is also helpful if each rater can be shown to agree with themselves over multiple ratings of each specimen. This within-rater agreement corresponds to Repeatability.


The analysis assumes that observations are independent Bernoulli events.


Several references are included in the attachments at the bottom of this page.

