Problem

You want to calculate inter-rater reliability.

Solution

The method for calculating inter-rater reliability will depend on the type of data (categorical, ordinal, or continuous) and the number of coders.

Categorical data

Suppose this is your data set. It consists of 30 cases, rated by three coders. It is a subset of the diagnoses data set in the irr package.

library(irr)
#> Loading required package: lpSolve

data(diagnoses)
dat <- diagnoses[,1:3]
#                  rater1                  rater2                  rater3
#             4. Neurosis             4. Neurosis             4. Neurosis
# 2. Personality Disorder 2. Personality Disorder 2. Personality Disorder
# 2. Personality Disorder        3. Schizophrenia        3. Schizophrenia
#                5. Other                5. Other                5. Other
# 2. Personality Disorder 2. Personality Disorder 2. Personality Disorder
#           1. Depression           1. Depression        3. Schizophrenia
#        3. Schizophrenia        3. Schizophrenia        3. Schizophrenia
#           1. Depression           1. Depression        3. Schizophrenia
#           1. Depression           1. Depression             4. Neurosis
#                5. Other                5. Other                5. Other
#           1. Depression             4. Neurosis             4. Neurosis
#           1. Depression 2. Personality Disorder             4. Neurosis
# 2. Personality Disorder 2. Personality Disorder 2. Personality Disorder
#           1. Depression             4. Neurosis             4. Neurosis
# 2. Personality Disorder 2. Personality Disorder             4. Neurosis
#        3. Schizophrenia        3. Schizophrenia        3. Schizophrenia
#           1. Depression           1. Depression           1. Depression
#           1. Depression           1. Depression           1. Depression
# 2. Personality Disorder 2. Personality Disorder             4. Neurosis
#           1. Depression        3. Schizophrenia        3. Schizophrenia
#                5. Other                5. Other                5. Other
# 2. Personality Disorder             4. Neurosis             4. Neurosis
# 2. Personality Disorder 2. Personality Disorder             4. Neurosis
#           1. Depression           1. Depression             4. Neurosis
#           1. Depression             4. Neurosis             4. Neurosis
# 2. Personality Disorder 2. Personality Disorder 2. Personality Disorder
#           1. Depression           1. Depression           1. Depression
# 2. Personality Disorder 2. Personality Disorder             4. Neurosis
#           1. Depression        3. Schizophrenia        3. Schizophrenia
#                5. Other                5. Other                5. Other

Two raters: Cohen’s Kappa

This calculates Cohen's Kappa for two coders, in this case raters 1 and 2.

kappa2(dat[,c(1,2)], "unweighted")
#>  Cohen's Kappa for 2 Raters (Weights: unweighted)
#> 
#>  Subjects = 30 
#>    Raters = 2 
#>     Kappa = 0.651 
#> 
#>         z = 7 
#>   p-value = 2.63e-12
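
The same call works for any pair of coders; just select a different pair of columns. A minimal sketch for the remaining pairings (output not shown):

# Raters 1 and 3
kappa2(dat[,c(1,3)], "unweighted")

# Raters 2 and 3
kappa2(dat[,c(2,3)], "unweighted")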

N raters: Fleiss’s Kappa, Conger’s Kappa

If there are more than two raters, use Fleiss’s Kappa.

kappam.fleiss(dat)
#>  Fleiss' Kappa for m Raters
#> 
#>  Subjects = 30 
#>    Raters = 3 
#>     Kappa = 0.534 
#> 
#>         z = 9.89 
#>   p-value = 0

It is also possible to use Conger’s (1980) exact Kappa. (Note that it is not clear to me when it is better or worse to use the exact method.)

kappam.fleiss(dat, exact=TRUE)
#>  Fleiss' Kappa for m Raters (exact value)
#> 
#>  Subjects = 30 
#>    Raters = 3 
#>     Kappa = 0.55

Ordinal data: weighted Kappa

If the data is ordinal, it may be appropriate to use a weighted Kappa. For example, if the possible values are low, medium, and high, then a case rated medium by one coder and high by the other reflects better agreement than a case rated low and high.
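
To make the idea concrete, here is a minimal sketch with made-up ratings, coding the categories as 1 = low, 2 = medium, and 3 = high (the toy data and the expected direction of the difference are illustrative assumptions, not output from the recipe's data sets):

library(irr)

# Hypothetical ratings from two coders: 1 = low, 2 = medium, 3 = high
toy <- data.frame(
  rater1 = c(1, 2, 3, 2, 3, 1, 2, 3, 1, 2),
  rater2 = c(1, 3, 3, 2, 2, 1, 2, 3, 2, 2)
)

# Unweighted: a medium/high disagreement counts the same as a low/high one
kappa2(toy, "unweighted")

# Linear weights ("equal"): adjacent-category disagreements are penalized less,
# so the weighted value tends to be higher when most disagreements fall on
# neighboring categories
kappa2(toy, "equal")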

We will use a subset of the anxiety data set from the irr package.

library(irr)
data(anxiety)

dfa <- anxiety[,c(1,2)]
dfa
#>    rater1 rater2
#> 1       3      3
#> 2       3      6
#> 3       3      4
#> 4       4      6
#> 5       5      2
#> 6       5      4
#> 7       2      2
#> 8       3      4
#> 9       5      3
#> 10      2      3
#> 11      2      2
#> 12      6      3
#> 13      1      3
#> 14      5      3
#> 15      2      2
#> 16      2      2
#> 17      1      1
#> 18      2      3
#> 19      4      3
#> 20      3      4

The weighted Kappa calculation requires exactly two raters, and can use either linear ("equal") or squared weights for the differences between ratings.

# Compare raters 1 and 2 with squared weights
kappa2(dfa, "squared")
#>  Cohen's Kappa for 2 Raters (Weights: squared)
#> 
#>  Subjects = 20 
#>    Raters = 2 
#>     Kappa = 0.297 
#> 
#>         z = 1.34 
#>   p-value = 0.18


# Use linear weights
kappa2(dfa, "equal")
#>  Cohen's Kappa for 2 Raters (Weights: equal)
#> 
#>  Subjects = 20 
#>    Raters = 2 
#>     Kappa = 0.189 
#> 
#>         z = 1.42 
#>   p-value = 0.157

Compare the results above to the unweighted calculation (used for the categorical data above), which treats all disagreements the same:

kappa2(dfa, "unweighted")
#>  Cohen's Kappa for 2 Raters (Weights: unweighted)
#> 
#>  Subjects = 20 
#>    Raters = 2 
#>     Kappa = 0.119 
#> 
#>         z = 1.16 
#>   p-value = 0.245

Weighted Kappa with factors

The data above is numeric, but a weighted Kappa can also be calculated for factors. Note that the factor levels must be in the correct order, or results will be wrong.

# Make a factor-ized version of the data
dfa2 <- dfa
dfa2$rater1 <- factor(dfa2$rater1, levels=1:6, labels=LETTERS[1:6])
dfa2$rater2 <- factor(dfa2$rater2, levels=1:6, labels=LETTERS[1:6])
dfa2
#>    rater1 rater2
#> 1       C      C
#> 2       C      F
#> 3       C      D
#> 4       D      F
#> 5       E      B
#> 6       E      D
#> 7       B      B
#> 8       C      D
#> 9       E      C
#> 10      B      C
#> 11      B      B
#> 12      F      C
#> 13      A      C
#> 14      E      C
#> 15      B      B
#> 16      B      B
#> 17      A      A
#> 18      B      C
#> 19      D      C
#> 20      C      D

# The factor levels must be in the correct order:
levels(dfa2$rater1)
#> [1] "A" "B" "C" "D" "E" "F"
levels(dfa2$rater2)
#> [1] "A" "B" "C" "D" "E" "F"


# The results are the same as with the numeric data, above
kappa2(dfa2, "squared")
#>  Cohen's Kappa for 2 Raters (Weights: squared)
#> 
#>  Subjects = 20 
#>    Raters = 2 
#>     Kappa = 0.297 
#> 
#>         z = 1.34 
#>   p-value = 0.18


# Use linear weights
kappa2(dfa2, "equal")
#>  Cohen's Kappa for 2 Raters (Weights: equal)
#> 
#>  Subjects = 20 
#>    Raters = 2 
#>     Kappa = 0.189 
#> 
#>         z = 1.42 
#>   p-value = 0.157

Continuous data: Intraclass correlation coefficient

When the variable is continuous, the intraclass correlation coefficient should be computed. From the documentation for icc:

When considering which form of ICC is appropriate for an actual set of data, one has to take several decisions (Shrout & Fleiss, 1979):

  1. Should only the subjects be considered as random effects ("oneway" model, default) or are subjects and raters randomly chosen from a bigger pool of persons ("twoway" model)?
  2. If differences in judges’ mean ratings are of interest, interrater "agreement" instead of "consistency" (default) should be computed.
  3. If the unit of analysis is a mean of several ratings, unit should be changed to "average". In most cases, however, single values (unit="single", default) are regarded.

We will use the anxiety data set from the irr package.

library(irr)
data(anxiety)
anxiety
#>    rater1 rater2 rater3
#> 1       3      3      2
#> 2       3      6      1
#> 3       3      4      4
#> 4       4      6      4
#> 5       5      2      3
#> 6       5      4      2
#> 7       2      2      1
#> 8       3      4      6
#> 9       5      3      1
#> 10      2      3      1
#> 11      2      2      1
#> 12      6      3      2
#> 13      1      3      3
#> 14      5      3      3
#> 15      2      2      1
#> 16      2      2      1
#> 17      1      1      3
#> 18      2      3      3
#> 19      4      3      2
#> 20      3      4      2

# Just one of the many possible ICC coefficients
icc(anxiety, model="twoway", type="agreement")
#>  Single Score Intraclass Correlation
#> 
#>    Model: twoway 
#>    Type : agreement 
#> 
#>    Subjects = 20 
#>      Raters = 3 
#>    ICC(A,1) = 0.198
#> 
#>  F-Test, H0: r0 = 0 ; H1: r0 > 0 
#>  F(19,39.7) = 1.83 , p = 0.0543 
#> 
#>  95%-Confidence Interval for ICC Population Values:
#>   -0.039 < ICC < 0.494
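
The call above is just one combination of the three decisions quoted from the icc documentation. As a sketch (output not shown), other variants can be requested by changing the model, type, and unit arguments:

# One-way model: subjects are the only random effect (the default model)
icc(anxiety, model="oneway")

# Two-way model, consistency ICC, for the average of the three ratings
icc(anxiety, model="twoway", type="consistency", unit="average")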