## Problem

You want to calculate inter-rater reliability.

## Solution

The method for calculating inter-rater reliability will depend on the type of data (categorical, ordinal, or continuous) and the number of coders.
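All of the methods below use functions from the irr package. As a rough quick-reference sketch (the details follow in each section):

``````
library(irr)

# Categorical data, two raters:        kappa2()         (Cohen's Kappa)
# Categorical data, three or more:     kappam.fleiss()  (Fleiss's Kappa)
# Ordinal data, two raters:            kappa2()         with "equal" or "squared" weights
# Continuous data:                     icc()            (intraclass correlation coefficient)
``````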

### Categorical data

Suppose this is your data set. It consists of 30 cases, rated by three coders. It is a subset of the `diagnoses` data set in the irr package.

``````
library(irr)

data(diagnoses)
dat <- diagnoses[,1:3]
dat
#                  rater1                  rater2                  rater3
#             4. Neurosis             4. Neurosis             4. Neurosis
# 2. Personality Disorder 2. Personality Disorder 2. Personality Disorder
# 2. Personality Disorder        3. Schizophrenia        3. Schizophrenia
#                5. Other                5. Other                5. Other
# 2. Personality Disorder 2. Personality Disorder 2. Personality Disorder
#           1. Depression           1. Depression        3. Schizophrenia
#        3. Schizophrenia        3. Schizophrenia        3. Schizophrenia
#           1. Depression           1. Depression        3. Schizophrenia
#           1. Depression           1. Depression             4. Neurosis
#                5. Other                5. Other                5. Other
#           1. Depression             4. Neurosis             4. Neurosis
#           1. Depression 2. Personality Disorder             4. Neurosis
# 2. Personality Disorder 2. Personality Disorder 2. Personality Disorder
#           1. Depression             4. Neurosis             4. Neurosis
# 2. Personality Disorder 2. Personality Disorder             4. Neurosis
#        3. Schizophrenia        3. Schizophrenia        3. Schizophrenia
#           1. Depression           1. Depression           1. Depression
#           1. Depression           1. Depression           1. Depression
# 2. Personality Disorder 2. Personality Disorder             4. Neurosis
#           1. Depression        3. Schizophrenia        3. Schizophrenia
#                5. Other                5. Other                5. Other
# 2. Personality Disorder             4. Neurosis             4. Neurosis
# 2. Personality Disorder 2. Personality Disorder             4. Neurosis
#           1. Depression           1. Depression             4. Neurosis
#           1. Depression             4. Neurosis             4. Neurosis
# 2. Personality Disorder 2. Personality Disorder 2. Personality Disorder
#           1. Depression           1. Depression           1. Depression
# 2. Personality Disorder 2. Personality Disorder             4. Neurosis
#           1. Depression        3. Schizophrenia        3. Schizophrenia
#                5. Other                5. Other                5. Other
``````

#### Two raters: Cohen’s Kappa

This calculates Cohen’s Kappa for two coders; in this case, raters 1 and 2.

``````
kappa2(dat[,c(1,2)], "unweighted")
#>  Cohen's Kappa for 2 Raters (Weights: unweighted)
#>
#>  Subjects = 30
#>    Raters = 2
#>     Kappa = 0.651
#>
#>         z = 7
#>   p-value = 2.63e-12
``````
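
The same call works for any pair of columns in the data; for example, raters 1 and 3 could be compared in the same way (output omitted here):

``````
kappa2(dat[,c(1,3)], "unweighted")
``````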

#### N raters: Fleiss’s Kappa, Conger’s Kappa

If there are more than two raters, use Fleiss’s Kappa.

``````
kappam.fleiss(dat)
#>  Fleiss' Kappa for m Raters
#>
#>  Subjects = 30
#>    Raters = 3
#>     Kappa = 0.534
#>
#>         z = 9.89
#>   p-value = 0
``````

It is also possible to use Conger’s (1980) exact Kappa. (Note that it is not clear to me when it is better or worse to use the exact method.)

``````
kappam.fleiss(dat, exact=TRUE)
#>  Fleiss' Kappa for m Raters (exact value)
#>
#>  Subjects = 30
#>    Raters = 3
#>     Kappa = 0.55
``````

### Ordinal data: weighted Kappa

If the data is ordinal, then it may be appropriate to use a weighted Kappa. For example, if the possible values are low, medium, and high, then a case rated medium by one coder and high by the other reflects better agreement than a case rated low by one and high by the other.
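
As a minimal sketch of that idea (made-up toy data, not from the irr package), compare an unweighted and a linearly weighted Kappa on a 3-point scale coded 1 = low, 2 = medium, 3 = high:

``````
library(irr)

# Toy data: two coders rating six cases on a 3-point ordinal scale
toy <- data.frame(coder1 = c(1, 2, 3, 2, 1, 3),
                  coder2 = c(1, 3, 3, 2, 2, 3))

# Unweighted: a medium/high disagreement counts the same as a low/high one
kappa2(toy, "unweighted")

# Linear weights: near-misses (medium vs. high) are penalized less than
# far misses (low vs. high)
kappa2(toy, "equal")
``````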

We will use a subset of the `anxiety` data set from the irr package.

``````
library(irr)
data(anxiety)

dfa <- anxiety[,c(1,2)]
dfa
#>    rater1 rater2
#> 1       3      3
#> 2       3      6
#> 3       3      4
#> 4       4      6
#> 5       5      2
#> 6       5      4
#> 7       2      2
#> 8       3      4
#> 9       5      3
#> 10      2      3
#> 11      2      2
#> 12      6      3
#> 13      1      3
#> 14      5      3
#> 15      2      2
#> 16      2      2
#> 17      1      1
#> 18      2      3
#> 19      4      3
#> 20      3      4
``````

The weighted Kappa calculation must be made with exactly two raters, and can use either linear or squared weights for the differences between ratings.

``````
# Compare raters 1 and 2 with squared weights
kappa2(dfa, "squared")
#>  Cohen's Kappa for 2 Raters (Weights: squared)
#>
#>  Subjects = 20
#>    Raters = 2
#>     Kappa = 0.297
#>
#>         z = 1.34
#>   p-value = 0.18

# Use linear weights
kappa2(dfa, "equal")
#>  Cohen's Kappa for 2 Raters (Weights: equal)
#>
#>  Subjects = 20
#>    Raters = 2
#>     Kappa = 0.189
#>
#>         z = 1.42
#>   p-value = 0.157
``````

Compare the results above to the unweighted calculation (used in the tests for non-ordinal data above), which treats all disagreements equally:

``````
kappa2(dfa, "unweighted")
#>  Cohen's Kappa for 2 Raters (Weights: unweighted)
#>
#>  Subjects = 20
#>    Raters = 2
#>     Kappa = 0.119
#>
#>         z = 1.16
#>   p-value = 0.245
``````

#### Weighted Kappa with factors

The data above is numeric, but a weighted Kappa can also be calculated for factors. Note that the factor levels must be in the correct order, or results will be wrong.

``````
# Make a factor-ized version of the data
dfa2 <- dfa
dfa2$rater1 <- factor(dfa2$rater1, levels=1:6, labels=LETTERS[1:6])
dfa2$rater2 <- factor(dfa2$rater2, levels=1:6, labels=LETTERS[1:6])
dfa2
#>    rater1 rater2
#> 1       C      C
#> 2       C      F
#> 3       C      D
#> 4       D      F
#> 5       E      B
#> 6       E      D
#> 7       B      B
#> 8       C      D
#> 9       E      C
#> 10      B      C
#> 11      B      B
#> 12      F      C
#> 13      A      C
#> 14      E      C
#> 15      B      B
#> 16      B      B
#> 17      A      A
#> 18      B      C
#> 19      D      C
#> 20      C      D

# The factor levels must be in the correct order:
levels(dfa2$rater1)
#> [1] "A" "B" "C" "D" "E" "F"
levels(dfa2$rater2)
#> [1] "A" "B" "C" "D" "E" "F"

# The results are the same as with the numeric data, above
kappa2(dfa2, "squared")
#>  Cohen's Kappa for 2 Raters (Weights: squared)
#>
#>  Subjects = 20
#>    Raters = 2
#>     Kappa = 0.297
#>
#>         z = 1.34
#>   p-value = 0.18

# Use linear weights
kappa2(dfa2, "equal")
#>  Cohen's Kappa for 2 Raters (Weights: equal)
#>
#>  Subjects = 20
#>    Raters = 2
#>     Kappa = 0.189
#>
#>         z = 1.42
#>   p-value = 0.157
``````

### Continuous data: Intraclass correlation coefficient

When the variable is continuous, the intraclass correlation coefficient should be computed. From the documentation for `icc`:

When considering which form of ICC is appropriate for an actual set of data, one has to make several decisions (Shrout & Fleiss, 1979):

1. Should only the subjects be considered as random effects (`"oneway"` model, the default), or are both subjects and raters drawn randomly from a bigger pool of persons (`"twoway"` model)?
2. If differences in the judges’ mean ratings are of interest, interrater `"agreement"` should be computed instead of `"consistency"` (the default).
3. If the unit of analysis is a mean of several ratings, `unit` should be changed to `"average"`. In most cases, however, single ratings (`unit="single"`, the default) are the unit of analysis.

We will use the `anxiety` data set from the irr package.

``````
library(irr)
data(anxiety)
anxiety
#>    rater1 rater2 rater3
#> 1       3      3      2
#> 2       3      6      1
#> 3       3      4      4
#> 4       4      6      4
#> 5       5      2      3
#> 6       5      4      2
#> 7       2      2      1
#> 8       3      4      6
#> 9       5      3      1
#> 10      2      3      1
#> 11      2      2      1
#> 12      6      3      2
#> 13      1      3      3
#> 14      5      3      3
#> 15      2      2      1
#> 16      2      2      1
#> 17      1      1      3
#> 18      2      3      3
#> 19      4      3      2
#> 20      3      4      2

# Just one of the many possible ICC coefficients
icc(anxiety, model="twoway", type="agreement")
#>  Single Score Intraclass Correlation
#>
#>    Model: twoway
#>    Type : agreement
#>
#>    Subjects = 20
#>      Raters = 3
#>    ICC(A,1) = 0.198
#>
#>  F-Test, H0: r0 = 0 ; H1: r0 > 0
#>  F(19,39.7) = 1.83 , p = 0.0543
#>
#>  95%-Confidence Interval for ICC Population Values:
#>   -0.039 < ICC < 0.494
``````
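
The other decisions listed above map onto the `model`, `type`, and `unit` arguments of `icc()`. As a sketch of two other forms (output omitted here):

``````
# Average-rating version of the same two-way agreement ICC
icc(anxiety, model="twoway", type="agreement", unit="average")

# One-way, single-rating ICC (the function's defaults)
icc(anxiety, model="oneway", unit="single")
``````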