# Inter-rater reliability

## Problem

You want to calculate inter-rater reliability.

## Solution

The method for calculating inter-rater reliability will depend on the type of data (categorical, ordinal, or continuous) and the number of coders.

### Categorical data

Suppose this is your data set. It consists of 30 cases, rated by three coders. It is a subset of the `diagnoses` data set in the irr package.

```
library(irr)
#> Loading required package: lpSolve
data(diagnoses)
dat <- diagnoses[,1:3]
dat
#>                      rater1                   rater2                   rater3
#> 1               4. Neurosis              4. Neurosis              4. Neurosis
#> 2   2. Personality Disorder  2. Personality Disorder  2. Personality Disorder
#> 3   2. Personality Disorder         3. Schizophrenia         3. Schizophrenia
#> 4                  5. Other                 5. Other                 5. Other
#> 5   2. Personality Disorder  2. Personality Disorder  2. Personality Disorder
#> 6             1. Depression            1. Depression         3. Schizophrenia
#> 7          3. Schizophrenia         3. Schizophrenia         3. Schizophrenia
#> 8             1. Depression            1. Depression         3. Schizophrenia
#> 9             1. Depression            1. Depression              4. Neurosis
#> 10                 5. Other                 5. Other                 5. Other
#> 11            1. Depression              4. Neurosis              4. Neurosis
#> 12            1. Depression  2. Personality Disorder              4. Neurosis
#> 13  2. Personality Disorder  2. Personality Disorder  2. Personality Disorder
#> 14            1. Depression              4. Neurosis              4. Neurosis
#> 15  2. Personality Disorder  2. Personality Disorder              4. Neurosis
#> 16         3. Schizophrenia         3. Schizophrenia         3. Schizophrenia
#> 17            1. Depression            1. Depression            1. Depression
#> 18            1. Depression            1. Depression            1. Depression
#> 19  2. Personality Disorder  2. Personality Disorder              4. Neurosis
#> 20            1. Depression         3. Schizophrenia         3. Schizophrenia
#> 21                 5. Other                 5. Other                 5. Other
#> 22  2. Personality Disorder              4. Neurosis              4. Neurosis
#> 23  2. Personality Disorder  2. Personality Disorder              4. Neurosis
#> 24            1. Depression            1. Depression              4. Neurosis
#> 25            1. Depression              4. Neurosis              4. Neurosis
#> 26  2. Personality Disorder  2. Personality Disorder  2. Personality Disorder
#> 27            1. Depression            1. Depression            1. Depression
#> 28  2. Personality Disorder  2. Personality Disorder              4. Neurosis
#> 29            1. Depression         3. Schizophrenia         3. Schizophrenia
#> 30                 5. Other                 5. Other                 5. Other
```

#### Two raters: Cohen’s Kappa

This will calculate Cohen’s Kappa for two coders; in this case, raters 1 and 2.

```
kappa2(dat[,c(1,2)], "unweighted")
#> Cohen's Kappa for 2 Raters (Weights: unweighted)
#>
#> Subjects = 30
#> Raters = 2
#> Kappa = 0.651
#>
#> z = 7
#> p-value = 2.63e-12
```
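
The same call works for any pair of columns in `dat`. As a quick sketch, the other rater pairs can be checked the same way, or all pairs at once with `combn()`; the `$value` element of the result holds just the Kappa estimate (output not shown):

```
# Cohen's Kappa for the other rater pairs
kappa2(dat[,c(1,3)], "unweighted")   # raters 1 and 3
kappa2(dat[,c(2,3)], "unweighted")   # raters 2 and 3

# Kappa values for every pair of columns
combn(ncol(dat), 2, function(cols) kappa2(dat[,cols], "unweighted")$value)
```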

#### N raters: Fleiss’s Kappa, Conger’s Kappa

If there are more than two raters, use Fleiss’s Kappa.

```
kappam.fleiss(dat)
#> Fleiss' Kappa for m Raters
#>
#> Subjects = 30
#> Raters = 3
#> Kappa = 0.534
#>
#> z = 9.89
#> p-value = 0
```

It is also possible to use Conger’s (1980) exact Kappa. (Note that it is not clear to me when it is better or worse to use the exact method.)

```
kappam.fleiss(dat, exact=TRUE)
#> Fleiss' Kappa for m Raters (exact value)
#>
#> Subjects = 30
#> Raters = 3
#> Kappa = 0.55
```
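
`kappam.fleiss()` also accepts a `detail` argument; setting it to `TRUE` adds category-wise Kappas to the output, which can help show which categories the raters agree on most. A brief sketch (output not shown):

```
# Overall Kappa plus a Kappa for each rating category
kappam.fleiss(dat, detail=TRUE)
```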

### Ordinal data: weighted Kappa

If the data is ordinal, then it may be appropriate to use a **weighted** Kappa. For example, if the possible values are low, medium, and high, then if a case were rated medium and high by the two coders, they would be in better agreement than if the ratings were low and high.
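
To make the idea concrete, here is a minimal sketch with made-up ratings on a 1 (low), 2 (medium), 3 (high) scale; the weighted version penalizes a 1-vs-3 disagreement more heavily than a 2-vs-3 one, while the unweighted version treats all disagreements the same (output not shown):

```
library(irr)

# Hypothetical ratings from two coders on an ordinal 1/2/3 scale
toy <- data.frame(coder1 = c(1, 1, 2, 3, 2, 3),
                  coder2 = c(1, 2, 2, 3, 3, 1))

kappa2(toy, "unweighted")  # every disagreement counts the same
kappa2(toy, "equal")       # linear weights: 1 vs. 3 counts as a bigger disagreement
```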

We will use a subset of the `anxiety` data set from the irr package.

```
library(irr)
data(anxiety)
dfa <- anxiety[,c(1,2)]
dfa
#>    rater1 rater2
#> 1       3      3
#> 2       3      6
#> 3       3      4
#> 4       4      6
#> 5       5      2
#> 6       5      4
#> 7       2      2
#> 8       3      4
#> 9       5      3
#> 10      2      3
#> 11      2      2
#> 12      6      3
#> 13      1      3
#> 14      5      3
#> 15      2      2
#> 16      2      2
#> 17      1      1
#> 18      2      3
#> 19      4      3
#> 20      3      4
```

The weighted Kappa calculation must be made with 2 raters, and can use either **linear** or **squared** weights of the differences.

```
# Compare raters 1 and 2 with squared weights
kappa2(dfa, "squared")
#> Cohen's Kappa for 2 Raters (Weights: squared)
#>
#> Subjects = 20
#> Raters = 2
#> Kappa = 0.297
#>
#> z = 1.34
#> p-value = 0.18
# Use linear weights
kappa2(dfa, "equal")
#> Cohen's Kappa for 2 Raters (Weights: equal)
#>
#> Subjects = 20
#> Raters = 2
#> Kappa = 0.189
#>
#> z = 1.42
#> p-value = 0.157
```

Compare the results above to the **unweighted** calculation (used in the tests for non-ordinal data above), which treats all differences as the same:

```
kappa2(dfa, "unweighted")
#> Cohen's Kappa for 2 Raters (Weights: unweighted)
#>
#> Subjects = 20
#> Raters = 2
#> Kappa = 0.119
#>
#> z = 1.16
#> p-value = 0.245
```

#### Weighted Kappa with factors

The data above is numeric, but a weighted Kappa can also be calculated for factors. Note that the factor levels must be in the correct order, or results will be wrong.

```
# Make a factor-ized version of the data
dfa2 <- dfa
dfa2$rater1 <- factor(dfa2$rater1, levels=1:6, labels=LETTERS[1:6])
dfa2$rater2 <- factor(dfa2$rater2, levels=1:6, labels=LETTERS[1:6])
dfa2
#>    rater1 rater2
#> 1       C      C
#> 2       C      F
#> 3       C      D
#> 4       D      F
#> 5       E      B
#> 6       E      D
#> 7       B      B
#> 8       C      D
#> 9       E      C
#> 10      B      C
#> 11      B      B
#> 12      F      C
#> 13      A      C
#> 14      E      C
#> 15      B      B
#> 16      B      B
#> 17      A      A
#> 18      B      C
#> 19      D      C
#> 20      C      D
# The factor levels must be in the correct order:
levels(dfa2$rater1)
#> [1] "A" "B" "C" "D" "E" "F"
levels(dfa2$rater2)
#> [1] "A" "B" "C" "D" "E" "F"
# The results are the same as with the numeric data, above
kappa2(dfa2, "squared")
#> Cohen's Kappa for 2 Raters (Weights: squared)
#>
#> Subjects = 20
#> Raters = 2
#> Kappa = 0.297
#>
#> z = 1.34
#> p-value = 0.18
# Use linear weights
kappa2(dfa2, "equal")
#> Cohen's Kappa for 2 Raters (Weights: equal)
#>
#> Subjects = 20
#> Raters = 2
#> Kappa = 0.189
#>
#> z = 1.42
#> p-value = 0.157
```

### Continuous data: Intraclass correlation coefficient

When the variable is continuous, the intraclass correlation coefficient (ICC) should be computed. From the documentation for `icc`:

When considering which form of ICC is appropriate for an actual set of data, one has to take several decisions (Shrout & Fleiss, 1979):

- Should only the subjects be considered as random effects (`"oneway"` model, default), or are subjects and raters randomly chosen from a bigger pool of persons (`"twoway"` model)?
- If differences in judges’ mean ratings are of interest, interrater `"agreement"` instead of `"consistency"` (default) should be computed.
- If the unit of analysis is a mean of several ratings, `unit` should be changed to `"average"`. In most cases, however, single values (`unit="single"`, default) are used.

We will use the `anxiety` data set from the irr package.

```
library(irr)
data(anxiety)
anxiety
#>    rater1 rater2 rater3
#> 1       3      3      2
#> 2       3      6      1
#> 3       3      4      4
#> 4       4      6      4
#> 5       5      2      3
#> 6       5      4      2
#> 7       2      2      1
#> 8       3      4      6
#> 9       5      3      1
#> 10      2      3      1
#> 11      2      2      1
#> 12      6      3      2
#> 13      1      3      3
#> 14      5      3      3
#> 15      2      2      1
#> 16      2      2      1
#> 17      1      1      3
#> 18      2      3      3
#> 19      4      3      2
#> 20      3      4      2
# Just one of the many possible ICC coefficients
icc(anxiety, model="twoway", type="agreement")
#> Single Score Intraclass Correlation
#>
#> Model: twoway
#> Type : agreement
#>
#> Subjects = 20
#> Raters = 3
#> ICC(A,1) = 0.198
#>
#> F-Test, H0: r0 = 0 ; H1: r0 > 0
#> F(19,39.7) = 1.83 , p = 0.0543
#>
#> 95%-Confidence Interval for ICC Population Values:
#> -0.039 < ICC < 0.494
```
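
The three decisions quoted above correspond to the `model`, `type`, and `unit` arguments of `icc()`. A few of the other variants, as a sketch (output not shown):

```
# Subjects as the only random effect (one-way model)
icc(anxiety, model="oneway")

# Two-way model, ignoring differences in the raters' mean ratings
icc(anxiety, model="twoway", type="consistency")

# Reliability of the average of the three raters' ratings
icc(anxiety, model="twoway", type="agreement", unit="average")
```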