## Problem

You have categorical data and want test whether the frequency distribution of values differs from an expected frequency, or if the distribution differs between groups.

## Solution

There are two general questions that frequency tests address:

1. Does the frequency distribution differ from an expected, or theoretical, distribution (e.g., expect 50% yes, 50% no)? (Goodness-of-fit test)
2. Does the frequency distribution differ between two or more groups? (Independence test)

Of the statistical tests commonly used to address these questions, there are exact tests and approximate tests.

Expected distribution Comparing groups
Exact Exact binomial Fisher’s exact
Approximate Chi-square goodness of fit Chi-square test of independence

Note: The exact binomial test can only be used on one variable that has two levels. Fisher’s exact test can only be used with two-dimensional contingency tables (for example, it can be used when there is one independent variable and one dependent variable, but not when there are 2 IVs and 1 DV).

To test for paired or within-subjects effects, McNemar’s test can be used. To use it, there must be one IV with two levels, and one DV with two levels.

To test for the independence of two variables with repeated measurements, the Cochran-Mantel-Haenszel test can be used.

Suppose this is your data, where each row represents one case:

``````data <- read.table(header=TRUE, text='
condition result
control      0
control      0
control      0
control      0
treatment      1
control      0
control      0
treatment      0
treatment      1
control      1
treatment      1
treatment      1
treatment      1
treatment      1
treatment      0
control      0
control      1
control      0
control      1
treatment      0
treatment      1
treatment      0
treatment      0
control      0
treatment      1
control      0
control      0
treatment      1
treatment      0
treatment      1
')
``````

Instead of a data frame of cases, your data might be stored as a data frame of counts, or as a contingency table. For the analyses here, your data must be in a contingency table. See this page for more information.

### Tests of goodness-of-fit (expected frequency)

#### Chi-square test

To test the hypothesis that the two values in the result column (ignoring condition) are equally likely (50%-50%) in the population:

``````# Create contingency table for result, which contains values 0 and 1
# The column names are "0" and "1" (they're not actually values in the table)
ct <- table(data\$result)
ct
#>
#>  0  1
#> 17 13

# An alternative is to manually create the table
#ct <- matrix(c(17,13), ncol=2)
#colnames(ct1) <- c("0", "1")

# Perform Chi-square test
chisq.test(ct)
#>
#> 	Chi-squared test for given probabilities
#>
#> data:  ct
#> X-squared = 0.53333, df = 1, p-value = 0.4652
``````

To test the sample with different expected frequencies (in this case 0.75 and 0.25):

``````# Probability table - must sum to 1
pt <- c(.75, .25)
chisq.test(ct, p=pt)
#>
#> 	Chi-squared test for given probabilities
#>
#> data:  ct
#> X-squared = 5.3778, df = 1, p-value = 0.02039
``````

If you want to extract information out of the test result, you can save the result into a variable, examine the variable with `str()`, and pull out the parts you want. For example:

``````chi_res <- chisq.test(ct, p=pt)
# View all the parts that can be extracted
str(chi_res)
#> List of 9
#>  \$ statistic: Named num 5.38
#>   ..- attr(*, "names")= chr "X-squared"
#>  \$ parameter: Named num 1
#>   ..- attr(*, "names")= chr "df"
#>  \$ p.value  : num 0.0204
#>  \$ method   : chr "Chi-squared test for given probabilities"
#>  \$ data.name: chr "ct"
#>  \$ observed : 'table' int [1:2(1d)] 17 13
#>   ..- attr(*, "dimnames")=List of 1
#>   .. ..\$ : chr [1:2] "0" "1"
#>  \$ expected : Named num [1:2] 22.5 7.5
#>   ..- attr(*, "names")= chr [1:2] "0" "1"
#>  \$ residuals: table [1:2(1d)] -1.16 2.01
#>   ..- attr(*, "dimnames")=List of 1
#>   .. ..\$ : chr [1:2] "0" "1"
#>  \$ stdres   : table [1:2(1d)] -2.32 2.32
#>   ..- attr(*, "dimnames")=List of 1
#>   .. ..\$ : chr [1:2] "0" "1"
#>  - attr(*, "class")= chr "htest"

# Get the Chi-squared value
chi_res\$statistic
#> X-squared
#>  5.377778

# Get the p-value
chi_res\$p.value
#> [1] 0.02039484
``````

#### Exact binomial test

The exact binomial test is used only with data where there is one variable with two possible values.

``````ct <- table(data\$result)
ct
#>
#>  0  1
#> 17 13

binom.test(ct, p=0.5)
#>
#> 	Exact binomial test
#>
#> data:  ct
#> number of successes = 17, number of trials = 30, p-value = 0.5847
#> alternative hypothesis: true probability of success is not equal to 0.5
#> 95 percent confidence interval:
#>  0.3742735 0.7453925
#> sample estimates:
#> probability of success
#>              0.5666667

# Using expected probability of 75% -- notice that 1 is in the second column of
# the table so need to use p=.25
binom.test(ct, p=0.25)
#>
#> 	Exact binomial test
#>
#> data:  ct
#> number of successes = 17, number of trials = 30, p-value = 0.0002157
#> alternative hypothesis: true probability of success is not equal to 0.25
#> 95 percent confidence interval:
#>  0.3742735 0.7453925
#> sample estimates:
#> probability of success
#>              0.5666667
``````

If you want to extract information out of the test result, you can save the result into a variable, examine the variable with `str()`, and pull out the parts you want. For example:

``````bin_res <- binom.test(ct, p=0.25)
# View all the parts that can be extracted
str(bin_res)
#> List of 9
#>  \$ statistic  : Named num 17
#>   ..- attr(*, "names")= chr "number of successes"
#>  \$ parameter  : Named num 30
#>   ..- attr(*, "names")= chr "number of trials"
#>  \$ p.value    : Named num 0.000216
#>   ..- attr(*, "names")= chr "0"
#>  \$ conf.int   : atomic [1:2] 0.374 0.745
#>   ..- attr(*, "conf.level")= num 0.95
#>  \$ estimate   : Named num 0.567
#>   ..- attr(*, "names")= chr "probability of success"
#>  \$ null.value : Named num 0.25
#>   ..- attr(*, "names")= chr "probability of success"
#>  \$ alternative: chr "two.sided"
#>  \$ method     : chr "Exact binomial test"
#>  \$ data.name  : chr "ct"
#>  - attr(*, "class")= chr "htest"

# Get the p-value
bin_res\$p.value
#>            0
#> 0.0002156938

# Get the 95% confidence interval
bin_res\$conf.int
#> [1] 0.3742735 0.7453925
#> attr(,"conf.level")
#> [1] 0.95
``````

### Tests of independence (comparing groups)

#### Chi-square test

To test whether the control and treatment conditions result in different frequencies, use a 2D contingency table.

``````ct <- table(data\$condition, data\$result)
ct
#>
#>              0  1
#>   control   11  3
#>   treatment  6 10

chisq.test(ct)
#>
#> 	Pearson's Chi-squared test with Yates' continuity correction
#>
#> data:  ct
#> X-squared = 3.593, df = 1, p-value = 0.05802

chisq.test(ct, correct=FALSE)
#>
#> 	Pearson's Chi-squared test
#>
#> data:  ct
#> X-squared = 5.1293, df = 1, p-value = 0.02353
``````

For 2x2 tables, it uses Yates’s continuity correction by default. This test is more conservative for small sample sizes. The flag `correct=FALSE`, tells it to use Pearson’s chi-square test without the correction.

#### Fisher’s exact test

For small sample sizes, Fisher’s exact test may be more appropriate. It is typically used for 2x2 tables with relatively small sample sizes because it is computationally intensive for more complicated (e.g., 2x3) tables, and those with larger sample sizes. However, the implementation in R seems to handle larger cases without trouble.

``````ct <- table(data\$condition, data\$result)
ct
#>
#>              0  1
#>   control   11  3
#>   treatment  6 10

fisher.test(ct)
#>
#> 	Fisher's Exact Test for Count Data
#>
#> data:  ct
#> p-value = 0.03293
#> alternative hypothesis: true odds ratio is not equal to 1
#> 95 percent confidence interval:
#>   0.966861 45.555016
#> sample estimates:
#> odds ratio
#>   5.714369
``````

#### Cochran-Mantel-Haenszel test

The Cochran-Mantel-Haenszel test (or Mantel-Haenszel test) is used for testing the independence of two dichotomous variables with repeated measurements. It is most commonly used with 2x2xK tables, where K is the number of measurement conditions. For example, you may want to know whether a treatment (C vs. D) affects the likelihood of recovery (yes or no). Suppose, though, that the treatments were administered at three different times of day, morning, afternoon, and night – and that you want your test to control for this. The CMH test would then operate on a 2x2x3 contingency table, where the third variable is the one you wish to control for.

The implementation of the CMH test in R can handle dimensions greater than 2x2xK. For example, you could use it for a 3x3xK contingency table.

In the following example (borrowed from McDonald’s Handbook of Biological Statistics), there are three variables: Location, Allele, and Habitat. The question is whether Allele (94 or non-94) and Habitat (marine or estuarine) are independent, when location is controlled for.

``````fish <- read.table(header=TRUE, text='
Location Allele   Habitat Count
tillamook     94    marine    56
tillamook     94 estuarine    69
tillamook non-94    marine    40
tillamook non-94 estuarine    77
yaquina     94    marine    61
yaquina     94 estuarine   257
yaquina non-94    marine    57
yaquina non-94 estuarine   301
alsea     94    marine    73
alsea     94 estuarine    65
alsea non-94    marine    71
alsea non-94 estuarine    79
umpqua     94    marine    71
umpqua     94 estuarine    48
umpqua non-94    marine    55
umpqua non-94 estuarine    48
')
``````

Note that the data above is entered as a data frame of counts, instead of a data frame of cases as in previous examples. Instead of using `table()` to convert it to a contingency table, use `xtabs()` instead. For more information, see this page.

``````
# Make a 3D contingency table, where the last variable, Location, is the one to
# control for. If you use table() for case data, the last variable is also the
# one to control for.
ct <- xtabs(Count ~ Allele + Habitat + Location, data=fish)
ct
#> , , Location = alsea
#>
#>         Habitat
#> Allele   estuarine marine
#>   94            65     73
#>   non-94        79     71
#>
#> , , Location = tillamook
#>
#>         Habitat
#> Allele   estuarine marine
#>   94            69     56
#>   non-94        77     40
#>
#> , , Location = umpqua
#>
#>         Habitat
#> Allele   estuarine marine
#>   94            48     71
#>   non-94        48     55
#>
#> , , Location = yaquina
#>
#>         Habitat
#> Allele   estuarine marine
#>   94           257     61
#>   non-94       301     57

# This prints ct in a "flat" format
ftable(ct)
#>                  Location alsea tillamook umpqua yaquina
#> Allele Habitat
#> 94     estuarine             65        69     48     257
#>        marine                73        56     71      61
#> non-94 estuarine             79        77     48     301
#>        marine                71        40     55      57

# Print with different arrangement of variables
ftable(ct, row.vars=c("Location","Allele"), col.vars="Habitat")
#>                  Habitat estuarine marine
#> Location  Allele
#> alsea     94                    65     73
#>           non-94                79     71
#> tillamook 94                    69     56
#>           non-94                77     40
#> umpqua    94                    48     71
#>           non-94                48     55
#> yaquina   94                   257     61
#>           non-94               301     57

mantelhaen.test(ct)
#>
#> 	Mantel-Haenszel chi-squared test with continuity correction
#>
#> data:  ct
#> Mantel-Haenszel X-squared = 5.0497, df = 1, p-value = 0.02463
#> alternative hypothesis: true common odds ratio is not equal to 1
#> 95 percent confidence interval:
#>  0.6005522 0.9593077
#> sample estimates:
#> common odds ratio
#>          0.759022
``````

According to this test, there is a relationship between Allele and Habitat, controlling for Location, p=.025.

Note that the first two dimensions of the contingency table are treated the same (so their order can be swapped without affecting the test result), the highest-order dimension in the contingency table is different. This is illustrated below.

``````# The following two create different contingency tables, but have the same result
# with the CMH test
ct.1 <- xtabs(Count ~ Habitat + Allele + Location, data=fish)
ct.2 <- xtabs(Count ~ Allele + Habitat + Location, data=fish)
mantelhaen.test(ct.1)
#>
#> 	Mantel-Haenszel chi-squared test with continuity correction
#>
#> data:  ct.1
#> Mantel-Haenszel X-squared = 5.0497, df = 1, p-value = 0.02463
#> alternative hypothesis: true common odds ratio is not equal to 1
#> 95 percent confidence interval:
#>  0.6005522 0.9593077
#> sample estimates:
#> common odds ratio
#>          0.759022
mantelhaen.test(ct.2)
#>
#> 	Mantel-Haenszel chi-squared test with continuity correction
#>
#> data:  ct.2
#> Mantel-Haenszel X-squared = 5.0497, df = 1, p-value = 0.02463
#> alternative hypothesis: true common odds ratio is not equal to 1
#> 95 percent confidence interval:
#>  0.6005522 0.9593077
#> sample estimates:
#> common odds ratio
#>          0.759022

# With Allele last, we get a different result
ct.3 <- xtabs(Count ~ Location + Habitat + Allele, data=fish)
ct.4 <- xtabs(Count ~ Habitat + Location + Allele, data=fish)
mantelhaen.test(ct.3)
#>
#> 	Cochran-Mantel-Haenszel test
#>
#> data:  ct.3
#> Cochran-Mantel-Haenszel M^2 = 168.47, df = 3, p-value < 2.2e-16
mantelhaen.test(ct.4)
#>
#> 	Cochran-Mantel-Haenszel test
#>
#> data:  ct.4
#> Cochran-Mantel-Haenszel M^2 = 168.47, df = 3, p-value < 2.2e-16

# With Habitat last, we get a different result
ct.5 <- xtabs(Count ~ Allele + Location + Habitat, data=fish)
ct.6 <- xtabs(Count ~ Location + Allele + Habitat, data=fish)
mantelhaen.test(ct.5)
#>
#> 	Cochran-Mantel-Haenszel test
#>
#> data:  ct.5
#> Cochran-Mantel-Haenszel M^2 = 2.0168, df = 3, p-value = 0.5689
mantelhaen.test(ct.6)
#>
#> 	Cochran-Mantel-Haenszel test
#>
#> data:  ct.6
#> Cochran-Mantel-Haenszel M^2 = 2.0168, df = 3, p-value = 0.5689
``````

### McNemar’s test

McNemar’s test is conceptually like a within-subjects test for frequency data. For example, suppose you want to test whether a treatment increases the probability that a person will respond “yes” to a question, and that you get just one pre-treatment and one post-treatment response per person. A standard chi-square test would be inappropriate, because it assumes that the groups are independent. Instead, McNemar’s test can be used. This test can only be used when there are two measurements of a dichotomous variable. The 2x2 contingency table used for McNemar’s test bears a superficial resemblance to those used for “normal” chi-square tests, but it is different in structure.

Suppose this is your data. Each subject has a pre-treatment and post-treatment response.

``````data <- read.table(header=TRUE, text='
subject time result
1  pre      0
1 post      1
2  pre      1
2 post      1
3  pre      0
3 post      1
4  pre      1
4 post      0
5  pre      1
5 post      1
6  pre      0
6 post      1
7  pre      0
7 post      1
8  pre      0
8 post      1
9  pre      0
9 post      1
10  pre      1
10 post      1
11  pre      0
11 post      0
12  pre      1
12 post      1
13  pre      0
13 post      1
14  pre      0
14 post      0
15  pre      0
15 post      1
')
``````

``````library(tidyr)

data_wide
#>    subject post pre
#> 1        1    1   0
#> 2        2    1   1
#> 3        3    1   0
#> 4        4    0   1
#> 5        5    1   1
#> 6        6    1   0
#> 7        7    1   0
#> 8        8    1   0
#> 9        9    1   0
#> 10      10    1   1
#> 11      11    0   0
#> 12      12    1   1
#> 13      13    1   0
#> 14      14    0   0
#> 15      15    1   0
``````

Next, generate the contingency table from just the pre and post columns from the data frame:

``````ct <- table( data_wide[,c("pre","post")] )
ct
#>    post
#> pre 0 1
#>   0 2 8
#>   1 1 4

# The contingency table above puts each subject into one of four cells, depending
# on their pre- and post-treatment response. Note that it it is different from
# the contingency table used for a "normal" chi-square test, shown below. The table
# below does not account for repeated measurements, and so it is not useful for
# the purposes here.

# table(data[,c("time","result")])
#       result
# time    0  1
#   post  3 12
#   pre  10  5
``````

After creating the appropriate contingency table, run the test:

``````mcnemar.test(ct)
#>
#> 	McNemar's Chi-squared test with continuity correction
#>
#> data:  ct
#> McNemar's chi-squared = 4, df = 1, p-value = 0.0455
``````

For small sample sizes, it uses a continuity correction. Instead of using this correction, you can use an exact version of McNemar’s test, which is more accurate. It is available in the package `exact2x2`.

``````library(exact2x2)