Problem

You have categorical data and want test whether the frequency distribution of values differs from an expected frequency, or if the distribution differs between groups.

Solution

There are two general questions that frequency tests address:

  1. Does the frequency distribution differ from an expected, or theoretical, distribution (e.g., expect 50% yes, 50% no)? (Goodness-of-fit test)
  2. Does the frequency distribution differ between two or more groups? (Independence test)

Of the statistical tests commonly used to address these questions, there are exact tests and approximate tests.

  Expected distribution Comparing groups
Exact Exact binomial Fisher’s exact
Approximate Chi-square goodness of fit Chi-square test of independence

Note: The exact binomial test can only be used on one variable that has two levels. Fisher’s exact test can only be used with two-dimensional contingency tables (for example, it can be used when there is one independent variable and one dependent variable, but not when there are 2 IVs and 1 DV).

To test for paired or within-subjects effects, McNemar’s test can be used. To use it, there must be one IV with two levels, and one DV with two levels.

To test for the independence of two variables with repeated measurements, the Cochran-Mantel-Haenszel test can be used.

Suppose this is your data, where each row represents one case:

data <- read.table(header=TRUE, text='
 condition result
   control      0
   control      0
   control      0
   control      0
 treatment      1
   control      0
   control      0
 treatment      0
 treatment      1
   control      1
 treatment      1
 treatment      1
 treatment      1
 treatment      1
 treatment      0
   control      0
   control      1
   control      0
   control      1
 treatment      0
 treatment      1
 treatment      0
 treatment      0
   control      0
 treatment      1
   control      0
   control      0
 treatment      1
 treatment      0
 treatment      1
')

Instead of a data frame of cases, your data might be stored as a data frame of counts, or as a contingency table. For the analyses here, your data must be in a contingency table. See this page for more information.

Tests of goodness-of-fit (expected frequency)

Chi-square test

To test the hypothesis that the two values in the result column (ignoring condition) are equally likely (50%-50%) in the population:

# Create contingency table for result, which contains values 0 and 1
# The column names are "0" and "1" (they're not actually values in the table)
ct <- table(data$result)
ct
#> 
#>  0  1 
#> 17 13

# An alternative is to manually create the table
#ct <- matrix(c(17,13), ncol=2)
#colnames(ct1) <- c("0", "1")

# Perform Chi-square test
chisq.test(ct)
#> 
#> 	Chi-squared test for given probabilities
#> 
#> data:  ct
#> X-squared = 0.53333, df = 1, p-value = 0.4652

To test the sample with different expected frequencies (in this case 0.75 and 0.25):

# Probability table - must sum to 1
pt <- c(.75, .25)
chisq.test(ct, p=pt)
#> 
#> 	Chi-squared test for given probabilities
#> 
#> data:  ct
#> X-squared = 5.3778, df = 1, p-value = 0.02039

If you want to extract information out of the test result, you can save the result into a variable, examine the variable with str(), and pull out the parts you want. For example:

chi_res <- chisq.test(ct, p=pt)
# View all the parts that can be extracted
str(chi_res)
#> List of 9
#>  $ statistic: Named num 5.38
#>   ..- attr(*, "names")= chr "X-squared"
#>  $ parameter: Named num 1
#>   ..- attr(*, "names")= chr "df"
#>  $ p.value  : num 0.0204
#>  $ method   : chr "Chi-squared test for given probabilities"
#>  $ data.name: chr "ct"
#>  $ observed : 'table' int [1:2(1d)] 17 13
#>   ..- attr(*, "dimnames")=List of 1
#>   .. ..$ : chr [1:2] "0" "1"
#>  $ expected : Named num [1:2] 22.5 7.5
#>   ..- attr(*, "names")= chr [1:2] "0" "1"
#>  $ residuals: table [1:2(1d)] -1.16 2.01
#>   ..- attr(*, "dimnames")=List of 1
#>   .. ..$ : chr [1:2] "0" "1"
#>  $ stdres   : table [1:2(1d)] -2.32 2.32
#>   ..- attr(*, "dimnames")=List of 1
#>   .. ..$ : chr [1:2] "0" "1"
#>  - attr(*, "class")= chr "htest"

# Get the Chi-squared value
chi_res$statistic
#> X-squared 
#>  5.377778

# Get the p-value
chi_res$p.value
#> [1] 0.02039484

Exact binomial test

The exact binomial test is used only with data where there is one variable with two possible values.

ct <- table(data$result)
ct
#> 
#>  0  1 
#> 17 13

binom.test(ct, p=0.5)
#> 
#> 	Exact binomial test
#> 
#> data:  ct
#> number of successes = 17, number of trials = 30, p-value = 0.5847
#> alternative hypothesis: true probability of success is not equal to 0.5
#> 95 percent confidence interval:
#>  0.3742735 0.7453925
#> sample estimates:
#> probability of success 
#>              0.5666667


# Using expected probability of 75% -- notice that 1 is in the second column of
# the table so need to use p=.25
binom.test(ct, p=0.25)
#> 
#> 	Exact binomial test
#> 
#> data:  ct
#> number of successes = 17, number of trials = 30, p-value = 0.0002157
#> alternative hypothesis: true probability of success is not equal to 0.25
#> 95 percent confidence interval:
#>  0.3742735 0.7453925
#> sample estimates:
#> probability of success 
#>              0.5666667

If you want to extract information out of the test result, you can save the result into a variable, examine the variable with str(), and pull out the parts you want. For example:

bin_res <- binom.test(ct, p=0.25)
# View all the parts that can be extracted
str(bin_res)
#> List of 9
#>  $ statistic  : Named num 17
#>   ..- attr(*, "names")= chr "number of successes"
#>  $ parameter  : Named num 30
#>   ..- attr(*, "names")= chr "number of trials"
#>  $ p.value    : Named num 0.000216
#>   ..- attr(*, "names")= chr "0"
#>  $ conf.int   : atomic [1:2] 0.374 0.745
#>   ..- attr(*, "conf.level")= num 0.95
#>  $ estimate   : Named num 0.567
#>   ..- attr(*, "names")= chr "probability of success"
#>  $ null.value : Named num 0.25
#>   ..- attr(*, "names")= chr "probability of success"
#>  $ alternative: chr "two.sided"
#>  $ method     : chr "Exact binomial test"
#>  $ data.name  : chr "ct"
#>  - attr(*, "class")= chr "htest"

# Get the p-value
bin_res$p.value
#>            0 
#> 0.0002156938

# Get the 95% confidence interval
bin_res$conf.int
#> [1] 0.3742735 0.7453925
#> attr(,"conf.level")
#> [1] 0.95

Tests of independence (comparing groups)

Chi-square test

To test whether the control and treatment conditions result in different frequencies, use a 2D contingency table.

ct <- table(data$condition, data$result)
ct
#>            
#>              0  1
#>   control   11  3
#>   treatment  6 10

chisq.test(ct)
#> 
#> 	Pearson's Chi-squared test with Yates' continuity correction
#> 
#> data:  ct
#> X-squared = 3.593, df = 1, p-value = 0.05802

chisq.test(ct, correct=FALSE)
#> 
#> 	Pearson's Chi-squared test
#> 
#> data:  ct
#> X-squared = 5.1293, df = 1, p-value = 0.02353

For 2x2 tables, it uses Yates’s continuity correction by default. This test is more conservative for small sample sizes. The flag correct=FALSE, tells it to use Pearson’s chi-square test without the correction.

Fisher’s exact test

For small sample sizes, Fisher’s exact test may be more appropriate. It is typically used for 2x2 tables with relatively small sample sizes because it is computationally intensive for more complicated (e.g., 2x3) tables, and those with larger sample sizes. However, the implementation in R seems to handle larger cases without trouble.

ct <- table(data$condition, data$result)
ct
#>            
#>              0  1
#>   control   11  3
#>   treatment  6 10

fisher.test(ct)
#> 
#> 	Fisher's Exact Test for Count Data
#> 
#> data:  ct
#> p-value = 0.03293
#> alternative hypothesis: true odds ratio is not equal to 1
#> 95 percent confidence interval:
#>   0.966861 45.555016
#> sample estimates:
#> odds ratio 
#>   5.714369

Cochran-Mantel-Haenszel test

The Cochran-Mantel-Haenszel test (or Mantel-Haenszel test) is used for testing the independence of two dichotomous variables with repeated measurements. It is most commonly used with 2x2xK tables, where K is the number of measurement conditions. For example, you may want to know whether a treatment (C vs. D) affects the likelihood of recovery (yes or no). Suppose, though, that the treatments were administered at three different times of day, morning, afternoon, and night – and that you want your test to control for this. The CMH test would then operate on a 2x2x3 contingency table, where the third variable is the one you wish to control for.

The implementation of the CMH test in R can handle dimensions greater than 2x2xK. For example, you could use it for a 3x3xK contingency table.

In the following example (borrowed from McDonald’s Handbook of Biological Statistics), there are three variables: Location, Allele, and Habitat. The question is whether Allele (94 or non-94) and Habitat (marine or estuarine) are independent, when location is controlled for.

fish <- read.table(header=TRUE, text='
  Location Allele   Habitat Count
 tillamook     94    marine    56
 tillamook     94 estuarine    69
 tillamook non-94    marine    40
 tillamook non-94 estuarine    77
   yaquina     94    marine    61
   yaquina     94 estuarine   257
   yaquina non-94    marine    57
   yaquina non-94 estuarine   301
     alsea     94    marine    73
     alsea     94 estuarine    65
     alsea non-94    marine    71
     alsea non-94 estuarine    79
    umpqua     94    marine    71
    umpqua     94 estuarine    48
    umpqua non-94    marine    55
    umpqua non-94 estuarine    48
')

Note that the data above is entered as a data frame of counts, instead of a data frame of cases as in previous examples. Instead of using table() to convert it to a contingency table, use xtabs() instead. For more information, see this page.


# Make a 3D contingency table, where the last variable, Location, is the one to 
# control for. If you use table() for case data, the last variable is also the 
# one to control for.
ct <- xtabs(Count ~ Allele + Habitat + Location, data=fish)
ct
#> , , Location = alsea
#> 
#>         Habitat
#> Allele   estuarine marine
#>   94            65     73
#>   non-94        79     71
#> 
#> , , Location = tillamook
#> 
#>         Habitat
#> Allele   estuarine marine
#>   94            69     56
#>   non-94        77     40
#> 
#> , , Location = umpqua
#> 
#>         Habitat
#> Allele   estuarine marine
#>   94            48     71
#>   non-94        48     55
#> 
#> , , Location = yaquina
#> 
#>         Habitat
#> Allele   estuarine marine
#>   94           257     61
#>   non-94       301     57

# This prints ct in a "flat" format
ftable(ct)
#>                  Location alsea tillamook umpqua yaquina
#> Allele Habitat                                          
#> 94     estuarine             65        69     48     257
#>        marine                73        56     71      61
#> non-94 estuarine             79        77     48     301
#>        marine                71        40     55      57

# Print with different arrangement of variables
ftable(ct, row.vars=c("Location","Allele"), col.vars="Habitat")
#>                  Habitat estuarine marine
#> Location  Allele                         
#> alsea     94                    65     73
#>           non-94                79     71
#> tillamook 94                    69     56
#>           non-94                77     40
#> umpqua    94                    48     71
#>           non-94                48     55
#> yaquina   94                   257     61
#>           non-94               301     57


mantelhaen.test(ct)
#> 
#> 	Mantel-Haenszel chi-squared test with continuity correction
#> 
#> data:  ct
#> Mantel-Haenszel X-squared = 5.0497, df = 1, p-value = 0.02463
#> alternative hypothesis: true common odds ratio is not equal to 1
#> 95 percent confidence interval:
#>  0.6005522 0.9593077
#> sample estimates:
#> common odds ratio 
#>          0.759022

According to this test, there is a relationship between Allele and Habitat, controlling for Location, p=.025.

Note that the first two dimensions of the contingency table are treated the same (so their order can be swapped without affecting the test result), the highest-order dimension in the contingency table is different. This is illustrated below.

# The following two create different contingency tables, but have the same result
# with the CMH test
ct.1 <- xtabs(Count ~ Habitat + Allele + Location, data=fish)
ct.2 <- xtabs(Count ~ Allele + Habitat + Location, data=fish)
mantelhaen.test(ct.1)
#> 
#> 	Mantel-Haenszel chi-squared test with continuity correction
#> 
#> data:  ct.1
#> Mantel-Haenszel X-squared = 5.0497, df = 1, p-value = 0.02463
#> alternative hypothesis: true common odds ratio is not equal to 1
#> 95 percent confidence interval:
#>  0.6005522 0.9593077
#> sample estimates:
#> common odds ratio 
#>          0.759022
mantelhaen.test(ct.2)
#> 
#> 	Mantel-Haenszel chi-squared test with continuity correction
#> 
#> data:  ct.2
#> Mantel-Haenszel X-squared = 5.0497, df = 1, p-value = 0.02463
#> alternative hypothesis: true common odds ratio is not equal to 1
#> 95 percent confidence interval:
#>  0.6005522 0.9593077
#> sample estimates:
#> common odds ratio 
#>          0.759022


# With Allele last, we get a different result
ct.3 <- xtabs(Count ~ Location + Habitat + Allele, data=fish)
ct.4 <- xtabs(Count ~ Habitat + Location + Allele, data=fish)
mantelhaen.test(ct.3)
#> 
#> 	Cochran-Mantel-Haenszel test
#> 
#> data:  ct.3
#> Cochran-Mantel-Haenszel M^2 = 168.47, df = 3, p-value < 2.2e-16
mantelhaen.test(ct.4)
#> 
#> 	Cochran-Mantel-Haenszel test
#> 
#> data:  ct.4
#> Cochran-Mantel-Haenszel M^2 = 168.47, df = 3, p-value < 2.2e-16


# With Habitat last, we get a different result
ct.5 <- xtabs(Count ~ Allele + Location + Habitat, data=fish)
ct.6 <- xtabs(Count ~ Location + Allele + Habitat, data=fish)
mantelhaen.test(ct.5)
#> 
#> 	Cochran-Mantel-Haenszel test
#> 
#> data:  ct.5
#> Cochran-Mantel-Haenszel M^2 = 2.0168, df = 3, p-value = 0.5689
mantelhaen.test(ct.6)
#> 
#> 	Cochran-Mantel-Haenszel test
#> 
#> data:  ct.6
#> Cochran-Mantel-Haenszel M^2 = 2.0168, df = 3, p-value = 0.5689

McNemar’s test

McNemar’s test is conceptually like a within-subjects test for frequency data. For example, suppose you want to test whether a treatment increases the probability that a person will respond “yes” to a question, and that you get just one pre-treatment and one post-treatment response per person. A standard chi-square test would be inappropriate, because it assumes that the groups are independent. Instead, McNemar’s test can be used. This test can only be used when there are two measurements of a dichotomous variable. The 2x2 contingency table used for McNemar’s test bears a superficial resemblance to those used for “normal” chi-square tests, but it is different in structure.

Suppose this is your data. Each subject has a pre-treatment and post-treatment response.

data <- read.table(header=TRUE, text='
 subject time result
       1  pre      0
       1 post      1
       2  pre      1
       2 post      1
       3  pre      0
       3 post      1
       4  pre      1
       4 post      0
       5  pre      1
       5 post      1
       6  pre      0
       6 post      1
       7  pre      0
       7 post      1
       8  pre      0
       8 post      1
       9  pre      0
       9 post      1
      10  pre      1
      10 post      1
      11  pre      0
      11 post      0
      12  pre      1
      12 post      1
      13  pre      0
      13 post      1
      14  pre      0
      14 post      0
      15  pre      0
      15 post      1
')

If your data is not already in wide format, it must be converted (see this page for more information):

library(tidyr)

data_wide <- spread(data, time, result)
data_wide
#>    subject post pre
#> 1        1    1   0
#> 2        2    1   1
#> 3        3    1   0
#> 4        4    0   1
#> 5        5    1   1
#> 6        6    1   0
#> 7        7    1   0
#> 8        8    1   0
#> 9        9    1   0
#> 10      10    1   1
#> 11      11    0   0
#> 12      12    1   1
#> 13      13    1   0
#> 14      14    0   0
#> 15      15    1   0

Next, generate the contingency table from just the pre and post columns from the data frame:

ct <- table( data_wide[,c("pre","post")] )
ct
#>    post
#> pre 0 1
#>   0 2 8
#>   1 1 4

# The contingency table above puts each subject into one of four cells, depending
# on their pre- and post-treatment response. Note that it it is different from
# the contingency table used for a "normal" chi-square test, shown below. The table
# below does not account for repeated measurements, and so it is not useful for
# the purposes here.

# table(data[,c("time","result")])
#       result
# time    0  1
#   post  3 12
#   pre  10  5

After creating the appropriate contingency table, run the test:

mcnemar.test(ct)
#> 
#> 	McNemar's Chi-squared test with continuity correction
#> 
#> data:  ct
#> McNemar's chi-squared = 4, df = 1, p-value = 0.0455

For small sample sizes, it uses a continuity correction. Instead of using this correction, you can use an exact version of McNemar’s test, which is more accurate. It is available in the package exact2x2.

library(exact2x2)
#> Loading required package: exactci
#> Loading required package: ssanv
mcnemar.exact(ct)
#> 
#> 	Exact McNemar test (with central confidence intervals)
#> 
#> data:  ct
#> b = 8, c = 1, p-value = 0.03906
#> alternative hypothesis: true odds ratio is not equal to 1
#> 95 percent confidence interval:
#>    1.072554 354.981246
#> sample estimates:
#> odds ratio 
#>          8