## Problem

You want to do summarize your data (with mean, standard deviation, etc.), broken down by group.

## Solution

There are three ways described here to group data based on some specified variables, and apply a summary function (like mean, standard deviation, etc.) to each group.

• The `ddply()` function. It is the easiest to use, though it requires the `plyr` package. This is probably what you want to use.
• The `summarizeBy()` function. It is easier to use, though it requires the `doBy` package.
• The `aggregate()` function. It is more difficult to use but is included in the base install of R.

Suppose you have this data and want to find the N, mean of change, standard deviation, and standard error of the mean for each group, where the groups are specified by each combination of sex and condition: F-placebo, F-aspirin, M-placebo, and M-aspirin.

``````data <- read.table(header=TRUE, text='
subject sex condition before after change
1   F   placebo   10.1   6.9   -3.2
2   F   placebo    6.3   4.2   -2.1
3   M   aspirin   12.4   6.3   -6.1
4   F   placebo    8.1   6.1   -2.0
5   M   aspirin   15.2   9.9   -5.3
6   F   aspirin   10.9   7.0   -3.9
7   F   aspirin   11.6   8.5   -3.1
8   M   aspirin    9.5   3.0   -6.5
9   F   placebo   11.5   9.0   -2.5
10   M   placebo   11.9  11.0   -0.9
11   F   aspirin   11.4   8.0   -3.4
12   M   aspirin   10.0   4.4   -5.6
13   M   aspirin   12.5   5.4   -7.1
14   M   placebo   10.6  10.6    0.0
15   M   aspirin    9.1   4.3   -4.8
16   F   placebo   12.1  10.2   -1.9
17   F   placebo   11.0   8.8   -2.2
18   F   placebo   11.9  10.2   -1.7
19   M   aspirin    9.1   3.6   -5.5
20   M   placebo   13.5  12.4   -1.1
21   M   aspirin   12.0   7.5   -4.5
22   F   placebo    9.1   7.6   -1.5
23   M   placebo    9.9   8.0   -1.9
24   F   placebo    7.6   5.2   -2.4
25   F   placebo   11.8   9.7   -2.1
26   F   placebo   11.8  10.7   -1.1
27   F   aspirin   10.1   7.9   -2.2
28   M   aspirin   11.6   8.3   -3.3
29   F   aspirin   11.3   6.8   -4.5
30   F   placebo   10.3   8.3   -2.0
')
``````

### Using ddply

``````library(plyr)

# Run the functions length, mean, and sd on the value of "change" for each group,
# broken down by sex + condition
cdata <- ddply(data, c("sex", "condition"), summarise,
N    = length(change),
mean = mean(change),
sd   = sd(change),
se   = sd / sqrt(N)
)
cdata
#>   sex condition  N      mean        sd        se
#> 1   F   aspirin  5 -3.420000 0.8642916 0.3865230
#> 2   F   placebo 12 -2.058333 0.5247655 0.1514867
#> 3   M   aspirin  9 -5.411111 1.1307569 0.3769190
#> 4   M   placebo  4 -0.975000 0.7804913 0.3902456
``````

#### Handling missing data

If there are NA’s in the data, you need to pass the flag `na.rm=TRUE` to each of the functions. `length()` doesn’t take `na.rm` as an option, so one way to work around it is to use `sum(!is.na(...))` to count how many non-NA’s there are.

``````# Put some NA's in the data
dataNA <- data
dataNA\$change[11:14] <- NA

cdata <- ddply(dataNA, c("sex", "condition"), summarise,
N    = sum(!is.na(change)),
mean = mean(change, na.rm=TRUE),
sd   = sd(change, na.rm=TRUE),
se   = sd / sqrt(N)
)
cdata
#>   sex condition  N      mean        sd        se
#> 1   F   aspirin  4 -3.425000 0.9979145 0.4989572
#> 2   F   placebo 12 -2.058333 0.5247655 0.1514867
#> 3   M   aspirin  7 -5.142857 1.0674848 0.4034713
#> 4   M   placebo  3 -1.300000 0.5291503 0.3055050
``````

#### A function for mean, count, standard deviation, standard error of the mean, and confidence interval

Instead of manually specifying all the values you want and then calculating the standard error, as shown above, this function will handle all of those details. It will do all the things described here:

• Find the mean, standard deviation, and count (N)
• Find the standard error of the mean (again, this may not be what you want if you are collapsing over a within-subject variable. See ../../Graphs/Plotting means and error bars (ggplot2) for information on how to make error bars for graphs with within-subjects variables.)
• Find a 95% confidence interval (or other value, if desired)
• Rename the columns so that the resulting data frame is easier to work with

To use, put this function in your code and call it as demonstrated below.

``````## Summarizes data.
## Gives count, mean, standard deviation, standard error of the mean, and confidence interval (default 95%).
##   data: a data frame.
##   measurevar: the name of a column that contains the variable to be summariezed
##   groupvars: a vector containing names of columns that contain grouping variables
##   na.rm: a boolean that indicates whether to ignore NA's
##   conf.interval: the percent range of the confidence interval (default is 95%)
summarySE <- function(data=NULL, measurevar, groupvars=NULL, na.rm=FALSE,
conf.interval=.95, .drop=TRUE) {
library(plyr)

# New version of length which can handle NA's: if na.rm==T, don't count them
length2 <- function (x, na.rm=FALSE) {
if (na.rm) sum(!is.na(x))
else       length(x)
}

# This does the summary. For each group's data frame, return a vector with
# N, mean, and sd
datac <- ddply(data, groupvars, .drop=.drop,
.fun = function(xx, col) {
c(N    = length2(xx[[col]], na.rm=na.rm),
mean = mean   (xx[[col]], na.rm=na.rm),
sd   = sd     (xx[[col]], na.rm=na.rm)
)
},
measurevar
)

# Rename the "mean" column
datac <- rename(datac, c("mean" = measurevar))

datac\$se <- datac\$sd / sqrt(datac\$N)  # Calculate standard error of the mean

# Confidence interval multiplier for standard error
# Calculate t-statistic for confidence interval:
# e.g., if conf.interval is .95, use .975 (above/below), and use df=N-1
ciMult <- qt(conf.interval/2 + .5, datac\$N-1)
datac\$ci <- datac\$se * ciMult

return(datac)
}
``````

Example usage (with 95% confidence interval). Instead of doing all the steps manually, as done previously, the `summarySE` function does it all in one step:

``````summarySE(data, measurevar="change", groupvars=c("sex", "condition"))
#>   sex condition  N    change        sd        se        ci
#> 1   F   aspirin  5 -3.420000 0.8642916 0.3865230 1.0731598
#> 2   F   placebo 12 -2.058333 0.5247655 0.1514867 0.3334201
#> 3   M   aspirin  9 -5.411111 1.1307569 0.3769190 0.8691767
#> 4   M   placebo  4 -0.975000 0.7804913 0.3902456 1.2419358

# With a data set with NA's, use na.rm=TRUE
summarySE(dataNA, measurevar="change", groupvars=c("sex", "condition"), na.rm=TRUE)
#>   sex condition  N    change        sd        se        ci
#> 1   F   aspirin  4 -3.425000 0.9979145 0.4989572 1.5879046
#> 2   F   placebo 12 -2.058333 0.5247655 0.1514867 0.3334201
#> 3   M   aspirin  7 -5.142857 1.0674848 0.4034713 0.9872588
#> 4   M   placebo  3 -1.300000 0.5291503 0.3055050 1.3144821
``````

#### Filling empty combinations with zeros

Sometimes there will be empty combinations of factors in the summary data frame – that is, combinations of factors that are possible, but don’t actually occur in the original data frame. It is often useful to automatically fill in those combinations in the summary data frame with NA’s. To do this, set `.drop=FALSE` in the call to `ddply` or `summarySE`.

Example usage:

``````# First remove some all Male+Placebo entries from the data
dataSub <- subset(data, !(sex=="M" & condition=="placebo"))

# If we summarize the data, there will be a missing row for Male+Placebo,
# since there were no cases with this combination.
summarySE(dataSub, measurevar="change", groupvars=c("sex", "condition"))
#>   sex condition  N    change        sd        se        ci
#> 1   F   aspirin  5 -3.420000 0.8642916 0.3865230 1.0731598
#> 2   F   placebo 12 -2.058333 0.5247655 0.1514867 0.3334201
#> 3   M   aspirin  9 -5.411111 1.1307569 0.3769190 0.8691767

# Set .drop=FALSE to NOT drop those combinations
summarySE(dataSub, measurevar="change", groupvars=c("sex", "condition"), .drop=FALSE)
#> Warning in qt(conf.interval/2 + 0.5, datac\$N - 1): NaNs produced
#>   sex condition  N    change        sd        se        ci
#> 1   F   aspirin  5 -3.420000 0.8642916 0.3865230 1.0731598
#> 2   F   placebo 12 -2.058333 0.5247655 0.1514867 0.3334201
#> 3   M   aspirin  9 -5.411111 1.1307569 0.3769190 0.8691767
#> 4   M   placebo  0       NaN        NA        NA        NA
``````

### Using summaryBy

To collapse the data using the `summarizeBy()` function:

``````library(doBy)

# Run the functions length, mean, and sd on the value of "change" for each group,
# broken down by sex + condition
cdata <- summaryBy(change ~ sex + condition, data=data, FUN=c(length,mean,sd))
cdata
#>   sex condition change.length change.mean change.sd
#> 1   F   aspirin             5   -3.420000 0.8642916
#> 2   F   placebo            12   -2.058333 0.5247655
#> 3   M   aspirin             9   -5.411111 1.1307569
#> 4   M   placebo             4   -0.975000 0.7804913

# Rename column change.length to just N
names(cdata)[names(cdata)=="change.length"] <- "N"

# Calculate standard error of the mean
cdata\$change.se <- cdata\$change.sd / sqrt(cdata\$N)
cdata
#>   sex condition  N change.mean change.sd change.se
#> 1   F   aspirin  5   -3.420000 0.8642916 0.3865230
#> 2   F   placebo 12   -2.058333 0.5247655 0.1514867
#> 3   M   aspirin  9   -5.411111 1.1307569 0.3769190
#> 4   M   placebo  4   -0.975000 0.7804913 0.3902456
``````

Note that if you have any within-subjects variables, these standard error values may not be useful for comparing groups. See ../../Graphs/Plotting means and error bars (ggplot2) for information on how to make error bars for graphs with within-subjects variables.

#### Handling missing data

If there are NA’s in the data, you need to pass the flag `na.rm=TRUE` to the functions. Normally you could pass it to `summaryBy()` and it would get passed to each of the functions called, but `length()` does not recognize it and so it won’t work. One way around it is to define a new length function that handles the NA’s.

``````# New version of length which can handle NA's: if na.rm==T, don't count them
length2 <- function (x, na.rm=FALSE) {
if (na.rm) sum(!is.na(x))
else       length(x)
}

# Put some NA's in the data
dataNA <- data
dataNA\$change[11:14] <- NA

cdataNA <- summaryBy(change ~ sex + condition, data=dataNA,
FUN=c(length2, mean, sd), na.rm=TRUE)
cdataNA
#>   sex condition change.length2 change.mean change.sd
#> 1   F   aspirin              4   -3.425000 0.9979145
#> 2   F   placebo             12   -2.058333 0.5247655
#> 3   M   aspirin              7   -5.142857 1.0674848
#> 4   M   placebo              3   -1.300000 0.5291503

# Now, do the same as before
``````

#### A function for mean, count, standard deviation, standard error of the mean, and confidence interval

Instead of manually specifying all the values you want and then calculating the standard error, as shown above, this function will handle all of those details. It will do all the things described here:

• Find the mean, standard deviation, and count (N)
• Find the standard error of the mean (again, this may not be what you want if you are collapsing over a within-subject variable. See ../../Graphs/Plotting means and error bars (ggplot2) for information on how to make error bars for graphs with within-subjects variables.)
• Find a 95% confidence interval (or other value, if desired)
• Rename the columns so that the resulting data frame is easier to work with

To use, put this function in your code and call it as demonstrated below.

``````## Summarizes data.
## Gives count, mean, standard deviation, standard error of the mean, and confidence
## interval (default 95%).
##   data: a data frame.
##   measurevar: the name of a column that contains the variable to be summariezed
##   groupvars: a vector containing names of columns that contain grouping variables
##   na.rm: a boolean that indicates whether to ignore NA's
##   conf.interval: the percent range of the confidence interval (default is 95%)
summarySE <- function(data=NULL, measurevar, groupvars=NULL, na.rm=FALSE, conf.interval=.95) {
library(doBy)

# New version of length which can handle NA's: if na.rm==T, don't count them
length2 <- function (x, na.rm=FALSE) {
if (na.rm) sum(!is.na(x))
else       length(x)
}

# Collapse the data
formula <- as.formula(paste(measurevar, paste(groupvars, collapse=" + "), sep=" ~ "))
datac <- summaryBy(formula, data=data, FUN=c(length2,mean,sd), na.rm=na.rm)

# Rename columns
names(datac)[ names(datac) == paste(measurevar, ".mean",    sep="") ] <- measurevar
names(datac)[ names(datac) == paste(measurevar, ".sd",      sep="") ] <- "sd"
names(datac)[ names(datac) == paste(measurevar, ".length2", sep="") ] <- "N"

datac\$se <- datac\$sd / sqrt(datac\$N)  # Calculate standard error of the mean

# Confidence interval multiplier for standard error
# Calculate t-statistic for confidence interval:
# e.g., if conf.interval is .95, use .975 (above/below), and use df=N-1
ciMult <- qt(conf.interval/2 + .5, datac\$N-1)
datac\$ci <- datac\$se * ciMult

return(datac)
}
``````

Example usage (with 95% confidence interval). Instead of doing all the steps manually, as done previously, the `summarySE` function does it all in one step:

``````summarySE(data, measurevar="change", groupvars=c("sex","condition"))
#>   sex condition  N    change        sd        se        ci
#> 1   F   aspirin  5 -3.420000 0.8642916 0.3865230 1.0731598
#> 2   F   placebo 12 -2.058333 0.5247655 0.1514867 0.3334201
#> 3   M   aspirin  9 -5.411111 1.1307569 0.3769190 0.8691767
#> 4   M   placebo  4 -0.975000 0.7804913 0.3902456 1.2419358

# With a data set with NA's, use na.rm=TRUE
summarySE(dataNA, measurevar="change", groupvars=c("sex","condition"), na.rm=TRUE)
#>   sex condition  N    change        sd        se        ci
#> 1   F   aspirin  4 -3.425000 0.9979145 0.4989572 1.5879046
#> 2   F   placebo 12 -2.058333 0.5247655 0.1514867 0.3334201
#> 3   M   aspirin  7 -5.142857 1.0674848 0.4034713 0.9872588
#> 4   M   placebo  3 -1.300000 0.5291503 0.3055050 1.3144821
``````

#### Filling empty combinations with zeros

Sometimes there will be empty combinations of factors in the summary data frame – that is, combinations of factors that are possible, but don’t actually occur in the original data frame. It is often useful to automatically fill in those combinations in the summary data frame with zeros.

This function will fill in those missing combinations with zeros:

``````fillMissingCombs <- function(df, factors, measures) {

# Make a list of the combinations of factor levels
levelList <- list()
for (f in factors) {  levelList[[f]] <- levels(df[,f])  }

fullFactors <- expand.grid(levelList)

dfFull <- merge(fullFactors, df, all.x=TRUE)

# Wherever there is an NA in the measure vars, replace with 0
for (m in measures) {
dfFull[is.na(dfFull[,m]), m] <- 0
}

return(dfFull)
}
``````

Example usage:

``````# First remove some all Male+Placebo entries from the data
dataSub <- subset(data, !(sex=="M" & condition=="placebo"))

# If we summarize the data, there will be a missing row for Male+Placebo,
# since there were no cases with this combination.
cdataSub <- summarySE(dataSub, measurevar="change", groupvars=c("sex", "condition"))
cdataSub
#>   sex condition  N    change        sd        se        ci
#> 1   F   aspirin  5 -3.420000 0.8642916 0.3865230 1.0731598
#> 2   F   placebo 12 -2.058333 0.5247655 0.1514867 0.3334201
#> 3   M   aspirin  9 -5.411111 1.1307569 0.3769190 0.8691767

# This will fill in the missing combinations with zeros
fillMissingCombs(cdataSub, factors=c("sex","condition"), measures=c("N","change","sd","se","ci"))
#>   sex condition  N    change        sd        se        ci
#> 1   F   aspirin  5 -3.420000 0.8642916 0.3865230 1.0731598
#> 2   F   placebo 12 -2.058333 0.5247655 0.1514867 0.3334201
#> 3   M   aspirin  9 -5.411111 1.1307569 0.3769190 0.8691767
#> 4   M   placebo  0  0.000000 0.0000000 0.0000000 0.0000000
``````

### Using aggregate

The `aggregate` function is more difficult to use, but it is included in the base R installation and does not require the installation of another package.

``````# Get a count of number of subjects in each category (sex*condition)
cdata <- aggregate(data["subject"], by=data[c("sex","condition")], FUN=length)
cdata
#>   sex condition subject
#> 1   F   aspirin       5
#> 2   M   aspirin       9
#> 3   F   placebo      12
#> 4   M   placebo       4

# Rename "subject" column to "N"
names(cdata)[names(cdata)=="subject"] <- "N"
cdata
#>   sex condition  N
#> 1   F   aspirin  5
#> 2   M   aspirin  9
#> 3   F   placebo 12
#> 4   M   placebo  4

# Sort by sex first
cdata <- cdata[order(cdata\$sex),]
cdata
#>   sex condition  N
#> 1   F   aspirin  5
#> 3   F   placebo 12
#> 2   M   aspirin  9
#> 4   M   placebo  4

# We also keep the __before__ and __after__ columns:
# Get the average effect size by sex and condition
cdata.means <- aggregate(data[c("before","after","change")],
by = data[c("sex","condition")], FUN=mean)
cdata.means
#>   sex condition   before     after    change
#> 1   F   aspirin 11.06000  7.640000 -3.420000
#> 2   M   aspirin 11.26667  5.855556 -5.411111
#> 3   F   placebo 10.13333  8.075000 -2.058333
#> 4   M   placebo 11.47500 10.500000 -0.975000

# Merge the data frames
cdata <- merge(cdata, cdata.means)
cdata
#>   sex condition  N   before     after    change
#> 1   F   aspirin  5 11.06000  7.640000 -3.420000
#> 2   F   placebo 12 10.13333  8.075000 -2.058333
#> 3   M   aspirin  9 11.26667  5.855556 -5.411111
#> 4   M   placebo  4 11.47500 10.500000 -0.975000

# Get the sample (n-1) standard deviation for "change"
cdata.sd <- aggregate(data["change"],
by = data[c("sex","condition")], FUN=sd)
# Rename the column to change.sd
names(cdata.sd)[names(cdata.sd)=="change"] <- "change.sd"
cdata.sd
#>   sex condition change.sd
#> 1   F   aspirin 0.8642916
#> 2   M   aspirin 1.1307569
#> 3   F   placebo 0.5247655
#> 4   M   placebo 0.7804913

# Merge
cdata <- merge(cdata, cdata.sd)
cdata
#>   sex condition  N   before     after    change change.sd
#> 1   F   aspirin  5 11.06000  7.640000 -3.420000 0.8642916
#> 2   F   placebo 12 10.13333  8.075000 -2.058333 0.5247655
#> 3   M   aspirin  9 11.26667  5.855556 -5.411111 1.1307569
#> 4   M   placebo  4 11.47500 10.500000 -0.975000 0.7804913

# Calculate standard error of the mean
cdata\$change.se <- cdata\$change.sd / sqrt(cdata\$N)
cdata
#>   sex condition  N   before     after    change change.sd change.se
#> 1   F   aspirin  5 11.06000  7.640000 -3.420000 0.8642916 0.3865230
#> 2   F   placebo 12 10.13333  8.075000 -2.058333 0.5247655 0.1514867
#> 3   M   aspirin  9 11.26667  5.855556 -5.411111 1.1307569 0.3769190
#> 4   M   placebo  4 11.47500 10.500000 -0.975000 0.7804913 0.3902456
``````

If you have `NA`’s in your data and wish to skip them, use `na.rm=TRUE`:

``````cdata.means <- aggregate(data[c("before","after","change")],
by = data[c("sex","condition")],
FUN=mean, na.rm=TRUE)
cdata.means
#>   sex condition   before     after    change
#> 1   F   aspirin 11.06000  7.640000 -3.420000
#> 2   M   aspirin 11.26667  5.855556 -5.411111
#> 3   F   placebo 10.13333  8.075000 -2.058333
#> 4   M   placebo 11.47500 10.500000 -0.975000
``````