Converting data between wide and long format

Table of contents

Problem

You want to do convert data from a wide format to a long format.

Many functions in R expect data to be in a long format rather than a wide format. Programs like SPSS, however, often use wide-formatted data.

Solution

There are two methods for this. One uses the functions melt() and dcast() from a package called reshape2. The other uses a function called reshape(), which is confusingly not part of the reshape2 package; it is part of the base install of R. The melt and cast methods are very powerful and simpler to use, so only they will be covered here.

You may run into some problems with the code below if you have a version of ggplot2 prior to 0.9.0. Make sure you have the latest version by running update.packages().

Sample data

These data frames hold the same data, but in wide and long formats. They will each be converted to the other format below.

origdata.wide <- read.table(header=T, text='
 subject sex control cond1 cond2
       1   M     7.9  12.3  10.7
       2   F     6.3  10.6  11.1
       3   F     9.5  13.1  13.8
       4   M    11.5  13.4  12.9
 ')
# Make sure the subject column is a factor
origdata.wide$subject <- factor(origdata.wide$subject)
origdata.long <- read.table(header=T, text='
 subject sex condition measurement
       1   M   control         7.9
       1   M     cond1        12.3
       1   M     cond2        10.7
       2   F   control         6.3
       2   F     cond1        10.6
       2   F     cond2        11.1
       3   F   control         9.5
       3   F     cond1        13.1
       3   F     cond2        13.8
       4   M   control        11.5
       4   M     cond1        13.4
       4   M     cond2        12.9
 ')
# Make sure the subject column is a factor
origdata.long$subject <- factor(origdata.long$subject)

From wide to long

Use melt from the reshape2 package:

origdata.wide
# subject sex control cond1 cond2
#       1   M     7.9  12.3  10.7
#       2   F     6.3  10.6  11.1
#       3   F     9.5  13.1  13.8
#       4   M    11.5  13.4  12.9

library(reshape2)

# Specify id.vars: the variables to keep but not split apart on
melt(origdata.wide, id.vars=c("subject","sex"))
# subject sex variable value
#       1   M  control   7.9
#       2   F  control   6.3
#       3   F  control   9.5
#       4   M  control  11.5
#       1   M    cond1  12.3
#       2   F    cond1  10.6
#       3   F    cond1  13.1
#       4   M    cond1  13.4
#       1   M    cond2  10.7
#       2   F    cond2  11.1
#       3   F    cond2  13.8
#       4   M    cond2  12.9

There are options for melt that can make the output a little easier to work with:

data.long <- melt(origdata.wide,
        # ID variables - all the variables to keep but not split apart on
    id.vars=c("subject","sex"),
        # The source columns
    measure.vars=c("control", "cond1", "cond2" ),
        # Name of the destination column that will identify the original
        # column that the measurement came from
    variable.name="condition",
    value.name="measurement"
    )
# subject sex condition measurement
#       1   M   control         7.9
#       2   F   control         6.3
#       3   F   control         9.5
#       4   M   control        11.5
#       1   M     cond1        12.3
#       2   F     cond1        10.6
#       3   F     cond1        13.1
#       4   M     cond1        13.4
#       1   M     cond2        10.7
#       2   F     cond2        11.1
#       3   F     cond2        13.8
#       4   M     cond2        12.9

If you leave out the measure.vars, melt will automatically use all the other variables as the id.vars. The reverse is true if you leave out id.vars.

If you don't specify variable.name, it will name that column "variable", and if you leave out value.name, it will name that column "measurement".

Optional: Rename the factor levels of the variable column.

# Rename factor names from "cond1" and "cond2" to "first" and "second"
levels(data.long$condition)[levels(data.long$condition)=="cond1"] <- "first"
levels(data.long$condition)[levels(data.long$condition)=="cond2"] <- "second"

# Sort by subject first, then by condition
data.long <- data.long[ order(data.long$subject, data.long$condition), ]
# subject sex condition measurement
#       1   M   control         7.9
#       1   M     first        12.3
#       1   M    second        10.7
#       2   F   control         6.3
#       2   F     first        10.6
#       2   F    second        11.1
#       3   F   control         9.5
#       3   F     first        13.1
#       3   F    second        13.8
#       4   M   control        11.5
#       4   M     first        13.4
#       4   M    second        12.9

From long to wide

The following code uses dcast to reshape the data. This function is meant for data frames; if you are working with arrays or matrices, use acast instead.

origdata.long
# subject sex condition measurement
#       1   M   control         7.9
#       1   M     cond1        12.3
#       1   M     cond2        10.7
#       2   F   control         6.3
#       2   F     cond1        10.6
#       2   F     cond2        11.1
#       3   F   control         9.5
#       3   F     cond1        13.1
#       3   F     cond2        13.8
#       4   M   control        11.5
#       4   M     cond1        13.4
#       4   M     cond2        12.9

# From the source:
# "subject" and "sex" are columns we want to keep the same
# "condition" is the column that contains the names of the new column to put things in
# "measurement" holds the measurements
library(reshape2)

data.wide <- dcast(origdata.long, subject + sex ~ condition, value.var="measurement")
# subject sex cond1 cond2 control
#       1   M  12.3  10.7     7.9
#       2   F  10.6  11.1     6.3
#       3   F  13.1  13.8     9.5
#       4   M  13.4  12.9    11.5

Optional: A few things to make the data look nicer.

# Rename cond1 to first, and cond2 to second
names(data.wide)[names(data.wide)=="cond1"] <- "first"
names(data.wide)[names(data.wide)=="cond2"] <- "second"

# Reorder the columns
data.wide <- data.wide[, c(1,2,5,3,4)]
# subject sex control first second
#       1   M     7.9  12.3   10.7
#       2   F     6.3  10.6   11.1
#       3   F     9.5  13.1   13.8
#       4   M    11.5  13.4   12.9

The order of factor levels determines the order of the columns. The level order can be changed before reshaping, or the columns can be re-orderedafterward.