Finding sequences of identical values

Problem

You want to find sequences of identical values in a vector or factor.

Solution

You could find sequences of identical values by looping over the vector element by element, but explicit loops are slow in R. A much faster way is to use the rle() function, which computes a run-length encoding: the length of each run and the value that run repeats.

# Example data
v <- c("A","A","A", "B","B","B","B", NA,NA, "C","C", "B", "C","C","C")
v
#>  [1] "A" "A" "A" "B" "B" "B" "B" NA  NA  "C" "C" "B" "C" "C" "C"

vr <- rle(v)
vr
#> Run Length Encoding
#>   lengths: int [1:7] 3 4 1 1 2 1 3
#>   values : chr [1:7] "A" "B" NA NA "C" "B" "C"
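
The lengths in the encoded result can also be turned into positions. The following is a minimal sketch, not part of the original recipe: it uses cumsum() on the lengths to work out where each run starts and ends. The names starts, ends, runs, and longest are illustrative.

```r
# Same example data as above
v <- c("A","A","A", "B","B","B","B", NA,NA, "C","C", "B", "C","C","C")
vr <- rle(v)

# The last index of each run is the running total of the run lengths
ends <- cumsum(vr$lengths)
# The first index follows from the last index and the run length
starts <- ends - vr$lengths + 1L

# One row per run (illustrative layout)
runs <- data.frame(value = vr$values, start = starts, end = ends,
                   length = vr$lengths)

# The longest run; which.max() returns the first one if there are ties
longest <- runs[which.max(runs$length), ]
```

Here longest is the run of four "B" values at positions 4 to 7. The two NAs still show up as separate runs of length 1.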

The RLE-coded data can be converted back to a vector with inverse.rle().

inverse.rle(vr)
#>  [1] "A" "A" "A" "B" "B" "B" "B" NA  NA  "C" "C" "B" "C" "C" "C"

One potential problem is that each NA is treated as a run of length 1, even when the NAs are next to each other. You can work around this by replacing the NAs with a designated placeholder value before calling rle(). For numeric vectors, Inf or some other number can be used; for character vectors, any string can be used. The placeholder must not otherwise appear in the vector.

w <- v
w[is.na(w)] <- "ZZZ"
w
#>  [1] "A"   "A"   "A"   "B"   "B"   "B"   "B"   "ZZZ" "ZZZ" "C"   "C"   "B"   "C"   "C"  
#> [15] "C"

wr <- rle(w)
wr
#> Run Length Encoding
#>   lengths: int [1:6] 3 4 2 2 1 3
#>   values : chr [1:6] "A" "B" "ZZZ" "C" "B" "C"

# Replace the ZZZ's with NA in the RLE-coded data
wr$values[ wr$values=="ZZZ" ] <- NA
wr
#> Run Length Encoding
#>   lengths: int [1:6] 3 4 2 2 1 3
#>   values : chr [1:6] "A" "B" NA "C" "B" "C"

w2 <- inverse.rle(wr)
w2
#>  [1] "A" "A" "A" "B" "B" "B" "B" NA  NA  "C" "C" "B" "C" "C" "C"
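
The same workaround applies to numeric vectors. The sketch below assumes Inf as the placeholder, as suggested above, and first checks that Inf does not already occur in the data. The vector x is made up for illustration.

```r
# Hypothetical numeric data with adjacent NAs
x <- c(1, 1, NA, NA, NA, 2, 2, 1)

# The placeholder must not already be present in the data
stopifnot(!any(x == Inf, na.rm = TRUE))

y <- x
y[is.na(y)] <- Inf

xr <- rle(y)
# Put the NAs back into the RLE-coded data
xr$values[xr$values == Inf] <- NA

# Invert the RLE coding; the result should match the original vector
x2 <- inverse.rle(xr)
```

After this, xr holds four runs, with the three NAs encoded as a single run of length 3.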

Working with factors

Even though factors are essentially integer vectors with level information attached, the rle() function does not accept factors. The solution is to convert the factor to an integer vector or a character vector first. An integer vector is fast and memory-efficient, which may matter for large data sets, but the integer codes are hard to interpret. A character vector is slower and uses more memory, but is much easier to interpret.

# Suppose this is the factor we have to work with
f <- factor(v)
f
#>  [1] A    A    A    B    B    B    B    <NA> <NA> C    C    B    C    C    C   
#> Levels: A B C

# Store the levels of the factor.
# This isn't strictly necessary, but it is useful for preserving the order of the levels
f_levels <- levels(f)
f_levels
#> [1] "A" "B" "C"

fc <- as.character(f)
fc[ is.na(fc) ] <- "ZZZ"
fc
#>  [1] "A"   "A"   "A"   "B"   "B"   "B"   "B"   "ZZZ" "ZZZ" "C"   "C"   "B"   "C"   "C"  
#> [15] "C"

fr <- rle(fc)
fr
#> Run Length Encoding
#>   lengths: int [1:6] 3 4 2 2 1 3
#>   values : chr [1:6] "A" "B" "ZZZ" "C" "B" "C"

# Replace the ZZZ's with NA in the RLE-coded data
fr$values[ fr$values=="ZZZ" ] <- NA
fr
#> Run Length Encoding
#>   lengths: int [1:6] 3 4 2 2 1 3
#>   values : chr [1:6] "A" "B" NA "C" "B" "C"

# Invert RLE coding and convert back to a factor
f2 <- inverse.rle(fr)
f2 <- factor(f2, levels=f_levels)
f2
#>  [1] A    A    A    B    B    B    B    <NA> <NA> C    C    B    C    C    C   
#> Levels: A B C
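
The text above mentions that a factor can also be converted to an integer vector. A minimal sketch of that route follows, assuming 0L as the placeholder for NA. This works because the integer codes of a factor start at 1, so 0 never appears as a real code. The variable names are illustrative.

```r
f <- factor(c("A","A","A", "B","B","B","B", NA,NA, "C","C", "B", "C","C","C"))

# Integer codes of the factor; NA stays NA at this point
fi <- as.integer(f)
fi[is.na(fi)] <- 0L

fir <- rle(fi)
# Put the NAs back into the RLE-coded data
fir$values[fir$values == 0L] <- NA

# The run values are integer codes; index into the levels to read them
run_labels <- levels(f)[fir$values]

# Invert the RLE coding and rebuild the factor from the codes
f3 <- factor(levels(f)[inverse.rle(fir)], levels = levels(f))
```

The run lengths are the same as in the character version. Only the run values differ, being codes that have to be mapped back through levels(f) before they can be read.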