## Problem

You want to find sequences of identical values in a vector or factor.

## Solution

You could find runs of identical values by looping over the vector element by element, but explicit loops like this are slow in R. A much faster way is the `rle()` function, which computes a run-length encoding: the length and the value of each run.

``````
# Example data
v <- c("A","A","A", "B","B","B","B", NA,NA, "C","C", "B", "C","C","C")
v
#>  [1] "A" "A" "A" "B" "B" "B" "B" NA  NA  "C" "C" "B" "C" "C" "C"

vr <- rle(v)
vr
#> Run Length Encoding
#>   lengths: int [1:7] 3 4 1 1 2 1 3
#>   values : chr [1:7] "A" "B" NA NA "C" "B" "C"
``````
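The run lengths can also be turned into positions in the original vector. The following is a minimal sketch, not part of the original recipe; the names `run_start`, `run_end`, and `runs` are chosen here purely for illustration.

``````r
# Same example data as above
v <- c("A","A","A", "B","B","B","B", NA,NA, "C","C", "B", "C","C","C")
vr <- rle(v)

# The end index of each run is the cumulative sum of the run lengths;
# the start index follows from the end index and the run length
run_end   <- cumsum(vr$lengths)
run_start <- run_end - vr$lengths + 1

# Collect everything in one table (illustrative layout)
runs <- data.frame(value = vr$values, start = run_start, end = run_end,
                   length = vr$lengths, stringsAsFactors = FALSE)

# Keep only runs of at least 3 identical values
long_runs <- runs[runs$length >= 3, ]
long_runs
``````

With the example data, this keeps the run of `"A"` at positions 1 to 3, the run of `"B"` at positions 4 to 7, and the final run of `"C"` at positions 13 to 15.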

The RLE-coded data can be converted back to a vector with `inverse.rle()`.

``````
inverse.rle(vr)
#>  [1] "A" "A" "A" "B" "B" "B" "B" NA  NA  "C" "C" "B" "C" "C" "C"
``````
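To see what the inversion amounts to, the vector can be rebuilt by hand by repeating each value by its run length. This is only an illustrative check; the variable name `rebuilt` is an assumption introduced here.

``````r
v <- c("A","A","A", "B","B","B","B", NA,NA, "C","C", "B", "C","C","C")
vr <- rle(v)

# Repeat each run value by the corresponding run length
rebuilt <- rep(vr$values, times = vr$lengths)

# Same result as inverse.rle(), and the round trip recovers the original vector
identical(rebuilt, inverse.rle(vr))
#> [1] TRUE
identical(rebuilt, v)
#> [1] TRUE
``````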

One potential problem is that each `NA` is treated as a run of length 1, even when the `NA`s are adjacent, because `NA` never compares equal to anything, including another `NA`. To work around this, replace the `NA`s with a designated sentinel value before calling `rle()`. For numeric vectors, `Inf` or some other number can be used; for character vectors, any string will do. The sentinel value must not otherwise appear in the vector.

``````
w <- v
w[is.na(w)] <- "ZZZ"
w
#>  [1] "A"   "A"   "A"   "B"   "B"   "B"   "B"   "ZZZ" "ZZZ" "C"   "C"   "B"   "C"   "C"
#> [15] "C"

wr <- rle(w)
wr
#> Run Length Encoding
#>   lengths: int [1:6] 3 4 2 2 1 3
#>   values : chr [1:6] "A" "B" "ZZZ" "C" "B" "C"

# Replace the ZZZ's with NA in the RLE-coded data
wr$values[ wr$values=="ZZZ" ] <- NA
wr
#> Run Length Encoding
#>   lengths: int [1:6] 3 4 2 2 1 3
#>   values : chr [1:6] "A" "B" NA "C" "B" "C"

w2 <- inverse.rle(wr)
w2
#>  [1] "A" "A" "A" "B" "B" "B" "B" NA  NA  "C" "C" "B" "C" "C" "C"
``````
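The same workaround applies to numeric vectors, with `Inf` as the sentinel. The sketch below uses made-up data and the illustrative names `x`, `xs`, `xr`, and `x2`; it assumes `Inf` does not occur in the data and checks that assumption first.

``````r
# Hypothetical numeric data with adjacent NAs
x <- c(1, 1, NA, NA, NA, 2, 2, 5)

# The sentinel must not already appear in the vector
stopifnot(!any(x == Inf, na.rm = TRUE))

# Replace the NAs with the sentinel, then encode
xs <- x
xs[is.na(xs)] <- Inf
xr <- rle(xs)
xr$lengths
#> [1] 2 3 2 1

# Put NA back into the RLE-coded data and invert
xr$values[xr$values == Inf] <- NA
x2 <- inverse.rle(xr)
identical(x2, x)
#> [1] TRUE
``````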

### Working with factors

Even though factors are basically just integer vectors with some information about levels attached, the `rle()` function doesn’t work with factors. The solution is to manually convert the factor to an integer vector or a character vector. Using an integer vector is fast and memory-efficient, which may matter for large data sets, but it is difficult to interpret. Using a character vector is slower and requires more memory, but can be much easier to interpret.

``````
# Suppose this is the factor we have to work with
f <- factor(v)
f
#>  [1] A    A    A    B    B    B    B    <NA> <NA> C    C    B    C    C    C
#> Levels: A B C

# Store the levels of the factor.
# This isn't strictly necessary, but it is useful for preserving order of levels
f_levels <- levels(f)
f_levels
#> [1] "A" "B" "C"

fc <- as.character(f)
fc[ is.na(fc) ] <- "ZZZ"
fc
#>  [1] "A"   "A"   "A"   "B"   "B"   "B"   "B"   "ZZZ" "ZZZ" "C"   "C"   "B"   "C"   "C"
#> [15] "C"

fr <- rle(fc)
fr
#> Run Length Encoding
#>   lengths: int [1:6] 3 4 2 2 1 3
#>   values : chr [1:6] "A" "B" "ZZZ" "C" "B" "C"

# Replace the ZZZ's with NA in the RLE-coded data
fr$values[ fr$values=="ZZZ" ] <- NA
fr
#> Run Length Encoding
#>   lengths: int [1:6] 3 4 2 2 1 3
#>   values : chr [1:6] "A" "B" NA "C" "B" "C"

# Invert RLE coding and convert back to a factor
f2 <- inverse.rle(fr)
f2 <- factor(f2, levels=f_levels)
f2
#>  [1] A    A    A    B    B    B    B    <NA> <NA> C    C    B    C    C    C
#> Levels: A B C
``````
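The integer-vector route mentioned above can be sketched in the same way. This is one possible approach under stated assumptions, not a definitive implementation: it uses `0L` as the sentinel, which is safe here because the integer codes of a factor start at 1. The names `fi`, `fir`, and `f3` are introduced only for this example.

``````r
v <- c("A","A","A", "B","B","B","B", NA,NA, "C","C", "B", "C","C","C")
f <- factor(v)

# Integer codes of the factor; NA entries stay NA
fi <- as.integer(f)

# Factor codes are 1, 2, 3, ..., so 0 cannot clash with a real level
fi[is.na(fi)] <- 0L

fir <- rle(fi)
fir$lengths
#> [1] 3 4 2 2 1 3

# Restore NA in the RLE-coded data
fir$values[fir$values == 0L] <- NA

# Invert, map the codes back to level labels, and rebuild the factor
f3 <- factor(levels(f)[inverse.rle(fir)], levels = levels(f))
identical(f3, f)
#> [1] TRUE
``````

The run values here are level codes rather than labels, which is why this version is harder to read than the character-vector version, even though it avoids creating a character copy of the data.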