# Finding sequences of identical values

## Problem

You want to find sequences of identical values in a vector or factor.

## Solution

You could find sequences of identical values by looping over a vector element by element, but explicit loops like this are slow in R. A much faster way is to use the `rle()` function, which computes a run-length encoding: the length of each run and the value that it repeats.

```
# Example data
v <- c("A","A","A", "B","B","B","B", NA,NA, "C","C", "B", "C","C","C")
v
#> [1] "A" "A" "A" "B" "B" "B" "B" NA NA "C" "C" "B" "C" "C" "C"
vr <- rle(v)
vr
#> Run Length Encoding
#> lengths: int [1:7] 3 4 1 1 2 1 3
#> values : chr [1:7] "A" "B" NA NA "C" "B" "C"
```

The RLE-coded data can be converted back to a vector with `inverse.rle()`.

```
inverse.rle(vr)
#> [1] "A" "A" "A" "B" "B" "B" "B" NA NA "C" "C" "B" "C" "C" "C"
```
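The `lengths` component can also be used to locate each run in the original vector. The following is a minimal sketch, assuming a small example vector `x` without `NA`s (the names `x`, `run_start`, `run_end`, and `runs` are introduced here for illustration): the cumulative sum of the lengths gives the end position of each run, and the start position follows from it.

```r
# Hypothetical example vector, without NAs
x <- c("A","A","A", "B","B", "C", "B","B","B","B")
xr <- rle(x)

# The end position of each run is the cumulative sum of the run lengths
run_end <- cumsum(xr$lengths)
run_end
#> [1]  3  5  6 10

# The start position is the end position minus the run length, plus one
run_start <- run_end - xr$lengths + 1
run_start
#> [1] 1 4 6 7

# Collect everything in one table, one row per run
runs <- data.frame(value = xr$values, start = run_start, end = run_end,
                   length = xr$lengths, stringsAsFactors = FALSE)

# Keep only the runs of at least 3 identical values
long_runs <- runs[runs$length >= 3, ]
```

Here `long_runs` should contain two rows: the run of `"A"` at positions 1 to 3 and the run of `"B"` at positions 7 to 10.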

One potential problem is that each `NA` is treated as a run of length 1, even when the `NA`’s are next to each other. You can work around this by replacing the `NA`’s with a designated placeholder value. For numeric vectors, `Inf` or some other number can be used; for character vectors, any string may be used. Of course, the placeholder must not otherwise appear in the vector.

```
w <- v
w[is.na(w)] <- "ZZZ"
w
#> [1] "A" "A" "A" "B" "B" "B" "B" "ZZZ" "ZZZ" "C" "C" "B" "C" "C"
#> [15] "C"
wr <- rle(w)
wr
#> Run Length Encoding
#> lengths: int [1:6] 3 4 2 2 1 3
#> values : chr [1:6] "A" "B" "ZZZ" "C" "B" "C"
# Replace the ZZZ's with NA in the RLE-coded data
wr$values[ wr$values=="ZZZ" ] <- NA
wr
#> Run Length Encoding
#> lengths: int [1:6] 3 4 2 2 1 3
#> values : chr [1:6] "A" "B" NA "C" "B" "C"
w2 <- inverse.rle(wr)
w2
#> [1] "A" "A" "A" "B" "B" "B" "B" NA NA "C" "C" "B" "C" "C" "C"
```
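The same workaround applies to numeric vectors. This is a minimal sketch, assuming hypothetical data `n` in which `Inf` does not otherwise occur; the check at the top makes that assumption explicit.

```r
# Hypothetical numeric data with adjacent NAs
n <- c(1, 1, NA, NA, NA, 2, 2, 1)

# Inf is only a safe placeholder if it does not already occur in the data
stopifnot(!any(is.infinite(n)))

m <- n
m[is.na(m)] <- Inf
nr <- rle(m)
nr$lengths
#> [1] 2 3 2 1

# Replace the Inf's with NA in the RLE-coded data, then invert the coding
nr$values[is.infinite(nr$values)] <- NA
n2 <- inverse.rle(nr)
identical(n, n2)
#> [1] TRUE
```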

### Working with factors

Even though factors are basically just integer vectors with some information about levels attached, the `rle()` function doesn’t work with factors; it stops with an error. The solution is to manually convert the factor to an integer vector or a character vector. Using an integer vector is fast and memory-efficient, which may matter for large data sets, but the integer codes are difficult to interpret. Using a character vector is slower and requires more memory, but the result can be much easier to interpret.

```
# Suppose this is the factor we have to work with
f <- factor(v)
f
#> [1] A A A B B B B <NA> <NA> C C B C C C
#> Levels: A B C
# Store the levels of the factor.
# This isn't strictly necessary, but it is useful for preserving order of levels
f_levels <- levels(f)
f_levels
#> [1] "A" "B" "C"
fc <- as.character(f)
fc[ is.na(fc) ] <- "ZZZ"
fc
#> [1] "A" "A" "A" "B" "B" "B" "B" "ZZZ" "ZZZ" "C" "C" "B" "C" "C"
#> [15] "C"
fr <- rle(fc)
fr
#> Run Length Encoding
#> lengths: int [1:6] 3 4 2 2 1 3
#> values : chr [1:6] "A" "B" "ZZZ" "C" "B" "C"
# Replace the ZZZ's with NA in the RLE-coded data
fr$values[ fr$values=="ZZZ" ] <- NA
fr
#> Run Length Encoding
#> lengths: int [1:6] 3 4 2 2 1 3
#> values : chr [1:6] "A" "B" NA "C" "B" "C"
# Invert RLE coding and convert back to a factor
f2 <- inverse.rle(fr)
f2 <- factor(f2, levels=f_levels)
f2
#> [1] A A A B B B B <NA> <NA> C C B C C C
#> Levels: A B C
```
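The integer-vector route mentioned above can be sketched in a similar way. This is one possible approach rather than the only one: it assumes that `0` is a safe placeholder for `NA`, which holds because the integer codes of a factor start at 1. The names `fi`, `fir`, and `f3` are introduced here for illustration.

```r
# Same factor as above, rebuilt here so that the block is self-contained
f <- factor(c("A","A","A", "B","B","B","B", NA,NA, "C","C", "B", "C","C","C"))

# The integer codes index into levels(f); NA stays NA
fi <- as.integer(f)

# 0 is never a valid level code, so it can stand in for NA
fi[is.na(fi)] <- 0L
fir <- rle(fi)
fir$lengths
#> [1] 3 4 2 2 1 3
fir$values
#> [1] 1 2 0 3 2 3

# Replace the 0's with NA in the RLE-coded data
fir$values[fir$values == 0L] <- NA

# Invert the RLE coding, map the codes back to labels, and rebuild the factor
f3 <- factor(levels(f)[inverse.rle(fir)], levels = levels(f))
identical(f, f3)
#> [1] TRUE
```

The run values stay as compact integer codes until the last step, where indexing `levels(f)` turns them back into labels.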