Replace missing values (NA) with most recent non-NA by group

I would like to solve the following problem with dplyr, preferably with one of the window functions. I have a data frame with houses and buying prices. The following is an example:

``````houseID      year    price
1            1995    NA
1            1996    100
1            1997    NA
1            1998    120
1            1999    NA
2            1995    NA
2            1996    NA
2            1997    NA
2            1998    30
2            1999    NA
3            1995    NA
3            1996    44
3            1997    NA
3            1998    NA
3            1999    NA
``````

I would like to make a data frame like this:

``````houseID      year    price
1            1995    NA
1            1996    100
1            1997    100
1            1998    120
1            1999    120
2            1995    NA
2            1996    NA
2            1997    NA
2            1998    30
2            1999    30
3            1995    NA
3            1996    44
3            1997    44
3            1998    44
3            1999    44
``````

Here are some data in the right format:

``````# Number of houses
N = 15

# Data frame
df <- data.frame(
  houseID = rep(1:N, each = 10),
  year    = 1995:2004,
  price   = ifelse(runif(10 * N) > 0.15, NA, exp(rnorm(10 * N)))
)
``````

Is there a dplyr-way to do that?


These all use `na.locf` from the zoo package. Also note that `na.locf0` (also defined in zoo) is like `na.locf` except it defaults to `na.rm = FALSE` and requires a single vector argument. `na.locf2` defined in the first solution is also used in some of the others.
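
To make concrete what "last observation carried forward" means before turning to the zoo-based solutions, here is a minimal base R sketch. The helper name `locf_base` is hypothetical and is not part of zoo; it is only meant to illustrate the behaviour of `na.locf0`, under the assumption that leading NAs should stay NA.

```r
# Hypothetical helper (not from zoo): carry the last non-NA value forward.
locf_base <- function(x) {
  # For each position, the index of the most recent non-NA value (0 if none yet)
  idx <- cummax(seq_along(x) * !is.na(x))
  # Positions with no earlier observation must remain NA
  idx[idx == 0] <- NA
  x[idx]
}

# The example data from the question, as plain vectors
houseID <- rep(1:3, each = 5)
price   <- c(NA, 100, NA, 120, NA, NA, NA, NA, 30, NA, NA, 44, NA, NA, NA)

# ave() applies the helper within each house separately
filled <- ave(price, houseID, FUN = locf_base)
filled
```

Because the fill is applied per group, the 120 from house 1 never leaks into the leading NAs of house 2.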

dplyr

``````library(dplyr)
library(zoo)

na.locf2 <- function(x) na.locf(x, na.rm = FALSE)
df %>% group_by(houseID) %>% do(na.locf2(.)) %>% ungroup
``````

giving:

``````Source: local data frame [15 x 3]
Groups: houseID

   houseID year price
1        1 1995    NA
2        1 1996   100
3        1 1997   100
4        1 1998   120
5        1 1999   120
6        2 1995    NA
7        2 1996    NA
8        2 1997    NA
9        2 1998    30
10       2 1999    30
11       3 1995    NA
12       3 1996    44
13       3 1997    44
14       3 1998    44
15       3 1999    44
``````

A variation of this is:

``````df %>% group_by(houseID) %>% mutate(price = na.locf0(price)) %>% ungroup
``````

Other solutions below give output which is quite similar so we won't repeat it except where the format differs substantially.

Another possibility is to combine the `by` solution (shown further below) with dplyr:

``````df %>% by(df$houseID, na.locf2) %>% bind_rows
``````
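
For readers who would rather not depend on zoo, a different technique (not one of the solutions in this answer) is `fill()` from the tidyr package, which respects dplyr groups. The sketch below assumes tidyr is installed and rebuilds the question's example data inline; the name `filled` is just illustrative.

```r
library(dplyr)
library(tidyr)

# The example data from the question
df <- data.frame(
  houseID = rep(1:3, each = 5),
  year    = rep(1995:1999, times = 3),
  price   = c(NA, 100, NA, 120, NA, NA, NA, NA, 30, NA, NA, 44, NA, NA, NA)
)

# fill() carries the previous non-NA value downwards within each group
filled <- df %>%
  group_by(houseID) %>%
  fill(price, .direction = "down") %>%
  ungroup()

filled$price
```

Rows must already be ordered by year within each house for this to give the intended result, since `fill()` works on row order.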

by

``````library(zoo)

do.call(rbind, by(df, df$houseID, na.locf2))
``````

ave

``````library(zoo)

transform(df, price = ave(price, houseID, FUN = na.locf0))
``````

data.table

``````library(data.table)
library(zoo)

data.table(df)[, na.locf2(.SD), by = houseID]
``````

zoo

This solution uses zoo alone. It returns a wide rather than long result:

``````library(zoo)

z <- read.zoo(df, index = 2, split = 1, FUN = identity)
na.locf2(z)
``````

giving:

``````       1  2  3
1995  NA NA NA
1996 100 NA 44
1997 100 NA 44
1998 120 30 44
1999 120 30 44
``````

This solution could be combined with dplyr like this:

``````library(dplyr)
library(zoo)

df %>% read.zoo(index = 2, split = 1, FUN = identity) %>% na.locf2
``````

input

Here is the input used for the examples above:

``````df <- structure(list(houseID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 3L, 3L), year = c(1995L, 1996L, 1997L, 1998L,
1999L, 1995L, 1996L, 1997L, 1998L, 1999L, 1995L, 1996L, 1997L,
1998L, 1999L), price = c(NA, 100L, NA, 120L, NA, NA, NA, NA,
30L, NA, NA, 44L, NA, NA, NA)), .Names = c("houseID", "year",
"price"), class = "data.frame", row.names = c(NA, -15L))
``````

REVISED: Rearranged and added more solutions. Revised the dplyr/zoo solution to conform to the latest changes in dplyr. Applied fixes and factored out `na.locf2` from all solutions.

Tuesday, June 1, 2021


I think you can use `accumulate()` from purrr here. I've also made a wrapper function so that it can be used with different thresholds:

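``````library(dplyr)
library(purrr)  # provides accumulate()

# Returns a function that computes a running sum which restarts
# once the running total has reached `thresh`.
sum_reset_at <- function(thresh) {
  function(x) {
    accumulate(x, ~ if_else(.x >= thresh, .y, .x + .y))
  }
}

tib %>% mutate(c = sum_reset_at(5)(a))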
#       t     a     c
#   <dbl> <dbl> <dbl>
# 1     1     2     2
# 2     2     3     5
# 3     3     1     1
# 4     4     2     3
# 5     5     2     5
# 6     6     3     3
tib %>% mutate(c = sum_reset_at(4)(a))
#       t     a     c
#   <dbl> <dbl> <dbl>
# 1     1     2     2
# 2     2     3     5
# 3     3     1     1
# 4     4     2     3
# 5     5     2     5
# 6     6     3     3
tib %>% mutate(c = sum_reset_at(6)(a))
#       t     a     c
#   <dbl> <dbl> <dbl>
# 1     1     2     2
# 2     2     3     5
# 3     3     1     6
# 4     4     2     2
# 5     5     2     4
# 6     6     3     7
``````
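
The same resetting running sum can be sketched in base R with `Reduce(..., accumulate = TRUE)`, which may help show what `accumulate()` is doing. The function name `sum_reset_base` is hypothetical, and the vector `a` is reconstructed from the printed output above rather than taken from the original `tib`.

```r
# Hypothetical base R counterpart of sum_reset_at():
# keep adding until the running total has reached `thresh`, then restart.
sum_reset_base <- function(x, thresh) {
  Reduce(
    function(acc, nxt) if (acc >= thresh) nxt else acc + nxt,
    x,
    accumulate = TRUE
  )
}

# Column `a` as it appears in the printed tibbles
a <- c(2, 3, 1, 2, 2, 3)

sum_reset_base(a, 5)
sum_reset_base(a, 6)
```

Note that the check happens on the accumulated value before the next element is added, so a total may overshoot the threshold (as with the final 7 for a threshold of 6) before it resets.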
Saturday, June 19, 2021


Another alternative:

``````df <- sapply(df, as.character) # since your values are `factor`
df[is.na(df)] <- 0
``````

If you want blanks instead of zeroes

``````> df <- sapply(df, as.character)
> df[is.na(df)] <- " "
> df
     class    Year1 Year2 Year3 Year4 Year5
[1,] "classA" "A"   "A"   "A"   "A"   "A"
[2,] " "      " "   " "   " "   " "   " "
[3,] "classB" "B"   "B"   "B"   "B"   "B"
``````

If you want a data.frame, then just use `as.data.frame`:

``````> as.data.frame(df)
class Year1 Year2 Year3 Year4 Year5
1 classA     A     A     A     A     A
2
3 classB     B     B     B     B     B
``````
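
If the aim is to stay in a data frame throughout, one possible variation is to convert the columns with `lapply()` and assign back with `df[] <-`, which avoids the intermediate character matrix that `sapply()` produces. The toy data below is hypothetical and only mimics the shape of the output shown above.

```r
# Hypothetical toy data with factor columns and a fully missing row
df <- data.frame(
  class = factor(c("classA", NA, "classB")),
  Year1 = factor(c("A", NA, "B"))
)

# Convert every column to character while keeping the data.frame structure
df[] <- lapply(df, as.character)

# Replace the missing entries with an empty string
df[is.na(df)] <- ""

df
```

The result is still a data frame with character columns, so no `as.data.frame()` round trip is needed.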
Wednesday, June 23, 2021


I would like to offer an alternative approach that avoids copying the whole column (which is what both `Time[-n()]` and `replace` do) and modifies the data in place:

``````library(data.table)
indx <- setDT(df)[, .I[.N], by = .(user_id, tag)]$V1 # finding the last incidences per group
df[indx, Time := 0L] # modifying in place
df
#       user_id tag Time
#  1: 268096674   1    3
#  2: 268096674   1   10
#  3: 268096674   1    1
#  4: 268096674   1    0
#  5: 268096674   1    0
#  6: 268096674   2    0
#  7: 268096674   2    9
#  8: 268096674   2    0
#  9: 268096674   3    0
# 10: 268096674   3    0
``````
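
For comparison, the "last row per group" step can also be sketched in base R with `duplicated(..., fromLast = TRUE)`. Unlike the `:=` update above, this is not an in-place modification, and the toy data is hypothetical (only the column names are taken from the answer).

```r
# Hypothetical toy data using the same column names as above
df <- data.frame(
  user_id = rep(268096674, 6),
  tag     = c(1L, 1L, 1L, 2L, 2L, 3L),
  Time    = c(3L, 10L, 1L, 9L, 4L, 7L)
)

# TRUE for the last row of each (user_id, tag) combination:
# scanning from the end, the first occurrence of a key is not a duplicate
is_last <- !duplicated(df[c("user_id", "tag")], fromLast = TRUE)

# Set Time to 0 on those rows only
df$Time[is_last] <- 0L

df$Time
```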
Thursday, August 12, 2021


You can fill forwards and backwards, then set the rows where they don't match to `NA`.

``````library(zoo)
library(dplyr)

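df %>%
  mutate_if(is.factor, as.character) %>%
  group_by(ID) %>%
  mutate(
    # Fill backwards first, then keep a value only where the forward fill agrees.
    # na.rm = FALSE keeps the vector length unchanged if a group starts or ends with NA.
    result = na.locf(with_missing, fromLast = TRUE, na.rm = FALSE),
    result = ifelse(result == na.locf(with_missing, na.rm = FALSE), result, NA)
  )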

#    ID with_missing desired_result result
# 1   1            a              a      a
# 2   1            a              a      a
# 3   1         <NA>              a      a
# 4   1         <NA>              a      a
# 5   1            a              a      a
# 6   1            a              a      a
# 7   2            a              a      a
# 8   2            a              a      a
# 9   2         <NA>           <NA>   <NA>
# 10  2            b              b      b
# 11  2            b              b      b
# 12  2            b              b      b
# 13  3            a              a      a
# 14  3         <NA>           <NA>   <NA>
# 15  3         <NA>           <NA>   <NA>
# 16  3         <NA>           <NA>   <NA>
# 17  3            c              c      c
# 18  3            c              c      c
# 19  4            b              b      b
# 20  4         <NA>           <NA>   <NA>
# 21  4            a              a      a
# 22  4            a              a      a
# 23  4            a              a      a
# 24  4            a              a      a
# 25  5            a              a      a
# 26  5         <NA>              a      a
# 27  5         <NA>              a      a
# 28  5         <NA>              a      a
# 29  5         <NA>              a      a
# 30  5            a              a      a
# 31  6            a              a      a
# 32  6            a              b      a
# 33  6         <NA>              b   <NA>
# 34  6            b              b      b
# 35  6            a              a      a
# 36  6            a              a      a
# 37  7            a              a      a
# 38  7            a              a      a
# 39  7         <NA>              a      a
# 40  7         <NA>              a      a
# 41  7            a              a      a
# 42  7            a              a      a
# 43  8            a              a      a
# 44  8            a              a      a
# 45  8         <NA>           <NA>   <NA>
# 46  8            b              b      b
# 47  8            b              b      b
# 48  8            b              b      b
# 49  9            a              a      a
# 50  9         <NA>           <NA>   <NA>
# 51  9         <NA>           <NA>   <NA>
# 52  9         <NA>           <NA>   <NA>
# 53  9            c              c      c
# 54  9            c              c      c
# 55 10            b              b      b
# 56 10         <NA>           <NA>   <NA>
# 57 10            a              a      a
# 58 10            a              a      a
# 59 10            a              a      a
# 60 10            a              a      a
``````
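
The idea of filling in both directions and keeping only the positions where the two fills agree can be sketched on a single vector in base R. The helper names `locf` and `fill_if_same` are hypothetical; in practice the function would be applied per group, for example through `ave()` or inside `mutate()`.

```r
# Hypothetical helper: carry the last non-NA value forward (leading NAs stay NA)
locf <- function(x) {
  idx <- cummax(seq_along(x) * !is.na(x))
  idx[idx == 0] <- NA
  x[idx]
}

# Fill a gap only when the values on both sides of it are the same
fill_if_same <- function(x) {
  fwd <- locf(x)            # forward fill
  bwd <- rev(locf(rev(x)))  # backward fill
  ifelse(!is.na(fwd) & !is.na(bwd) & fwd == bwd, fwd, NA)
}

x <- c("a", NA, NA, "a", NA, "b")
fill_if_same(x)
```

Here the first gap sits between two `"a"` values and is filled, while the gap between `"a"` and `"b"` is left as NA.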
Wednesday, October 6, 2021