Operation on lists as elements of a dataframe in R
I have a time series ID, a date, and a list of dates at which an event occurred. I want to know how many times the event has happened before a given date within each time series.
Here is a sample dataframe:
id <- c(1,1,1,2,2,2,3,3,3)
date <- c(2000,2001,2002)
df <- data.frame(id,date)
rand1 <- c(runif(5)*4+1999)
rand2 <- c(runif(6)*4+1999)
rand3 <- c(runif(100)*4+1999)
df$events <- list(rand1, rand1, rand1, rand2, rand2, rand2, rand3, rand3, rand3)
This code solves the problem correctly:
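df$past <- NA_real_  # create the column first so the row-wise assignment has a target
for (i in c(1:9)){
  print(i)
  # count the events in row i that fall before that row's date
  df[i,]$past <- sum( df[i,]$events[[1]] < df[i,]$date)
}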
But it seems wildly inefficient to go line by line through the dataframe. My real dataset has 4 million rows, so I need something a little more sensible.
Here is what I tried first. I'm not sure what it's doing, but it ends up setting every element of df$past2 to the same integer.
df$past2 <- sum(df$events[[1]] < df$date)
The resulting df:
     id  date events       past past2
  <dbl> <dbl> <list>      <dbl> <int>
      1  2000 <dbl [5]>       3     6
      1  2001 <dbl [5]>       3     6
      1  2002 <dbl [5]>       4     6
      2  2000 <dbl [6]>       0     6
      2  2001 <dbl [6]>       3     6
      2  2002 <dbl [6]>       5     6
      3  2000 <dbl [100]>    26     6
      3  2001 <dbl [100]>    55     6
      3  2002 <dbl [100]>    74     6
So,
1) What is the df$past2 calculation doing?
2) Is there a way to do this kind of operation on lists as elements of a dataframe without going line by line?
Thanks.
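The problem with df$past2 is that df$events[[1]] returns only the first row's list, that is, df[1,]$events[[1]]. That single vector is compared against the whole df$date column (R recycles the shorter vector), sum() collapses the result to one number, and that number is then assigned to every row.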
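To make the recycling concrete, here is a minimal sketch. The vector first_events is a hypothetical stand-in for df$events[[1]] (the original values are random), and suppressWarnings() only hides the "longer object length is not a multiple of shorter object length" warning that R would otherwise emit.

```r
# Hypothetical event dates standing in for df$events[[1]] (5 values)
first_events <- c(1999.5, 2000.5, 2001.5, 2002.5, 1999.9)
date <- rep(c(2000, 2001, 2002), times = 3)   # the 9 dates of the dataframe

# 5 values compared against 9: R recycles the shorter vector
cmp <- suppressWarnings(first_events < date)  # logical vector of length 9
total <- sum(cmp)                             # collapses to a single number

df <- data.frame(date)
df$past2 <- total                             # the scalar is repeated in every row
```

Every row of df$past2 ends up holding the same value, which matches the constant column seen in the question.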
One solution to the problem is to split each row of the dataframe into a list and use lapply:
df$past2 = unlist(lapply(split(df,seq(nrow(df))),function(x) sum(x$events[[1]]< x$date)))
However, because this involves a lot of data manipulation, I'm not sure it will be efficient on a dataframe with 4 million rows. You might need data.table or dplyr to find a more efficient solution.
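As a possible middle ground in base R, the split step can be skipped by walking the list column and the date column in parallel with mapply. This is only a sketch, not a benchmarked solution for 4 million rows: it still calls an R function once per row, and the column name past3 and the seed are assumptions added for illustration.

```r
set.seed(42)  # assumption: seed added only so the sketch is reproducible

# Same sample data as in the question
id <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
date <- rep(c(2000, 2001, 2002), times = 3)
df <- data.frame(id, date)
rand1 <- runif(5) * 4 + 1999
rand2 <- runif(6) * 4 + 1999
rand3 <- runif(100) * 4 + 1999
df$events <- list(rand1, rand1, rand1, rand2, rand2, rand2, rand3, rand3, rand3)

# Pair each row's event vector with that row's date and count earlier events
df$past3 <- mapply(function(ev, d) sum(ev < d), df$events, df$date)
```

The result should agree with the row-by-row loop from the question, since both evaluate sum(events < date) for each row. No intermediate one-row dataframes are built along the way.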