Operation on lists as elements of a dataframe in R
I have a time series id, a date, and a list of dates at which an event occurred. I want to know how many times the event has happened before a given date within each time series.
Here is a sample dataframe:
    id <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
    date <- c(2000, 2001, 2002)  # recycled to 9 rows by data.frame()
    df <- data.frame(id, date)
    rand1 <- runif(5) * 4 + 1999
    rand2 <- runif(6) * 4 + 1999
    rand3 <- runif(100) * 4 + 1999
    # one list of event dates per row
    df$events <- list(rand1, rand1, rand1, rand2, rand2, rand2, rand3, rand3, rand3)

This code solves the problem correctly:

    df$past <- NA_real_  # create the column before filling it
    for (i in seq_len(nrow(df))) {
      # count the events in row i that fall before that row's date
      df$past[i] <- sum(df$events[[i]] < df$date[i])
    }

But it seems wildly inefficient to go line by line through the dataframe. My real dataset has 4 million rows, so I need something a little more sensible.
Here is what I tried first. I'm not sure what it's doing, but it ends up setting every element of df$past2 to the same integer.

    df$past2 <- sum(df$events[[1]] < df$date)

The resulting df:

       id  date events       past past2
    <dbl> <dbl> <list>      <dbl> <int>
        1  2000 <dbl [5]>       3     6
        1  2001 <dbl [5]>       3     6
        1  2002 <dbl [5]>       4     6
        2  2000 <dbl [6]>       0     6
        2  2001 <dbl [6]>       3     6
        2  2002 <dbl [6]>       5     6
        3  2000 <dbl [100]>    26     6
        3  2001 <dbl [100]>    55     6
        3  2002 <dbl [100]>    74     6

So:
1) What is the df$past2 calculation doing?
2) Is there a way to do this kind of operation on lists as elements of a dataframe without going line by line?
Thanks.
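The problem with df$past2 is that df$events[[1]] returns only the first row's list, i.e. df[1,]$events[[1]]. That single vector of 5 event dates is compared against the whole df$date column (9 values, so the shorter vector is recycled), and sum() collapses the result to one number, which is then assigned to every row.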
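To see why every row ends up with the same value, here is a minimal sketch with toy numbers. The vectors `events_row1` and `dates` are made-up stand-ins for df$events[[1]] and df$date, not the data from the question.

```r
# Toy stand-ins (hypothetical values) for the first row's events and the date column
events_row1 <- c(1999.5, 2000.2, 2001.7, 2002.4, 2002.9)  # 5 values
dates <- rep(c(2000, 2001, 2002), times = 3)              # 9 values

# The comparison is element-wise; the shorter vector is recycled.
# R warns here because 9 is not a multiple of 5, so the warning is suppressed.
cmp <- suppressWarnings(events_row1 < dates)
length(cmp)        # one logical per date, not one per row's own event list

# sum() collapses all of that to a single number
total <- sum(cmp)

# Assigning a length-1 value to a column repeats it on every row
toy <- data.frame(date = dates)
toy$past2 <- total
```

So the expression never looks at any row's event list other than the first one, which is why it cannot reproduce the loop's result.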
One solution to the problem is to split each row of the dataframe into a list and use lapply:

    df$past2 = unlist(lapply(split(df, seq(nrow(df))), function(x) sum(x$events[[1]] < x$date)))

However, because there is some data manipulation involved, I am not sure how efficient this is for a 4-million-row dataframe. You might need data.table or dplyr to find a more efficient solution.
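As a possible alternative that avoids splitting the dataframe, here is a sketch in base R. It is not from the original answer. The seed and the column names `past_mapply` and `past_flat` are assumptions made for illustration. The first variant pairs each row's event list with its date using mapply. The second flattens all events into one vector and counts per row, which is fully vectorized but needs memory proportional to the total number of events.

```r
set.seed(42)  # assumed seed, only so the sketch is reproducible

# Rebuild the sample dataframe from the question
id <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
date <- rep(c(2000, 2001, 2002), times = 3)
df <- data.frame(id, date)
rand1 <- runif(5) * 4 + 1999
rand2 <- runif(6) * 4 + 1999
rand3 <- runif(100) * 4 + 1999
df$events <- list(rand1, rand1, rand1, rand2, rand2, rand2, rand3, rand3, rand3)

# Variant 1: walk the list column and the date column in parallel
df$past_mapply <- mapply(function(ev, d) sum(ev < d), df$events, df$date)

# Variant 2: flatten, compare once, then count the hits per row
n <- lengths(df$events)               # number of events in each row
row <- rep(seq_len(nrow(df)), n)      # row index for every flattened event
hit <- unlist(df$events) < rep(df$date, n)
df$past_flat <- tabulate(row[hit], nbins = nrow(df))
```

Both columns should match the result of the row-by-row loop. Whether either is fast enough for 4 million rows would need to be measured on the real data.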