R - correlation matrix by group on a single column
I have a data frame with a single value column, and I want to calculate the correlation matrix between groups. Each group has the same number of rows. It's a big data frame, and I don't want to cast it to wide format because of memory constraints. Is there a way to do this in R without recasting?
Example:
dt <- data.table(group=rep(1:100,each=100000), value=rnorm(100000*100))
some_corr_function_not_requiring_recast(dt, value, by=group)
This should return a 100x100 matrix of correlations.
Here's an example with smaller data, using base R for the computation (without using data.table methods).
# data
library(data.table)
set.seed(42)
dt <- data.table(group = rep(1:5, each = 20), value = rnorm(20 * 5))
1
This works by first obtaining the unique values of group, then running cor on the value entries for each pair of groups.
groups = unique(dt$group)
sapply(1:length(groups), function(i)
    sapply(1:length(groups), function(j)
        cor(x = dt$value[dt$group == groups[i]],
            y = dt$value[dt$group == groups[j]])))
#            [,1]         [,2]       [,3]        [,4]         [,5]
#[1,]  1.00000000  0.436949356 0.04324370 -0.03960938  0.281518699
#[2,]  0.43694936  1.000000000 0.03976509 -0.06555478  0.005944951
#[3,]  0.04324370  0.039765093 1.00000000  0.33289052  0.211291403
#[4,] -0.03960938 -0.065554780 0.33289052  1.00000000 -0.183091610
#[5,]  0.28151870  0.005944951 0.21129140 -0.18309161  1.000000000
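As a sanity check on small data, the pairwise result can be compared with what cor returns on a wide matrix. The sketch below is only an illustration: it uses a plain data.frame named df (a hypothetical stand-in for dt) and assumes the rows are sorted by group and that every group has the same number of rows, as stated in the question. The wide reshape is affordable here only because the example is tiny.

```r
# Small example data in a plain data.frame (hypothetical stand-in for dt)
set.seed(42)
df <- data.frame(group = rep(1:5, each = 20), value = rnorm(20 * 5))

groups <- unique(df$group)

# Pairwise correlations computed directly on the long data
pairwise <- sapply(seq_along(groups), function(i)
  sapply(seq_along(groups), function(j)
    cor(df$value[df$group == groups[i]],
        df$value[df$group == groups[j]])))

# Reference result from the wide layout: one column per group.
# Assumes rows are sorted by group and all groups have equal size.
wide <- matrix(df$value, ncol = length(groups))
reference <- cor(wide)

# Compare the two matrices up to numerical tolerance
all.equal(pairwise, reference, check.attributes = FALSE)
```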
2
Another approach that works without recasting is to split dt into a list based on group.
temp = split(dt, dt$group)
sapply(1:length(temp), function(i)
    sapply(1:length(temp), function(j)
        cor(x = temp[[i]]$value,
            y = temp[[j]]$value)))
#            [,1]         [,2]       [,3]        [,4]         [,5]
#[1,]  1.00000000  0.436949356 0.04324370 -0.03960938  0.281518699
#[2,]  0.43694936  1.000000000 0.03976509 -0.06555478  0.005944951
#[3,]  0.04324370  0.039765093 1.00000000  0.33289052  0.211291403
#[4,] -0.03960938 -0.065554780 0.33289052  1.00000000 -0.183091610
#[5,]  0.28151870  0.005944951 0.21129140 -0.18309161  1.000000000
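Both versions call cor for every ordered pair, so each correlation is computed twice and the diagonal is computed even though it is always 1. Because the matrix is symmetric, a possible refinement is to visit each unordered pair once and mirror the value. The following is a minimal sketch under that idea, not a benchmarked solution; df, vals, and res are illustrative names. Splitting only the value column still makes one extra copy of that column, which may matter at the scale described in the question.

```r
# Small example data in a plain data.frame (hypothetical stand-in for dt)
set.seed(42)
df <- data.frame(group = rep(1:5, each = 20), value = rnorm(20 * 5))

# Split only the value column; each list element is one group's vector.
# Note: this copies the value column once.
vals <- split(df$value, df$group)
k <- length(vals)

# Start from an identity matrix, since cor(x, x) is 1 on the diagonal
res <- diag(k)
dimnames(res) <- list(names(vals), names(vals))

# Visit each unordered pair (i < j) once and mirror the result
for (i in seq_len(k - 1)) {
  for (j in (i + 1):k) {
    r <- cor(vals[[i]], vals[[j]])
    res[i, j] <- r
    res[j, i] <- r
  }
}

res
```

For k groups this makes k * (k - 1) / 2 calls to cor instead of k^2, roughly halving the work.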