R: find the two strings most commonly found together per category
I have a data frame (df) with 3 columns: ID number, category, and brand:
    id    category brand
    00129 bits     b89
    00129 bits     b87
    00129 bits     b87
    00129 logs     b32
    00129 logs     b27
    00129 logs     b27
    00130 bits     b12
    00130 bits     b14
    00130 bits     b14
    00131 logs     b32
    00131 logs     b27
    00131 logs     b32
    00132 bits     b77
    00132 bits     b89
    00132 bits     b89
I have 200 different categories and 2000 different brands.
I want to find the 2 brands per category that are most commonly bought together by the same ID numbers:
    category brand
    bits     b89,b87
    logs     b32,b27
Or:
    #$bits
    #[1] "b89" "b87"
    #$logs
    #[1] "b32" "b27"
The way I can think of is to rework the data frame like this, making sure the counts are calculated with acknowledgment of the different ID numbers:
      b89 b87 b32 b27 b12 b14
    1   1   2   1   2   0   0
    2   0   0   0   0   1   2
    3   0   0   2   1   0   0
    4   2   1   0   0   0   0
and then return, for each column, the other columns that are populated with values greater than 0 in the rows where that column is populated with a value greater than 0:
    list1 = setNames(object = lapply(1:ncol(df), function(i) unique(colnames(df)[-i][which(as.matrix(df[which(df[,i] > 0), -i]) > 0, arr.ind = TRUE)[,2]])), nm = colnames(df))
But this sacrifices the category, which I need. Any thoughts on how to tackle this?
This might do the trick. I ended up with a combination of data.table and dplyr, because I am not that familiar with data.table yet.
    library(data.table)
    library(dplyr)

    dt = data.table(read.table(text="id category brand
    00129 bits b89
    00129 bits b87
    00129 bits b87
    00129 logs b32
    00129 logs b27
    00129 logs b27
    00130 bits b12
    00130 bits b14
    00130 bits b14
    00131 logs b32
    00131 logs b27
    00131 logs b32
    00132 bits b77
    00132 bits b89
    00132 bits b89",header=TRUE))

    # All combinations of 2 distinct brands purchased per id and category.
    dt = dt[,.(list(unique(brand))),.(id,category)][, .(combn(unlist(V1), 2,simplify=FALSE)),.(id,category)]

    # Concatenate the 2 brands of each combination into one string.
    dt$V1 = unlist(lapply(dt$V1,function(x) {paste(x,collapse=", ")}))

    # Fetch the most frequent combination(s) per category; ties are all kept.
    dt %>% group_by(V1,category) %>% summarize(n=n()) %>% group_by(category) %>% top_n(n = 1) %>% select(-n)
Output:
            V1 category
    1 b12, b14     bits
    2 b32, b27     logs
    3 b77, b89     bits
    4 b89, b87     bits
I think this is correct, considering the dataset, although it does not match the expected output.
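To see why bits comes back with three rows, it can help to look at the pair counts directly. The following is a minimal base R sketch, not the method used above: the names u, pair_counts and top_pairs are made up for illustration, and sorting the brands within each pair is an added assumption so that "b27, b32" and "b32, b27" would count as the same pair.

```r
# Sample data from the question; everything is read as character
# so the leading zeros of the ids are kept.
df <- read.table(text = "id category brand
00129 bits b89
00129 bits b87
00129 bits b87
00129 logs b32
00129 logs b27
00129 logs b27
00130 bits b12
00130 bits b14
00130 bits b14
00131 logs b32
00131 logs b27
00131 logs b32
00132 bits b77
00132 bits b89
00132 bits b89", header = TRUE, colClasses = "character")

# Keep one row per distinct (id, category, brand).
u <- unique(df)

# For each category, build every unordered brand pair per id and count them.
pair_counts <- lapply(split(u, u$category), function(d) {
  pairs <- unlist(lapply(split(d$brand, d$id), function(b) {
    if (length(b) < 2) return(character(0))  # ids with a single brand have no pair
    combn(sort(b), 2, FUN = paste, collapse = ", ")
  }))
  table(pairs)
})

# Keep the most frequent pair(s) per category; ties are all returned.
top_pairs <- lapply(pair_counts, function(tab) {
  names(tab)[as.vector(tab) == max(tab)]
})

print(pair_counts)
print(top_pairs)
```

On the sample data, every pair in bits is bought by exactly one ID, so all three are tied for first place, while logs has a single pair that is bought by two IDs.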
Optionally, add

    dt %>% group_by(id,category) %>% mutate(unique_types = n_distinct(brand)) %>% filter(unique_types>1)

in front if there are purchases of only a single brand, since combn(n,m) does not work if length(n)<m.
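The combn limitation can be checked in isolation. This is a small sketch under the assumption that the brands are character vectors; safe_pairs is a hypothetical helper, not part of base R, data.table or dplyr.

```r
# combn() raises an error when asked for pairs from a single brand.
res <- try(combn("b89", 2), silent = TRUE)
print(inherits(res, "try-error"))

# Hypothetical guard: return an empty list instead of failing
# when there are fewer than m items to combine.
safe_pairs <- function(x, m = 2) {
  if (length(x) < m) return(list())
  combn(x, m, simplify = FALSE)
}

print(length(safe_pairs("b89")))            # no pairs for a single brand
print(safe_pairs(c("b89", "b87", "b32")))   # all pairs of three brands
```

A guard like this could be used instead of filtering beforehand, at the cost of carrying empty results through the later steps.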