R: find two strings most commonly found together per category

February 15, 2013

I have a data frame (df) with 3 columns: ID number, category, and brand:

id             category        brand
00129          bits            b89
00129          bits            b87
00129          bits            b87
00129          logs            b32
00129          logs            b27
00129          logs            b27
00130          bits            b12
00130          bits            b14
00130          bits            b14
00131          logs            b32
00131          logs            b27
00131          logs            b32
00132          bits            b77
00132          bits            b89
00132          bits            b89

I have 200 different categories and 2000 different brands.

I want to find the 2 brands per category that are most commonly bought together by the same ID numbers:

category       brand
bits           b89,b87
logs           b32,b27

Or:

#$bits
#[1] "b89" "b87"
#$logs
#[1] "b32" "b27"

The only way I can think of is to rework the data frame so that the counts are calculated with acknowledgment of the different ID numbers:

     b89   b87   b32   b27   b12   b14
1    1     2     1     2     0     0
2    0     0     0     0     1     2
3    0     0     2     1     0     0
4    2     1     0     0     0     0

And then, for each column, return the other columns that are populated with values greater than 0 in the rows where that column is populated with a value greater than 0.

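# For each brand column i, collect the other brand columns that have a
# count > 0 in the rows where column i has a count > 0.
list1 = setNames(object = lapply(1:ncol(df), function(i)
  unique(colnames(df)[-i][which(as.matrix(df[which(df[,i] > 0), -i]) > 0,
                                arr.ind = TRUE)[,2]])),
  nm = colnames(df))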

But this sacrifices the category, which I need. Any thoughts on how to tackle this?
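The wide ID-by-brand layout described in the question can also be built separately for each category, which keeps the category information. A minimal base R sketch, assuming the sample data from the question (the names df and counts_by_category are only illustrative, and the IDs are stored as character to keep their leading zeros):

```r
# Sample data copied from the question.
df <- data.frame(
  id = rep(c("00129", "00130", "00131", "00132"), times = c(6, 3, 3, 3)),
  category = c(rep("bits", 3), rep("logs", 3), rep("bits", 3),
               rep("logs", 3), rep("bits", 3)),
  brand = c("b89", "b87", "b87", "b32", "b27", "b27",
            "b12", "b14", "b14", "b32", "b27", "b32",
            "b77", "b89", "b89"),
  stringsAsFactors = FALSE
)

# One id-by-brand count table per category, so the category is not lost.
counts_by_category <- lapply(split(df, df$category), function(d) {
  table(d$id, d$brand)
})

# Inspect the table for a single category.
counts_by_category$logs
```

Each element of the resulting list only contains the IDs and brands that actually occur in that category, so with 200 categories and 2000 brands the individual tables stay much smaller than one global matrix.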

This might do the trick. I ended up with a combination of data.table and dplyr, because I am not that familiar with data.table yet.

library(data.table)
library(dplyr)

dt = data.table(read.table(text="id             category              brand
00129          bits            b89
00129          bits            b87
00129          bits            b87
00129          logs            b32
00129          logs            b27
00129          logs            b27
00130          bits            b12
00130          bits            b14
00130          bits            b14
00131          logs            b32
00131          logs            b27
00131          logs            b32
00132          bits            b77
00132          bits            b89
00132          bits            b89", header=TRUE))

# All combinations of 2 distinct brands purchased per id and category.
dt = dt[, .(list(unique(brand))), .(id, category)][, .(combn(unlist(V1), 2, simplify=FALSE)), .(id, category)]

# Concatenate each pair of brands into a single string.
dt$V1 = unlist(lapply(dt$V1, function(x) {paste(x, collapse=", ")}))

# Fetch the most frequent pair(s) per category; ties are all kept.
dt %>% group_by(V1, category) %>% summarize(n=n()) %>% group_by(category) %>% top_n(n = 1) %>% select(-n)

Output:

        V1 category
1 b12, b14     bits
2 b32, b27     logs
3 b77, b89     bits
4 b89, b87     bits

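Which I think is correct, considering the dataset, although it does not match your expected output? In bits, each of the three pairs is bought by exactly one ID, so they tie and all three are returned, while in logs the pair b32, b27 is bought by two IDs.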
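One way to check this result is to tabulate how many IDs bought each pair within each category. The following is a base R sketch rather than the approach used in the answer; the names df, pairs_for and pair_counts are only illustrative. Unlike the code above, it sorts the brands inside each pair, so that "b27, b32" and "b32, b27" would be counted as the same pair.

```r
# Sample data copied from the question.
df <- data.frame(
  id = rep(c("00129", "00130", "00131", "00132"), times = c(6, 3, 3, 3)),
  category = c(rep("bits", 3), rep("logs", 3), rep("bits", 3),
               rep("logs", 3), rep("bits", 3)),
  brand = c("b89", "b87", "b87", "b32", "b27", "b27",
            "b12", "b14", "b14", "b32", "b27", "b32",
            "b77", "b89", "b89"),
  stringsAsFactors = FALSE
)

# All pairs of distinct brands for one id, as "x, y" strings with x < y.
# Returns an empty vector when fewer than two distinct brands are present.
pairs_for <- function(brands) {
  b <- sort(unique(brands))
  if (length(b) < 2) return(character(0))
  combn(b, 2, FUN = paste, collapse = ", ")
}

# For each category: build the pairs per id, then count how many ids
# contributed each pair, most frequent first.
pair_counts <- lapply(split(df, df$category), function(d) {
  pairs <- unlist(lapply(split(d$brand, d$id), pairs_for), use.names = FALSE)
  sort(table(pairs), decreasing = TRUE)
})

pair_counts
```

On the sample data, logs contains a single pair that is counted for two IDs, while bits contains three pairs that are each counted once, which is why the answer returns three rows for bits.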

Optionally, add

dt %>% group_by(id,category) %>% mutate(unique_types = n_distinct(brand)) %>% filter(unique_types>1)

in front if there are purchases of only a single brand, since combn(n,m) does not work if length(n)<m.
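The same guard can also be written as a small wrapper around combn(). This is only a sketch; safe_pairs is a hypothetical helper name and not part of any package.

```r
# Hypothetical helper: all unordered pairs of distinct brands, or an empty
# list when fewer than two distinct brands are present.
safe_pairs <- function(brands) {
  b <- unique(as.character(brands))
  if (length(b) < 2) {
    return(list())
  }
  combn(b, 2, simplify = FALSE)
}

# combn() itself stops with an error when given a single brand.
single <- tryCatch(combn("b89", 2, simplify = FALSE),
                   error = function(e) "error")

safe_pairs("b89")                   # empty list, no error
safe_pairs(c("b89", "b87", "b87"))  # a single pair: "b89" "b87"
```

Converting to character first also avoids a documented special case of combn(): a single positive number x is treated as seq_len(x), so a lone numeric value would silently produce pairs instead of failing.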
