Memory issues: Cluster-analysis with very large multi-scaled data in R using Gower distance and k-medoids
I have a large data frame named 'data' with 350000 rows and 138 columns, on which I want to use k-medoids clustering. I am using the code from this page: http://dpmartin42.github.io/blogposts/r/cluster-mixed-types
This is my code:
packages <- c("dplyr", "ISLR", "cluster", "Rtsne", "ggplot2")

if (length(setdiff(packages, rownames(installed.packages()))) > 0) {
  install.packages(setdiff(packages, rownames(installed.packages())))
}
rm(packages)

library(dplyr)   # for data cleaning
library(ISLR)    # for college dataset
library(cluster) # for gower similarity and pam
library(Rtsne)   # for t-SNE plot
library(ggplot2) # for visualization

data <- read.csv("data.csv", sep = ";")

## creation of dissimilarity matrix using "Gower distance" for mixed data types
gower_dist <- daisy(data,
                    metric = "gower",
                    type = list())
gower_mat <- as.matrix(gower_dist)
#write.table(gower_mat, file = "dissimilarity.csv")
#summary(gower_dist)

sil_width <- c(NA)
for (l in 2:8) {
  pam_fit <- pam(gower_dist,
                 diss = TRUE,
                 k = l)
  sil_width[l] <- pam_fit$silinfo$avg.width
}

nclust <- which.max(sil_width)            # identify index of highest value
opt.value <- max(sil_width, na.rm = TRUE) # identify highest value
ncluster <- round(mean(nclust))
valcluster <- max(opt.value)

## start PAM clustering with n clusters
pam_fit <- pam(gower_dist, diss = TRUE, k = ncluster)
# summarise the same rows the dissimilarities were computed on
pam_results <- data %>%
  mutate(cluster = pam_fit$clustering) %>%
  group_by(cluster) %>%
  do(the_summary = summary(.))
#pam_results$the_summary

#data[pam_fit$medoids, ]

tsne_obj <- Rtsne(gower_dist, is_distance = TRUE)

tsne_data <- tsne_obj$Y %>%
  data.frame() %>%
  setNames(c("X", "Y")) %>%
  mutate(cluster = factor(pam_fit$clustering))

ggplot(aes(x = X, y = Y), data = tsne_data) +
  geom_point(aes(color = cluster))
The steps I want to perform are:
1) Create a dissimilarity matrix using the Gower distance for multi-scaled data
2) Look for the optimal number of clusters
3) Perform k-medoids clustering
4) Visualize the clustering using Rtsne for the visualization of multi-dimensional data
The code works fine for a data subset of 10000 rows.
If I try to run the code on more rows, I get memory issues. With the entire data frame I get the error 'Error: cannot allocate vector of size 506.9 Gb', which is raised at this step:
gower_dist <- daisy(data.sample, metric = "gower", type = list(), warnType = FALSE) # suppress warning regarding data type
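As a rough back-of-the-envelope check of why this step fails, the sketch below estimates the size of the object that daisy() has to allocate. It assumes the dissimilarities are stored as the lower triangle of the matrix in 8-byte doubles; dist_size_gb is a hypothetical helper name used only for illustration, not part of the code above.

```r
# Approximate memory for a "dist"-style object on n rows:
# n * (n - 1) / 2 pairwise values, 8 bytes each (assumption: doubles).
dist_size_gb <- function(n, bytes_per_value = 8) {
  n * (n - 1) / 2 * bytes_per_value / 1024^3
}

dist_size_gb(10000)   # about 0.37, fits in RAM on most machines
dist_size_gb(350000)  # about 456, far beyond ordinary RAM
```

The requirement grows quadratically with the number of rows, and converting the result with as.matrix() roughly doubles it again because the full square matrix is stored.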
I know that the creation of the dissimilarity matrix needs a lot of RAM. So my question is not about coding but about methodology: is there a meaningful way to create the dissimilarity matrix and perform the clustering on the entire data frame? I was thinking about 2 alternatives:
Option 1: Create the dissimilarity matrix iteratively in steps of 1000 rows. I am not sure if this makes sense, since the matrix compares each row with each row.
Option 2: Create a loop in which all steps are performed on data subsets of 1000 rows selected randomly, and repeat the steps many times until a representative clustering is reached. I am also not sure if this makes sense.
Is it possible to perform the code above on very large datasets in R?
SLINK only requires linear memory. DBSCAN and OPTICS, too.
DBSCAN is a bit tricky to parameterize (which value of epsilon?), so OPTICS may be worth a try. I don't know if Gower can be indexed to accelerate the algorithm.
But you'll hit the same problem later in t-SNE!
What I would consider is to first work with a manageable subset only. Then, once you know what works, you can either use all the data (with DBSCAN, you can try using the same epsilon but increasing minPts for the larger data size), or you simply add the remaining points to the same cluster as their nearest neighbor in the sample.
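A minimal sketch of the last suggestion (cluster a sample, then give every remaining point the cluster of its nearest neighbor in the sample) could look like the following. The synthetic data frame, the sample size of 500, the chunk size of 1000 and k = 2 are all assumptions chosen for illustration. Note also that daisy() rescales numeric columns by the range of whatever rows it is given, so the Gower distances differ slightly from chunk to chunk; treat this as an approximation, not an exact procedure.

```r
library(cluster)  # daisy() and pam()

set.seed(42)

# Hypothetical stand-in for the real 350000 x 138 data frame.
n <- 3000
data <- data.frame(
  num1 = c(rnorm(n / 2, mean = 0), rnorm(n / 2, mean = 6)),
  num2 = runif(n),
  cat1 = factor(sample(c("a", "b", "c"), n, replace = TRUE))
)

# 1) Cluster a manageable random subset with Gower distance and PAM.
sample_idx  <- sample(nrow(data), 500)
data_sample <- data[sample_idx, ]
pam_fit <- pam(daisy(data_sample, metric = "gower"), diss = TRUE, k = 2)

# 2) Assign the remaining rows, chunk by chunk, to the cluster of their
#    nearest neighbor in the sample. Each pass only needs a
#    (sample size + chunk size)^2 matrix instead of an n^2 one.
cluster_all <- integer(nrow(data))
cluster_all[sample_idx] <- pam_fit$clustering

rest_idx   <- setdiff(seq_len(nrow(data)), sample_idx)
chunk_size <- 1000
n_sample   <- nrow(data_sample)

for (start in seq(1, length(rest_idx), by = chunk_size)) {
  chunk_idx <- rest_idx[start:min(start + chunk_size - 1, length(rest_idx))]
  combined  <- rbind(data_sample, data[chunk_idx, ])
  d <- as.matrix(daisy(combined, metric = "gower"))
  # Keep only the block: rows = chunk points, columns = sample points.
  d_chunk_to_sample <- d[-seq_len(n_sample), seq_len(n_sample), drop = FALSE]
  nearest <- apply(d_chunk_to_sample, 1, which.min)
  cluster_all[chunk_idx] <- pam_fit$clustering[nearest]
}

table(cluster_all)  # cluster sizes over the full data set
```

Peak memory is governed by the sample size plus the chunk size, not by the total number of rows, so both can be tuned to the available RAM. The same caveat as above applies to t-SNE: the plot would have to be drawn for the sample only.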