decision tree - R caret train() underperforming on J48 compared to manual parameter setting


I need to optimize the accuracy of the C4.5 algorithm on my churn dataset using RWeka's implementation (J48()). To do so, I am using the train() function of the caret package to help me determine the optimal parameter settings (for M and C). I tried to validate the result by manually running J48() with the parameters determined by train(). The result was surprising: the manual run had the better result.

That raises the following questions:

  • Which parameters might be different when manually executing J48()?
  • How can I get the train() function to provide a similar or better result than the manual parameter setting?
  • Or am I totally missing something here?

I'm running the following code:

    library("RWeka", lib.loc="~/R/win-library/3.3")
    library("caret", lib.loc="~/R/win-library/3.3")
    library("gmodels", lib.loc="~/R/win-library/3.3")

    set.seed(7331)

Determine the best C4.5 model with J48 using train() from the caret package:

    ctrl <- trainControl(method="LGOCV", p=0.8, seeds=NA)
    grid <- expand.grid(.M=25*(1:15), .C=c(0.1,0.05,0.025,0.01,0.0075,0.005))

Training the model using the full dataset "response_nochar":

    rtrain <- train(churn~., data=response_nochar, method="J48", na.action=na.pass, trControl=ctrl, tuneGrid=grid)

This returns rtrain$finalModel with a prediction accuracy of 0.6055 (and a tree of size 3 with 2 leaves):

    # Accuracy was used to select the optimal model using the largest value.
    # The final values used for the model were C = 0.005 and M = 25.

There are approx. 50 combinations with the same 0.6055 accuracy, ranging from the given values of the final model up to (M=325, C=0.1) (with 1 exception in between).
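To list exactly which grid points tie for the best resampled accuracy, the results table stored by train() can be filtered. The sketch below is base R only and uses a small made-up data frame, `res`, as a stand-in for `rtrain$results` (one row per C/M combination with an Accuracy column). The values in it are purely illustrative, not my real results.

```r
# Hypothetical stand-in for rtrain$results; values are made up for illustration.
res <- data.frame(
  C        = c(0.005, 0.005, 0.01, 0.1),
  M        = c(25, 50, 25, 325),
  Accuracy = c(0.6055, 0.6055, 0.6010, 0.6055)
)

# Keep every row whose accuracy equals the maximum (the tied candidates).
best <- res[res$Accuracy == max(res$Accuracy), ]

# Order the ties by C, then M, so the range of tied settings is easy to read.
best <- best[order(best$C, best$M), ]
print(best)
```

With the real object, replacing `res` by `rtrain$results` should give the same kind of listing.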

Trying out the parameter values manually on J48:

    # Splitting into training and test datasets, derived from the full dataset "response_nochar"
    # Is this similar/equal to the splitting done above by LGOCV with p=0.8?
    response_sample <- sample(10000, 8000)
    response_train <- response_nochar[response_sample,]
    response_test <- response_nochar[-response_sample,]

    # Setting the parameters
    jctrl <- Weka_control(M=25, C=0.005)

Calculating the model:

    c45 <- J48(churn~., data=response_train, na.action=na.pass, control=jctrl)

Predicting on the test dataset:

    pred_c45 <- predict(c45, newdata=response_test, na.action=na.pass)

The model predicts with an accuracy of 0.655 (and a tree of size 25 with 13 leaves).

    CrossTable(response_test$churn, pred_c45, prop.chisq=FALSE, prop.c=FALSE, prop.r=FALSE, dnn=c('actual churn','predicted churn'))
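For reference, the accuracy figure can also be computed directly from the actual and predicted labels rather than read off the cross table. This is a minimal base R sketch; `actual` and `predicted` are hypothetical toy vectors standing in for `response_test$churn` and `pred_c45`, and the "no"/"yes" level names are an assumption.

```r
# Hypothetical stand-ins for response_test$churn and pred_c45 (toy values).
actual    <- factor(c("yes", "no", "yes", "no", "yes", "no"), levels = c("no", "yes"))
predicted <- factor(c("yes", "no", "no",  "no", "yes", "yes"), levels = c("no", "yes"))

# Confusion matrix: rows are actual classes, columns are predicted classes.
cm <- table(actual = actual, predicted = predicted)

# Accuracy is the share of cases on the diagonal of the confusion matrix.
acc <- sum(diag(cm)) / sum(cm)
print(cm)
print(acc)
```

The same number should come out of `mean(predicted == actual)`, which is a quick check when both factors share the same levels.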

PS: The dataset I use contains 10000 records, and the target variable's distribution is 50:50.
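One difference I am aware of is that my manual run scores a single 80/20 split, while LGOCV, as far as I understand it, repeats the split several times and averages the accuracy. A sketch of repeating the manual split in base R is below. Everything in it is an assumption made so that it runs without RWeka: `toy` is synthetic data with the same size and 50:50 balance, `fit_and_score()` is a hypothetical placeholder (a simple threshold rule, not J48), and 25 repetitions is an arbitrary choice.

```r
set.seed(7331)

# Synthetic stand-in for response_nochar: 10000 rows, 50:50 target (made up).
n <- 10000
toy <- data.frame(x = rnorm(n))
toy$churn <- factor(rep(c("no", "yes"), each = n / 2))
toy$x <- toy$x + ifelse(toy$churn == "yes", 0.5, 0)

# Hypothetical placeholder for "fit J48 on train, score on test":
# a one-variable threshold rule, used only so the sketch runs without RWeka.
fit_and_score <- function(train, test) {
  cut  <- mean(tapply(train$x, train$churn, mean))  # midpoint of the class means
  pred <- ifelse(test$x > cut, "yes", "no")
  mean(pred == test$churn)
}

# Repeat the 80/20 split and collect one hold-out accuracy per repetition.
n_rep <- 25
acc <- replicate(n_rep, {
  idx <- sample(n, 8000)
  fit_and_score(toy[idx, ], toy[-idx, ])
})

# The mean is one averaged estimate; the range shows how far single splits vary.
print(mean(acc))
print(range(acc))
```

Swapping the body of `fit_and_score()` for the J48() and predict() calls above would give a manual estimate that is averaged over splits instead of depending on one draw. Whether that accounts for the gap I see is exactly what I am unsure about.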

