Decision tree: R caret train() underperforming on J48 compared to manual parameter setting
I need to optimize the accuracy of the C4.5 algorithm on a churn dataset using RWeka's implementation (J48()). I am therefore using the train() function of the caret package to help me determine the optimal parameter settings (for M and C). I tried to validate the result by manually running J48() with the parameters determined by train(). The result was surprising: the manual run had the better result.
That raises the following questions:
- Which parameters might be different when manually executing J48()?
- How can I get the train() function to provide a similar or better result than the manual parameter setting?
- Or am I totally missing something here?
I'm running the following code:
library("RWeka", lib.loc="~/R/win-library/3.3")
library("caret", lib.loc="~/R/win-library/3.3")
library("gmodels", lib.loc="~/R/win-library/3.3")
set.seed(7331)
Determine the best C4.5 model with J48 using train() from the caret package:
ctrl <- trainControl(method="LGOCV", p=0.8, seeds=NA)
grid <- expand.grid(.M=25*(1:15), .C=c(0.1,0.05,0.025,0.01,0.0075,0.005))
Training the model using the full dataset "response_nochar":
rtrain <- train(churn~., data=response_nochar, method="J48", na.action=na.pass, trControl=ctrl, tuneGrid=grid)
This returns rtrain$finalModel with a prediction accuracy of 0.6055 (and a tree of size 3 with 2 leaves):
# Accuracy was used to select the optimal model using the largest value.
# The final values used for the model were C = 0.005 and M = 25.
There are approx. 50 combinations with an accuracy of 0.6055, ranging from the given values of the final model up to (M=325, C=0.1) (with 1 exception in between).
Trying out the parameter values manually with J48:
# Splitting into training and test datasets, derived from the full dataset "response_nochar"
# Is this similar/equal to the above splitting with LGOCV and p=0.8?
response_sample <- sample(10000, 8000)
response_train <- response_nochar[response_sample,]
response_test <- response_nochar[-response_sample,]
# Setting parameters
jctrl <- Weka_control(M=25, C=0.005)
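The manual split above is a single random 80/20 draw, whereas LGOCV repeats such a split several times and train() reports the accuracy averaged over all of them. A minimal base-R sketch of repeated hold-out indices follows. The helper name `repeated_holdout` and the choice of 25 repeats are my own assumptions for illustration, and unlike caret this version does not stratify the split by class.

```r
# Hypothetical helper (not part of caret or RWeka): draw `times` random
# training-index sets, each containing floor(p * n) of the n row numbers.
# Note: plain random sampling, NOT stratified by the target variable.
repeated_holdout <- function(n, p = 0.8, times = 25) {
  lapply(seq_len(times), function(i) sample(n, floor(p * n)))
}

set.seed(7331)
splits <- repeated_holdout(n = 10000, p = 0.8, times = 25)

# Each element plays the role of `response_sample` in the manual run.
length(splits)       # number of resamples
length(splits[[1]])  # training rows in the first resample
```

Fitting J48() on each index set and averaging the per-split accuracies should give a figure that is more comparable to what train() reports than the result of one split.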
Calculating the model:
c45 <- J48(churn~., data=response_train, na.action=na.pass, control=jctrl)
Predict using the test dataset:
pred_c45 <- predict(c45, newdata=response_test, na.action=na.pass)
The model predicts with an accuracy of 0.655 (and a tree of size 25 with 13 leaves).
CrossTable(response_test$churn, pred_c45, prop.chisq=FALSE, prop.c=FALSE, prop.r=FALSE, dnn=c('actual churn','predicted churn'))
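For reference, the accuracy of the manual run can be derived from such a cross table as the share of observations on the diagonal. A small base-R sketch with made-up toy vectors standing in for response_test$churn and pred_c45 (the values and the level names "no"/"yes" are assumptions, not my real data):

```r
# Toy stand-ins for the actual and predicted churn labels (made-up values).
actual    <- factor(c("yes", "no", "yes", "no", "yes"), levels = c("no", "yes"))
predicted <- factor(c("yes", "no", "no",  "no", "yes"), levels = c("no", "yes"))

# Confusion matrix: rows are actual classes, columns are predicted classes.
confusion <- table(actual, predicted)

# Accuracy = correctly classified cases (the diagonal) / all cases.
accuracy <- sum(diag(confusion)) / sum(confusion)
accuracy  # 0.8 for these toy vectors (4 of 5 correct)
```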
PS: The dataset I use contains 10000 records, and the target variable's distribution is 50:50.