R - partykit minsize option drops branches that exceed minsize

January 15, 2012

I'm using the lmtree() function from partykit to partition data using linear regressions. The regressions use a weight, and I want to ensure that each branch has a minimum total weight, which I specify with the minsize option. For instance, in the following example the tree has only 2 branches instead of 3 because x1 == "c" has too small a total weight to be in its own branch.

    library("partykit")

    # Three groups with different slopes; group "c" has half the weight
    n <- 100
    x <- rbind(
      data.frame(tt = 1:n, x1 = "a", weight = 2, y = seq(1, length.out = n, by = 0.2) + rnorm(n, sd = 0.2)),
      data.frame(tt = 1:n, x1 = "b", weight = 2, y = seq(1, length.out = n, by = 0.4) + rnorm(n, sd = 0.2)),
      data.frame(tt = 1:n, x1 = "c", weight = 1, y = seq(1, length.out = n, by = 0.6) + rnorm(n, sd = 0.2))
    )
    x$x1 <- factor(x$x1)
    tr <- lmtree(y ~ tt | x1, data = x, weights = weight, minsize = 150)
    print(tr)

    Fitted party:
    [1] root
    |   [2] x1 in a: n = 200
    |       (Intercept)          tt
    |         0.7724903   0.2002023
    |   [3] x1 in b, c: n = 300
    |       (Intercept)          tt
    |         0.5759213   0.4659592

I have real-world data, which is unfortunately confidential, that leads to behavior I do not understand. When I do not specify minsize, lmtree() builds a tree with 30 branches, and in every branch the total weight n is a large number. However, when I specify a minsize that is below the total weight of every branch in the first tree, the result is a new tree with many fewer branches. I would not have expected the tree to change at all, because it seems that minsize is not binding. Is there an explanation for this result?

Update

Providing an example:

    # Same setup, but with a numeric partitioning variable x1
    n <- 100
    x <- rbind(
      data.frame(tt = 1:n, x1 = runif(n, 0.0, 0.3), weight = 2, y = seq(1, length.out = n, by = 0.2) + rnorm(n, sd = 0.2)),
      data.frame(tt = 1:n, x1 = runif(n, 0.3, 0.7), weight = 2, y = seq(1, length.out = n, by = 0.4) + rnorm(n, sd = 0.2)),
      data.frame(tt = 1:n, x1 = runif(n, 0.7, 1.0), weight = 1, y = seq(1, length.out = n, by = 0.6) + rnorm(n, sd = 0.2))
    )
    tr <- lmtree(y ~ tt | x1, data = x, weights = weight)
    print(tr)

    Fitted party:
    [1] root
    |   [2] x1 <= 0.29787: n = 200
    |       (Intercept)          tt
    |         0.8431985   0.1994021
    |   [3] x1 > 0.29787
    |   |   [4] x1 <= 0.69515: n = 200
    |   |       (Intercept)          tt
    |   |         0.6346980   0.3995678
    |   |   [5] x1 > 0.69515: n = 100
    |   |       (Intercept)          tt
    |   |         0.4792462   0.5987472

Now let's set minsize = 150. The tree no longer has any splits, even though x1 <= 0.3 and x1 > 0.3 would work.

    tr <- lmtree(y ~ tt | x1, data = x, weights = weight, minsize = 150)
    print(tr)

    Fitted party:
    [1] root: n = 500
        (Intercept)          tt
          0.6870078   0.3593374

Two rules applied in mob() (the infrastructure underlying lmtree()) are important in this context and may benefit from a more explicit discussion:

If mob() selects a splitting variable at some stage that does not lead to a single admissible split (in terms of the minimal node size), then splitting stops at that point. In contrast, ctree() performs a split if a significant test is detected, even if the second-best variable is non-significant. We could offer more granular control on this, and we have it on our wishlist for an upcoming revision of the package.
By default, weights are interpreted as case weights, i.e., mob() thinks that there are w independent observations identical to the given one. Thus, the number of observations is the sum of the weights. Note that this also affects the significance tests, as the sample size increases!
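As a rough illustration of how the two rules interact, the following base R sketch mimics the factor example from the question without fitting any model. The helper admissible() is hypothetical and is not part of partykit; it only assumes that a node's size is the sum of its case weights and that both child nodes must reach minsize.

```r
# Toy version of the factor example from the question:
# 100 rows per level, case weights 2, 2, and 1.
x <- data.frame(
  x1     = factor(rep(c("a", "b", "c"), each = 100)),
  weight = rep(c(2, 2, 1), each = 100)
)
minsize <- 150

# With case weights, the size of a node is the sum of its weights.
level_size <- tapply(x$weight, x$x1, sum)

# Hypothetical helper (not a partykit function): is a binary split that
# sends `left_levels` to one child and all other levels to the other
# child admissible with respect to minsize?
admissible <- function(left_levels) {
  left  <- sum(level_size[left_levels])
  right <- sum(level_size) - left
  min(left, right) >= minsize
}

admissible("a")  # a vs. b, c
admissible("c")  # c vs. a, b
```

Under these assumptions, separating level "a" is admissible, while putting level "c" into its own node is not, which matches the two-branch tree in the first example.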

As for the main question: it's hard to come up with an explanation without a reproducible example. I agree that partykit should behave in the way you describe - maybe there is one important but not obvious detail that we haven't noticed yet... It would help if you could come up with a small/simple artificial data set that replicates the problem.

Update

As pointed out in the comments: there is a reproducible example in the updated question. This helped me track down a bug in mob() in the handling of case weights. There was an error in the computation of the test statistic in the presence of case weights, leading to an incorrect split variable selection and stopping criterion. I have fixed the bug, and a new partykit development version is available from R-Forge at https://r-forge.r-project.org/r/?group_id=261. (Note, however, that R-Forge at the moment builds Windows binaries for R 3.3.x. If a more recent R version is used on Windows, please use type = "source" to install the source package - and make sure that you have the necessary Rtools installed.)
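For readers who have not installed a package from R-Forge before, a minimal sketch of the installation step might look as follows. The wrapper install_partykit_devel() is a hypothetical helper, and the repository URL is the commonly used R-Forge repository address, which is an assumption here since the text above only links to the project page.

```r
# Hypothetical convenience wrapper; nothing is installed until it is called.
install_partykit_devel <- function(from_source = FALSE) {
  install.packages(
    "partykit",
    repos = "https://R-Forge.R-project.org",  # assumed R-Forge repository URL
    # type = "source" builds the package locally (needs Rtools on Windows);
    # otherwise the platform default from getOption("pkgType") is used.
    type = if (from_source) "source" else getOption("pkgType")
  )
}
```

Calling install_partykit_devel(from_source = TRUE) would correspond to the type = "source" route mentioned above.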

In your example, let's set a random seed for exact reproducibility. The weighted data set is then set up as:

    set.seed(1)
    n <- 100
    x <- rbind(
      data.frame(tt = 1:n, x1 = runif(n, 0.0, 0.3), weight = 2, y = seq(1, length.out = n, by = 0.2) + rnorm(n, sd = 0.2)),
      data.frame(tt = 1:n, x1 = runif(n, 0.3, 0.7), weight = 2, y = seq(1, length.out = n, by = 0.4) + rnorm(n, sd = 0.2)),
      data.frame(tt = 1:n, x1 = runif(n, 0.7, 1.0), weight = 1, y = seq(1, length.out = n, by = 0.6) + rnorm(n, sd = 0.2))
    )

Then the weighted tree can be fitted as before. In this particular example the tree structure remains unaffected, but the test statistics and p-values of the parameter instability tests in each node change somewhat:

    library("partykit")
    tr1 <- lmtree(y ~ tt | x1, data = x, weights = weight)
    plot(tr1)

Adding the minsize = 150 argument has the expected effect of avoiding the split in node 3.

    tr2 <- lmtree(y ~ tt | x1, data = x, weights = weight, minsize = 150)
    plot(tr2)

To check that the latter actually does the right thing, we compare it with the tree for the explicitly expanded data. Thus, as the weights are regarded as case weights here, we can inflate the data set by repeating those observations with weights greater than 1.

    # Repeat each row as many times as its case weight
    xw <- x[rep(1:nrow(x), x$weight), ]
    tr3 <- lmtree(y ~ tt | x1, data = xw, minsize = 150)

The resulting coefficients are the same (up to small numerical differences):

    all.equal(coef(tr2), coef(tr3))
    ## [1] TRUE

And, more importantly, the test statistics and p-values in all nodes are the same:

    library("strucchange")
    all.equal(sctest(tr2), sctest(tr3))
    ## [1] TRUE
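The same equivalence between case weights and expanded data can be seen for a single linear model, without any tree. The sketch below uses only base R and a small artificial data set that is an assumption for illustration (it is not the data from the question). It also shows that a plain lm() fit does not count the weights as additional observations, which is the kind of detail that a case-weight interpretation has to handle explicitly.

```r
set.seed(1)
n <- 100

# Artificial data: the first half of the rows has case weight 2, the rest 1.
d <- data.frame(tt = 1:n, weight = rep(c(2, 1), each = n / 2))
d$y <- 1 + 0.2 * d$tt + rnorm(n, sd = 0.2)

# Weighted fit on the original rows.
m_w <- lm(y ~ tt, data = d, weights = weight)

# Unweighted fit on the data expanded by the case weights.
dw  <- d[rep(seq_len(nrow(d)), d$weight), ]
m_x <- lm(y ~ tt, data = dw)

# Coefficients agree up to numerical tolerance ...
all.equal(coef(m_w), coef(m_x))

# ... but the number of observations reported by the two fits differs.
c(weighted = nobs(m_w), expanded = nobs(m_x), sum_of_weights = sum(d$weight))
```

Because the sample size enters the test statistics, the expanded-data fit is the natural reference when checking how case weights are treated.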
