R - partykit: minsize option drops branches that exceed minsize
I'm using the lmtree() function from partykit to partition data using linear regressions. The regressions use a weight, and I want to ensure that each branch has a minimum total weight, which I specify with the minsize option. For instance, in the following example the tree has only 2 branches instead of 3 because x1=="c" has too small a weight to be in its own branch.
    n <- 100
    x <- rbind(
      data.frame(tt=1:n, x1="a", weight=2, y=seq(1,l=n,by=0.2)+rnorm(n,sd=.2)),
      data.frame(tt=1:n, x1="b", weight=2, y=seq(1,l=n,by=0.4)+rnorm(n,sd=.2)),
      data.frame(tt=1:n, x1="c", weight=1, y=seq(1,l=n,by=0.6)+rnorm(n,sd=.2))
    )
    x$x1 <- factor(x$x1)
    tr <- lmtree(y ~ tt | x1, data=x, weights=weight, minsize=150)

    Fitted party:
    [1] root
    |   [2] x1 in a: n = 200
    |       (Intercept)          tt
    |         0.7724903   0.2002023
    |   [3] x1 in b, c: n = 300
    |       (Intercept)          tt
    |         0.5759213   0.4659592
I have real-world data, which is unfortunately confidential, that is leading to behavior I do not understand. When I do not specify minsize, it builds a tree with 30 branches, and in every branch the total weight n is a large number. However, when I specify a minsize that is below the total weight of every branch in the first tree, the result is a new tree with many fewer branches. I would not have expected the tree to change at all, because it seems that minsize is not binding. Is there an explanation for this result?
Update

Providing an example:
    n <- 100
    x <- rbind(
      data.frame(tt=1:n, x1=runif(n, 0.0, 0.3), weight=2, y=seq(1,l=n,by=0.2)+rnorm(n,sd=.2)),
      data.frame(tt=1:n, x1=runif(n, 0.3, 0.7), weight=2, y=seq(1,l=n,by=0.4)+rnorm(n,sd=.2)),
      data.frame(tt=1:n, x1=runif(n, 0.7, 1.0), weight=1, y=seq(1,l=n,by=0.6)+rnorm(n,sd=.2))
    )
    tr <- lmtree(y ~ tt | x1, data=x, weights = weight)

    Fitted party:
    [1] root
    |   [2] x1 <= 0.29787: n = 200
    |       (Intercept)          tt
    |         0.8431985   0.1994021
    |   [3] x1 > 0.29787
    |   |   [4] x1 <= 0.69515: n = 200
    |   |       (Intercept)          tt
    |   |         0.6346980   0.3995678
    |   |   [5] x1 > 0.69515: n = 100
    |   |       (Intercept)          tt
    |   |         0.4792462   0.5987472
Now let's set minsize=150. The tree no longer has any splits, even though x1 <= 0.3 and x1 > 0.3 would work.
    tr <- lmtree(y ~ tt | x1, data=x, weights = weight, minsize=150)

    Fitted party:
    [1] root: n = 500
        (Intercept)          tt
          0.6870078   0.3593374
Two rules applied in mob() (the infrastructure underlying lmtree()) are important in this context and may benefit from more explicit discussion:
- If mob() selects a splitting variable at some stage that does not lead to a single admissible split (in terms of the minimal node size), then the splitting stops at that point. In contrast, ctree() performs a split if any significant test was detected, even if it is in the second-best variable or even non-significant. We could offer more granular control on this, and we have this on our wishlist for the upcoming revision of the package.

- By default, weights are interpreted as case weights, i.e., mob() thinks that there are w independent observations identical to the given one. Thus, the number of observations is the sum of the weights. Note that this also affects the significance tests, because the sample size increases!
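To make the case-weight rule concrete, here is a minimal base R sketch (no partykit needed) that adds up the weights per group, using the group sizes and weights from the question. The names grp, node_size and the stand-alone minsize variable are only for illustration; they are not part of the partykit API.

```r
# Three groups of 100 rows each, with the case weights from the question
n <- 100
x <- data.frame(
  grp    = rep(c("a", "b", "c"), each = n),  # illustrative group label
  weight = rep(c(2, 2, 1), each = n)
)

# Under the case-weight interpretation, the size of a node is the
# sum of the weights of the rows it contains, not the row count
node_size <- tapply(x$weight, x$grp, sum)
node_size
##   a   b   c 
## 200 200 100 

# Groups whose total weight is below the requested minimum node size
minsize <- 150
names(node_size)[node_size < minsize]
## [1] "c"
```

So a node containing only the third group would have size 100 and is not admissible with minsize = 150, although it contains 100 rows just like the other two groups.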
As for the main question: It's hard to come up with an explanation without a reproducible example. I agree that partykit should behave in the way you describe, but maybe there is one important but not obvious detail that we haven't noticed yet... It would help if you could come up with a small/simple artificial data set that replicates the problem.
Update
As pointed out in the comments: There is a reproducible example in the updated question. This helped me track down a bug in mob() in the handling of case weights. There was an error in the computation of the test statistic in the presence of case weights, leading to incorrect split variable selection and an incorrect stopping criterion. I have fixed the bug, and a new partykit development version is available from R-Forge at https://r-forge.r-project.org/r/?group_id=261. (Note, however, that R-Forge at the moment only builds Windows binaries for R 3.3.x. If a more recent Windows version is used, please use type = "source" to install the source package, and make sure to have the necessary Rtools installed.)
In the example I set a random seed for exact reproducibility. The weighted data set is set up as:
    set.seed(1)
    n <- 100
    x <- rbind(
      data.frame(tt=1:n, x1=runif(n, 0.0, 0.3), weight=2, y=seq(1,l=n,by=0.2)+rnorm(n,sd=.2)),
      data.frame(tt=1:n, x1=runif(n, 0.3, 0.7), weight=2, y=seq(1,l=n,by=0.4)+rnorm(n,sd=.2)),
      data.frame(tt=1:n, x1=runif(n, 0.7, 1.0), weight=1, y=seq(1,l=n,by=0.6)+rnorm(n,sd=.2))
    )
Then the weighted tree can be fitted as before. In this particular example the tree structure remains unaffected, but the test statistics and p-values of the parameter instability tests in each node change somewhat:
    library("partykit")
    tr1 <- lmtree(y ~ tt | x1, data = x, weights = weight)
    plot(tr1)
Adding the minsize = 150 argument has the expected effect of avoiding the split in node 3.
    tr2 <- lmtree(y ~ tt | x1, data = x, weights = weight, minsize = 150)
    plot(tr2)
To check that the latter does the right thing, we compare it with the tree for the explicitly expanded data. Since the weights are regarded as case weights here, we can inflate the data set by repeating those observations with weights greater than 1.
    xw <- x[rep(1:nrow(x), x$weight), ]
    tr3 <- lmtree(y ~ tt | x1, data = xw, minsize = 150)
The resulting coefficients are the same (up to small numerical differences):
    all.equal(coef(tr2), coef(tr3))
    ## [1] TRUE
And, more importantly, the test statistics and p-values in all nodes are the same:
    library("strucchange")
    all.equal(sctest(tr2), sctest(tr3))
    ## [1] TRUE
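The equivalence between integer case weights and explicit row repetition that this check relies on can also be seen in a single weighted regression, outside of partykit. The following is only a sketch with made-up data using base R's lm(); the names d, dw, fit_w and fit_x are illustrative.

```r
set.seed(1)

# Small artificial data set: the first 10 rows carry weight 2, the rest weight 1
d <- data.frame(tt = 1:20, weight = rep(c(2, 1), each = 10))
d$y <- 1 + 0.3 * d$tt + rnorm(20, sd = 0.2)

# Weighted least squares with integer weights
fit_w <- lm(y ~ tt, data = d, weights = weight)

# Ordinary least squares on the data with each row repeated 'weight' times
dw <- d[rep(seq_len(nrow(d)), d$weight), ]
fit_x <- lm(y ~ tt, data = dw)

# The coefficient estimates agree up to numerical tolerance
all.equal(coef(fit_w), coef(fit_x))

# The residual degrees of freedom do not agree
c(weighted = df.residual(fit_w), expanded = df.residual(fit_x))
```

The two fits should give the same coefficients, but lm() does not count the weights as additional observations: the residual degrees of freedom are 18 for the weighted fit and 28 for the expanded fit. Under the case-weight interpretation described above, mob() uses the sum of the weights as the number of observations, which is why the comparison with the expanded data is the relevant one for the test statistics.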