r - How do I Remove Data From Non-Scaled Dataframe against a Scaled One

September 15, 2014

I'm using R. Right now I'm scaling the original data, removing outliers with a z-score of 3 or more, and then filtering the unscaled data so that it contains only the non-outliers. I want to be left with a data frame that contains the non-scaled numbers after removing the outliers. These are my steps:

Steps
1. Create 2 data frames (x, y) of the same data.
2. Scale x and leave y unscaled.
3. Filter out the rows that have a z-score greater than 3 in x.
4. At this point, for example, x may have 95,000 rows while y still has 100,000.
5. Truncate y based on a unique column called row id, which I made sure is left unscaled in x. This unique column lets me match the remaining rows in x with the rows in y.
6. y should now have the same number of rows as x, but with the data unscaled. x has the scaled data.

At the moment I can't get the data unscaled. I tried using an unscale method and data frame comparison tools, but R complains that it cannot work on data frames of 2 different sizes. Is there a workaround for this?

Tries
I've tried dataframe <- dataframe[dataframe$row %in% remainingrows] but that left nothing in the data frame.
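For reference, the steps listed above can be sketched in base R on a small made-up data set. The column name `row_id`, the sizes, and the planted outlier are assumptions for illustration only, not the asker's real data. Note the comma inside the brackets: on a data frame, `y[cond, ]` selects rows, whereas `y[cond]` without the comma indexes columns.

```r
# Hypothetical data: 1,000 rows with a unique row_id and one planted outlier
set.seed(1)
y <- data.frame(row_id = 1:1000, value = runif(1000, max = 100))
y$value[10] <- 1e6  # planted extreme value so that something gets filtered

# Steps 1-2: x is a copy of y with the value column scaled; row_id is untouched
x <- y
x$value <- as.numeric(scale(x$value))

# Step 3: keep only the rows of x whose z-score is below 3
x <- x[x$value < 3, ]

# Step 5: keep only the rows of y whose row_id survived in x
y <- y[y$row_id %in% x$row_id, ]

# Step 6: both data frames now have the same number of rows
nrow(x) == nrow(y)
```

Here `y` still holds the unscaled values, while `x` holds the z-scores for the same rows.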

I would provide the data, but it has sensitive information. The data frame is long and has a unique row id that won't change during scaling.

If I understood correctly what you want to do, I'm suggesting a different approach. You don't have to use 2 data.frames for that; if you use the dplyr package, you can do it within a single line of code ... and presumably faster as well.

First I'm generating a data.frame with 100k rows, which has an id column (just a 1:100000 sequence) and a value column (random numbers).

Here's the code:

    library(dplyr)

    # generate data
    x <- data.frame(id=1:100000,value=runif(100000,max=100)*runif(10000,max=100))

    # take a look
    > head(x)
      id      value
    1  1  853.67941
    2  2  632.17472
    3  3 3089.60716
    4  4 8448.89408
    5  5 5307.75684
    6  6   19.07485

To filter out the outliers, I'm using a dplyr pipe, which lets you chain multiple operations with the pipe (%>%) operator. First I calculate the zscore, then filter out observations with a zscore bigger than three, and then drop the zscore column again to go back to the original format (of course you can keep it as well):

    xclean <- x %>%
      mutate(zscore=(value-mean(value)) / sd(value)) %>%
      filter(zscore < 3) %>%
      select(-matches('zscore'))

If you look at the number of rows, you'll see that the filtering worked:

    > cat('rows of x:',nrow(x),'- rows of xclean:',nrow(xclean))
    rows of x: 100000 - rows of xclean: 99575

while the data looks like the original data.frame:

    > head(xclean)
      id      value
    1  1  853.67941
    2  2  632.17472
    3  3 3089.60716
    4  4 8448.89408
    5  5 5307.75684
    6  6   19.07485

Finally, you can see which observations have been filtered out by comparing the ids of the 2 data.frames:

    > head(x$id[!is.element(x$id,xclean$id)],50)
     [1]    68    90   327   467   750   957  1090  1584  1978  2106  2306  3415  3511  3801  3855  4051
    [17]  4148  4244  4266  4511  4875  5262  5633  5944  5975  6116  6263  6631  6734  6773  7320  7577
    [33]  7619  7731  7735  7889  8073  8141  8207  8966  9200  9369  9994 10123 10538 11046 11090 11183
    [49] 11348 11371

Edit:

Of course, a version with 2 data frames is possible as well:

    y <- x

    # calculate zscore
    x$value <- (x$value - mean(x$value))/sd(x$value)

    # subset y
    y <- y[x$value<3,]

    # rows left out of the original 100k
    > nrow(y)
    [1] 99623
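The filters in this answer only test for a z-score below 3, so they drop unusually large values but keep unusually small ones. If outliers in both tails should go, one option is to compare the absolute z-score instead. The sketch below assumes that is the intent and uses made-up data with one planted outlier on each side; `scale()` is base R's centering-and-scaling function.

```r
# Hypothetical data with one extreme value planted in each tail
set.seed(42)
y <- data.frame(id = 1:1000, value = rnorm(1000))
y$value[1] <- 50    # planted high outlier
y$value[2] <- -50   # planted low outlier

# z-scores of the unscaled column; scale() returns a matrix, so flatten it
z <- as.numeric(scale(y$value))

# One-sided filter (as in the answer): the low outlier survives
one_sided <- y[z < 3, ]

# Two-sided filter: both planted outliers are dropped
two_sided <- y[abs(z) < 3, ]

c(one_sided = nrow(one_sided), two_sided = nrow(two_sided))
```

In both cases `y` itself is never scaled, so the result keeps the original values.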

Edit2:

Accounting for multiple value columns:

    # generate data
    set.seed(21)
    x <- data.frame(id=1:100000,
                    value1=runif(100000,max=100)*runif(10000,max=100),
                    value2=runif(100000,max=100)*runif(10000,max=100),
                    value3=runif(100000,max=100)*runif(10000,max=100))

    > head(x)
      id    value1     value2      value3
    1  1 2103.9228 5861.33650  713.885222
    2  2  341.8342 3940.68674  578.072141
    3  3 5346.2175  458.07089    1.577347
    4  4  400.1950 5881.05129 3090.618355
    5  5 7346.3321 4890.56501 8989.248186
    6  6 5305.5105   38.93093  517.509465

The dplyr solution:

    # make sure you've got a recent version of dplyr
    > packageVersion('dplyr')
    [1] ‘0.7.2’

    # define zscore function:
    zscore <- function(x){(x-mean(x))/sd(x)}

    # select variables (could be done manually with c())
    vars_to_process <- grep('value',colnames(x),value=T)

    # calculate zscores and filter
    xclean <- x %>%
      mutate_at(.vars=vars_to_process, .funs=funs(zs = zscore(.))) %>%
      filter_at(vars(matches('zs')),all_vars(.<3)) %>%
      select(-matches('zs'))

    > nrow(xclean)
    [1] 98832
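The `funs()`, `mutate_at()` and `filter_at()` calls reflect dplyr 0.7; `funs()` was deprecated in later releases, which offer `if_all()` for this kind of multi-column filter. A rough equivalent, assuming dplyr 1.0.4 or later is installed, might look like the sketch below. The smaller data set is made up for illustration and is not the data from the answer.

```r
library(dplyr)

# Hypothetical data: three value columns, smaller than in the answer
set.seed(21)
x <- data.frame(id = 1:1000,
                value1 = rexp(1000),
                value2 = rexp(1000),
                value3 = rexp(1000))

zscore <- function(v) (v - mean(v)) / sd(v)
vars_to_process <- grep('value', colnames(x), value = TRUE)

# Keep rows where every selected column has a z-score below 3.
# No helper columns are created, so nothing has to be dropped afterwards.
xclean <- x %>%
  filter(if_all(all_of(vars_to_process), ~ zscore(.x) < 3))

# Base R cross-check of the same condition
ix <- sapply(x[vars_to_process], function(v) zscore(v) < 3)
identical(xclean$id, x$id[rowSums(ix) == length(vars_to_process)])
```

Because the z-scores are only computed inside `filter()`, the value columns in `xclean` stay unscaled.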

Now a solution without dplyr (instead of using 2 dataframes, I'll generate a boolean index based on x):

    # select variables
    vars_to_process <- grep('value',colnames(x),value=T)

    # create index of zs < 3
    ix <- apply(x[vars_to_process],2,function(x) (x-mean(x))/sd(x) < 3)

    # filter rows
    xclean <- x[rowSums(ix) == length(vars_to_process),]

    > nrow(xclean)
    [1] 98832
