r - How do I Remove Data From a Non-Scaled Dataframe Against a Scaled One
I'm using R. Right now I'm scaling the original data, removing outliers with a z-score of 3 or more, and then filtering the unscaled data so that it contains only the non-outliers. I want to be left with a data frame that contains the non-scaled numbers after removing the outliers. These are my steps:
Steps
1. Create 2 data frames (x, y) of the same data.
2. Scale x and leave y unscaled.
3. Filter out the rows in x that have a z-score greater than 3.
4. Currently, for example, x may have 95,000 rows while y still has 100,000.
5. Truncate y based on a unique column called row ID, which I made sure is left unscaled in x. This unique column lets me match the remaining rows in x with the rows in y.
6. y should then have the same number of rows as x, but with the data unscaled. x has the scaled data.
At the moment I can't get the data unscaled. I tried using an unscale method and data frame comparison tools, but R complains that it cannot work on data frames of 2 different sizes. Is there a workaround for this?
Tries
I've tried
    dataframe <- dataframe[dataframe$row %in% remainingrows]
but that left nothing in the data frame.
I can't provide the data because it has sensitive information, but the data frame is long and has a unique row ID that won't change during scaling.
If I understood correctly what you want to do, I'm suggesting a different approach. You don't have to use 2 data.frames for that, and if you use the dplyr package, you can do it within a single line of code ... and presumably faster as well.
First I'm generating a data.frame with 100k rows, which has an id column (just a 1:100000 sequence) and a value column (random numbers).
Here's the code:
    library(dplyr)
    #generate data
    x <- data.frame(id=1:100000,value=runif(100000,max=100)*runif(10000,max=100))
    #take a look
    > head(x)
      id      value
    1  1  853.67941
    2  2  632.17472
    3  3 3089.60716
    4  4 8448.89408
    5  5 5307.75684
    6  6   19.07485
To filter out the outliers, I'm using a dplyr pipe, chaining multiple operations with the pipe (%>%) operator. First I calculate the zscore, then filter out the observations with a zscore bigger than three, and finally drop the zscore column again to go back to the original format (of course you can keep it as well):
    xclean <- x %>%
      mutate(zscore=(value-mean(value)) / sd(value)) %>%
      filter(zscore < 3) %>%
      select(-matches('zscore'))
If you look at the number of rows, you'll see that the filtering worked:
    > cat('rows of x:',nrow(x),'- rows of xclean:',nrow(xclean))
    rows of x: 100000 - rows of xclean: 99575
while the data still looks like the original data.frame:
    > head(xclean)
      id      value
    1  1  853.67941
    2  2  632.17472
    3  3 3089.60716
    4  4 8448.89408
    5  5 5307.75684
    6  6   19.07485
Finally, you can see which observations have been filtered out by comparing the ids of the 2 data.frames:
    > head(x$id[!is.element(x$id,xclean$id)],50)
     [1]    68    90   327   467   750   957  1090  1584  1978  2106  2306  3415  3511  3801  3855  4051
    [17]  4148  4244  4266  4511  4875  5262  5633  5944  5975  6116  6263  6631  6734  6773  7320  7577
    [33]  7619  7731  7735  7889  8073  8141  8207  8966  9200  9369  9994 10123 10538 11046 11090 11183
    [49] 11348 11371
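One thing to keep in mind about the filter above: zscore < 3 only drops unusually large values. If unusually small values should count as outliers too (an assumption about the intent, since the question only says "z-score of 3 or more"), comparing the absolute z-score covers both tails. A minimal base R sketch on made-up data, where demo, one_sided and two_sided are illustrative names:

```r
# Toy data (hypothetical): 50 ordinary values plus one very low and one very high value
demo <- data.frame(id = 1:52,
                   value = c(seq(40, 60, length.out = 50), -500, 600))

# z-score computed the same way as in the pipeline above
z <- (demo$value - mean(demo$value)) / sd(demo$value)

# one-sided filter, as above: only the very high value is dropped
one_sided <- demo[z < 3, ]

# two-sided filter: rows with an absolute z-score of 3 or more are dropped
two_sided <- demo[abs(z) < 3, ]

nrow(one_sided)
nrow(two_sided)
```

In a dplyr pipe the same idea would be filter(abs(zscore) < 3). For the product-of-uniforms data used in this answer the difference is unlikely to matter, because the values are bounded below by zero and skewed to the right.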
Edit:
Of course, a version with 2 data frames is possible as well:
    y <- x
    # calculate zscore
    x$value <- (x$value - mean(x$value))/sd(x$value)
    #subset y
    y <- y[x$value<3,]
    # 100k rows
    > nrow(y)
    [1] 99623
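The subset above relies on x and y still having their rows in the same order. If that can't be guaranteed, the rows can be matched on the id column instead, which is what step 5 of the question describes. A minimal sketch on made-up data (x_scaled, x_kept and y_clean are illustrative names, and the toy values are an assumption):

```r
# Toy data (hypothetical): y keeps the raw values, x_scaled gets the z-scores
y <- data.frame(id = 1:52,
                value = c(seq(40, 60, length.out = 50), 500, 600))
x_scaled <- y
x_scaled$value <- as.numeric(scale(y$value))  # scale() centers and divides by sd

# drop the outlier rows from the scaled copy, then reverse the row order
# to show that the match below does not depend on row position
x_kept <- x_scaled[x_scaled$value < 3, ]
x_kept <- x_kept[rev(seq_len(nrow(x_kept))), ]

# keep only the rows of y whose id survived in x_kept;
# the comma matters: y[cond, ] selects rows, y[cond] would select columns
y_clean <- y[y$id %in% x_kept$id, ]

nrow(x_kept)
nrow(y_clean)
```

Indexing a data frame without the comma, as in the attempt shown in the question, selects columns rather than rows, so it is worth checking that first if a subset comes back empty or throws an error.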
Edit2:
Accounting for multiple value columns:
    #generate data
    set.seed(21)
    x <- data.frame(id=1:100000,value1=runif(100000,max=100)*runif(10000,max=100),
                    value2=runif(100000,max=100)*runif(10000,max=100),
                    value3=runif(100000,max=100)*runif(10000,max=100))
    > head(x)
      id    value1     value2      value3
    1  1 2103.9228 5861.33650  713.885222
    2  2  341.8342 3940.68674  578.072141
    3  3 5346.2175  458.07089    1.577347
    4  4  400.1950 5881.05129 3090.618355
    5  5 7346.3321 4890.56501 8989.248186
    6  6 5305.5105   38.93093  517.509465
The dplyr solution:
    # make sure you have a recent version of dplyr
    > packageVersion('dplyr')
    [1] ‘0.7.2’
    # define zscore function:
    zscore <- function(x){(x-mean(x))/sd(x)}
    # select variables (could also be done manually with c())
    vars_to_process <- grep('value',colnames(x),value=TRUE)
    # calculate zscores and filter
    xclean <- x %>%
      mutate_at(.vars=vars_to_process, .funs=funs(zs = zscore(.))) %>%
      filter_at(vars(matches('zs')),all_vars(.<3)) %>%
      select(-matches('zs'))
    > nrow(xclean)
    [1] 98832
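The code above targets dplyr 0.7.2. In later releases funs() was deprecated and the scoped verbs mutate_at() and filter_at() were superseded. A rough equivalent for current versions, assuming dplyr 1.0.4 or later (where if_all() is available), could look like the sketch below. The smaller toy data set and the name xclean_new are illustrative and not part of the original answer:

```r
library(dplyr)

# Toy data (hypothetical): same shape as the example above, but only 1,000 rows
set.seed(21)
x <- data.frame(id = 1:1000,
                value1 = runif(1000, max = 100) * runif(1000, max = 100),
                value2 = runif(1000, max = 100) * runif(1000, max = 100),
                value3 = runif(1000, max = 100) * runif(1000, max = 100))

zscore <- function(x) (x - mean(x)) / sd(x)
vars_to_process <- grep('value', colnames(x), value = TRUE)

# keep rows where every selected column has a z-score below 3;
# no temporary zs columns are created, so nothing has to be dropped afterwards
xclean_new <- x %>%
  filter(if_all(all_of(vars_to_process), ~ zscore(.x) < 3))

nrow(xclean_new)
```

Because the z-scores are computed inside filter(), the result keeps the original column layout and the unscaled values.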
Now a solution without dplyr (instead of using 2 data frames, I'll generate a boolean index based on x):
    # select variables
    vars_to_process <- grep('value',colnames(x),value=TRUE)
    # create index of zs < 3
    ix <- apply(x[vars_to_process],2,function(x) (x-mean(x))/sd(x) < 3)
    #filter rows
    xclean <- x[rowSums(ix) == length(vars_to_process),]
    > nrow(xclean)
    [1] 98832
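As a side note, base R's scale() computes the same column-wise z-scores (center on the mean, divide by the standard deviation) for all columns at once, so it can stand in for the apply() call. A minimal sketch, assuming a smaller toy data set; zs and xclean_scale are illustrative names:

```r
# Toy data (hypothetical): three value columns, 1,000 rows
set.seed(21)
x <- data.frame(id = 1:1000,
                value1 = runif(1000, max = 100) * runif(1000, max = 100),
                value2 = runif(1000, max = 100) * runif(1000, max = 100),
                value3 = runif(1000, max = 100) * runif(1000, max = 100))

vars_to_process <- grep('value', colnames(x), value = TRUE)

# scale() returns a matrix of z-scores, one column per input column
zs <- scale(x[vars_to_process])

# keep rows where all z-scores are below 3; x itself stays unscaled
xclean_scale <- x[rowSums(zs < 3) == length(vars_to_process), ]

nrow(xclean_scale)
```

scale() also stores the column means and standard deviations in the scaled:center and scaled:scale attributes of its result, so scaled values can be converted back by multiplying by the stored scale and adding the stored center if that is ever needed.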