How to grep any format of percentages from a file in R?
I need a grep function to extract percentages from multiple files that are formatted differently. For example, they can be written in the following ways: (5%, 2.46%, 12.9%, 5 %, 2.46 %, 5 12.9 %, 5 percent, 2.46 percent, 5 per cent, ...etc), and I want to make sure there is at least a space in front and behind, to avoid extracting HTML codes or things like:
<td width="97%"></td>
My code below is not working correctly. I was thinking maybe there is a way to put in placeholders, like the asterisks below, for the variety of numbers I am looking for, like this:
    txt <- tryCatch(readLines(ds2[i, temp]), error = function(e) readLines(ds2[i, temp]))
    t <- grep("**.**%", txt)
Rather than writing a single regex expression, it may be easier to do this in multiple steps. Using the examples you gave:
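    x <- c('5%', '2.46%', '12.9%', '5 %', '2.46 %', '5 12.9 %', '5 percent',
           '2.46 percent', '5 per cent', 'etc..', '<td width="97%"></td>')

    get_pct <- function(x) {
      # drop percentages inside HTML attributes, e.g. width="97%"
      x <- gsub('="[^"]+%"', '', x)
      # rewrite "percent", "per cent", and " %" as a bare "%"
      x <- gsub('\\s*per\\s*cent|\\s*%', '%', x)
      # flag elements that contain a number followed by "%"
      is_pct <- grepl('\\d+(\\.\\d+)?%', x)
      # keep the number just before "%"; everything else becomes NA
      as.numeric(ifelse(is_pct, gsub('.*?(\\d+\\.?\\d*)%.*', '\\1', x), NA))
    }

    get_pct(x)
    [1] 5.00 2.46 12.90 5.00 2.46 12.90 5.00 2.46 5.00 NA NA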
Here's the same thing, step by step:
    # eliminate percentages from html tags
    x <- gsub('="[^"]+%"', '', x)
    x
    [1] "5%" "2.46%" "12.9%" "5 %" "2.46 %" "5 12.9 %"
    [7] "5 percent" "2.46 percent" "5 per cent" "etc.." "<td width></td>"

    # standardize the % symbol
    x <- gsub('\\s*per\\s*cent|\\s*%', '%', x)
    x
    [1] "5%" "2.46%" "12.9%" "5%" "2.46%" "5 12.9%"
    [7] "5%" "2.46%" "5%" "etc.." "<td width></td>"

    # find percentages (a number followed by %)
    is_pct <- grepl('\\d+(\\.\\d+)?%', x)

    # extract values
    x <- ifelse(is_pct, gsub('.*?(\\d+\\.?\\d*)%.*', '\\1', x), NA)
    as.numeric(x)
    [1] 5.00 2.46 12.90 5.00 2.46 12.90 5.00 2.46 5.00 NA NA
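The approach above returns at most one value per element and does not check for the whitespace on both sides that the question asks for. If lines read from a file can hold several percentages each, something like the following might work as a starting point. It is only a sketch: `extract_pcts` is a hypothetical helper name, not part of the original answer, and it assumes that "a space in front and behind" can be approximated with PCRE lookarounds (`perl = TRUE`) that accept whitespace or the start/end of the line.

```r
# Hypothetical helper (an assumption, not from the original answer):
# return every percentage found in each element of `lines`
# as a list of numeric vectors, one vector per line.
extract_pcts <- function(lines) {
  # A number, optional spaces, then "%", "percent", or "per cent".
  # (?<!\S) and (?!\S) require whitespace or a string edge on each side,
  # which skips HTML attributes such as width="97%".
  pattern <- "(?<!\\S)\\d+(?:\\.\\d+)?\\s*(?:%|per\\s*cent)(?!\\S)"
  hits <- regmatches(
    lines,
    gregexpr(pattern, lines, perl = TRUE, ignore.case = TRUE)
  )
  # Strip the trailing unit and convert what is left to numeric.
  lapply(hits, function(h) {
    as.numeric(sub("\\s*(%|per\\s*cent)$", "", h, ignore.case = TRUE))
  })
}

# Usage on a throwaway file; the file contents are made up for illustration.
tmp <- tempfile(fileext = ".txt")
writeLines(c("Sales rose 5 % and costs fell 2.46 percent this year",
             '<td width="97%"></td>',
             "about 12.9% of users"), tmp)

# Fall back to an empty vector if the file cannot be read.
txt <- tryCatch(readLines(tmp), error = function(e) character(0))
res <- extract_pcts(txt)
print(res)  # the HTML line should yield an empty numeric vector
unlink(tmp)
```

One trade-off of the strict whitespace rule: a percentage followed directly by punctuation, such as `5%,` or `5%.`, is skipped. If that matters for your files, the trailing `(?!\\S)` could be relaxed, for example to `(?![\\w"])`, at the cost of letting through a few more false positives.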