scala - How to return a boolean if a column contains an integer value, instead of scanning millions of records, using a Spark DataFrame


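import org.apache.spark.sql.functions.col

// count() scans every row, even after a match has already been seen
val isIntContains = dataframe.filter(col(colname).rlike("^\\d+")).count()
if (isIntContains > 0) {
  print("it contains integer value in column provided")
}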

where colname is the column name, passed dynamically.

Here it iterates over the rows up to the last one, continuing even after it finds an integer value. I want to write logic that returns true/false if at least one value in the column is an integer.

Regarding this question: indeed, it's not necessary to scan the entire dataset, because you want to break off the search as soon as one integer has been found.
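To make the difference between counting and stopping early concrete, here is a minimal sketch in plain Scala, without Spark. The names `ShortCircuitSketch`, `isInteger` and `CountingIterator` are hypothetical and exist only for this illustration; the sample values are made up. It counts how many elements each approach actually reads.

```scala
object ShortCircuitSketch {
  // Hypothetical helper: true if the whole string consists of ASCII digits.
  def isInteger(s: String): Boolean = s != null && s.matches("\\d+")

  // Iterator wrapper that records how many elements have been read.
  final class CountingIterator[A](underlying: Iterator[A]) extends Iterator[A] {
    var inspected: Int = 0
    def hasNext: Boolean = underlying.hasNext
    def next(): A = { inspected += 1; underlying.next() }
  }

  // Made-up sample data standing in for one column of values.
  val values: Vector[String] = Vector("abc", "42", "x1", "007", "zzz")

  // Counting the matches has to read every element.
  val fullScan = new CountingIterator(values.iterator)
  val matchCount: Int = fullScan.count(isInteger)

  // exists stops at the first element that satisfies the predicate.
  val shortScan = new CountingIterator(values.iterator)
  val found: Boolean = shortScan.exists(isInteger)
}
```

With this data, `fullScan.inspected` ends up equal to the number of values, while `shortScan.inspected` stops at the position of the first all-digit string. The Spark variants discussed here apply the same idea, per partition or per task, rather than over one local iterator.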

In the DataFrame API you can try:

// take(1) lets Spark stop as soon as a single matching row has been found
val isIntContains: Boolean = dataframe.filter(col(colname).rlike("^\\d+")).take(1).nonEmpty

But I've found it's faster using the RDD API:

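// Each partition stops at its first match and emits at most one element.
val isIntContains: Boolean = !dataframe.rdd
  .mapPartitions { rows =>
    rows.find(row => row.getAs[String](colname).matches("^\\d+")) match {
      case Some(_) => Iterator(1)
      case None    => Iterator.empty
    }
  }
  .isEmpty // isEmpty is true when no partition found a match, hence the negation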
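One caveat when comparing the two variants: as far as I know, `rlike` performs a substring search (Java's `Matcher.find`), whereas `String.matches` requires the whole string to match. With the pattern `"^\\d+"` the DataFrame version therefore also accepts values such as `12abc`, while the RDD version does not. The sketch below uses only `java.util.regex` to show the difference; `RegexSemanticsSketch`, `findSemantics` and `matchSemantics` are hypothetical names for illustration.

```scala
import java.util.regex.Pattern

object RegexSemanticsSketch {
  // Substring search: succeeds if the pattern matches anywhere it is allowed to start.
  def findSemantics(regex: String, s: String): Boolean =
    Pattern.compile(regex).matcher(s).find()

  // Whole-string match: the pattern must consume the entire input.
  def matchSemantics(regex: String, s: String): Boolean =
    s.matches(regex)

  val startsWithDigits = "^\\d+"  // anchored at the start only
  val onlyDigits       = "^\\d+$" // anchored at both ends

  val findLoose   = findSemantics(startsWithDigits, "12abc")  // digits at the start are enough
  val matchLoose  = matchSemantics(startsWithDigits, "12abc") // trailing letters break the match
  val findStrict  = findSemantics(onlyDigits, "12abc")        // end anchor rejects the letters
  val matchStrict = matchSemantics(onlyDigits, "123")         // all digits, so it matches
}
```

If the goal is "the whole value is an integer", anchoring the pattern at both ends makes both variants agree.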

I've tried the above using randomly generated alphanumeric strings of length 5 (so the chances are quite low that a value is an integer):

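import spark.implicits._ // assumes a SparkSession named `spark`; needed for toDF

// One million random 5-character alphanumeric strings in a single column "i"
val dataframe = sparkContext.parallelize(
    (1 to 1000000).map(_ => scala.util.Random.alphanumeric.take(5).mkString(""))
  )
  .toDF("i")
  .repartition(10)
  .cache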

Now, if I check this dataframe for valid integers using your solution (i.e. using count), it takes about 1.5 s, while it takes 0.7 s using the first solution (DataFrame) and 0.6 s using the second solution (RDD).
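The timings above were not accompanied by measurement code. A minimal sketch of how such numbers could be collected is shown below; `TimingSketch` and `timed` are hypothetical names, and this is a rough wall-clock measurement, not a rigorous benchmark.

```scala
object TimingSketch {
  // Hypothetical helper: evaluates `block` once and returns its result
  // together with the elapsed wall-clock time in milliseconds.
  def timed[T](block: => T): (T, Double) = {
    val start = System.nanoTime()
    val result = block
    val elapsedMs = (System.nanoTime() - start) / 1e6
    (result, elapsedMs)
  }

  // Example usage with a cheap local computation; with Spark, the block
  // would be one of the three checks discussed above.
  val (sum, elapsedMs) = timed { (1 to 1000).sum }
}
```

Because `cache` is lazy, it is probably worth running one action on the dataframe (for example a `count`) before timing, so that the first measured variant does not also pay for materializing the cache.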

