scala - How to return a Boolean if a column contains an integer value, instead of searching millions of records, using a Spark DataFrame
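    import org.apache.spark.sql.functions.col

    // counts every matching row, so the whole column is scanned
    var isintcontains = dataframe.filter(col(colname).rlike("^\\d+")).count()
    if (isintcontains > 0) {
      print("it contains integer value in column provided")
    }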
where colname is the column name, passed in dynamically.
Here, it iterates over the rows up to the last one, continuing even if it finds an integer value. I want to write logic that returns true/false as soon as at least one value in the column is an integer.
Regarding this question: indeed, it's not necessary to scan the entire dataset, because you want to break off the search as soon as one integer has been found.
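The difference between counting and stopping early can be illustrated without Spark. The following is a minimal sketch on plain Scala collections; the object name `EarlyExitDemo`, the helper `isInteger`, and the sample data are all hypothetical and only serve to show how many elements each strategy inspects.

```scala
object EarlyExitDemo {
  // Hypothetical helper: true when the whole string consists of digits.
  def isInteger(s: String): Boolean = s.matches("\\d+")

  // Full count: returns (found, number of elements inspected).
  def viaCount(values: Seq[String]): (Boolean, Int) = {
    var inspected = 0
    val n = values.iterator.count { v => inspected += 1; isInteger(v) }
    (n > 0, inspected)
  }

  // Early exit: `exists` stops at the first match.
  def viaExists(values: Seq[String]): (Boolean, Int) = {
    var inspected = 0
    val found = values.iterator.exists { v => inspected += 1; isInteger(v) }
    (found, inspected)
  }

  // Made-up sample data; the second element is the first integer.
  val sample: Seq[String] = Seq("ab1", "42", "x9z", "7", "hello")
}
```

On this sample, `viaCount` inspects all 5 elements while `viaExists` stops after 2; both report that an integer is present.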
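In the DataFrame API you can try:

    // ^\\d+$ matches only values that consist entirely of digits;
    // take(1) lets Spark stop once a single matching row has been found
    var isintcontains: Boolean = dataframe.filter(col(colname).rlike("^\\d+$")).take(1).size > 0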
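But I've found it's faster using the RDD API:

    var isintcontains: Boolean = !dataframe.rdd
      .mapPartitions(rows => {
        // emit one marker per partition as soon as an integer value is found
        rows.find(row => row.getAs[String](colname).matches("^\\d+")) match {
          case Some(_) => Iterator(1)
          case None    => Iterator.empty
        }
      })
      .isEmpty() // an empty result means no integer was found, hence the negation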
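To make the per-partition logic easier to follow, here is a rough sketch of the same idea without Spark. It assumes plain nested sequences as a stand-in for the RDD's partitions, and the names `PartitionSearchSketch`, `markerIfFound`, and `containsInteger` are made up for illustration; it is not how Spark schedules the work internally.

```scala
object PartitionSearchSketch {
  // Stand-in for the function passed to mapPartitions: emits one marker
  // if the partition holds an all-digit string, otherwise nothing.
  def markerIfFound(rows: Iterator[String]): Iterator[Int] =
    rows.find(_.matches("\\d+")) match {
      case Some(_) => Iterator(1)
      case None    => Iterator.empty
    }

  // Plain collections stand in for partitions (an assumption).
  // The lazy iterator stops at the first partition that yields a marker.
  def containsInteger(partitions: Seq[Seq[String]]): Boolean =
    partitions.iterator.flatMap(p => markerIfFound(p.iterator)).nonEmpty
}
```

For example, `PartitionSearchSketch.containsInteger(Seq(Seq("ab", "c1"), Seq("x", "123")))` is true, because the second partition contains "123". The column contains an integer exactly when the collection of markers is not empty.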
I've tried the above using randomly generated alphanumeric strings of length 5 (so the chances are quite low that the result is an integer):

    // assumes `import spark.implicits._` is in scope (spark being the SparkSession), needed for toDF
    val dataframe = sparkcontext.parallelize(
        (1 to 1000000).map(_ => scala.util.Random.alphanumeric.take(5).mkString(""))
      )
      .toDF("i")
      .repartition(10)
      .cache
Now, if I check the DataFrame for valid integers using the solution from the question (i.e. using count), it takes ~1.5s, while it takes 0.7s using the first solution (DataFrame) and 0.6s using the second solution (RDD).
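The post does not say how these timings were taken. One simple way to get comparable wall-clock numbers is a small helper like the one below; `TimingSketch` and `time` are hypothetical names, and a single run like this includes JVM warm-up and scheduling noise, so it is only a rough measurement.

```scala
object TimingSketch {
  // Runs `block` once and returns its result together with the
  // elapsed wall-clock time in milliseconds.
  def time[A](block: => A): (A, Double) = {
    val start = System.nanoTime()
    val result = block
    val elapsedMs = (System.nanoTime() - start) / 1e6
    (result, elapsedMs)
  }
}
```

Usage would look like `val (found, ms) = TimingSketch.time { /* one of the checks above */ }`, repeated a few times per variant before comparing.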