scala - Break big spark sql query into smaller queries and merge it -
i have big spark sql statement i'm trying break smaller chunks better code readability. not want join merge result.
current working sql statement-
val dfs = x.map(field => spark.sql(s" select ‘test’ table_name, '$field' column_name, min($field) min_value, max($field) max_value, approx_count_distinct($field) unique_value_count, ( select 100 * approx_count_distinct($field)/count(1) tempdftable ) perc tempdftable ”))
i'm trying take below query out of above sql
(select 100 * approx_count_distinct($field)/count(1) tempdftable) perc
with logic -
val perce = x.map(field => spark.sql(s"(select 100 * approx_count_distinct($field)/count(1) parquetdftable)"))
and later merge val perce first big sql statement below statement not working -
val dfs = x.map(field => spark.sql(s" select ‘test’ table_name, '$field' column_name, min($field) min_value, max($field) max_value, approx_count_distinct($field) unique_value_count, '"+perce+ "' tempdftable ”))
how write ?
why not go in , convert entire expression spark code?
import spark.implicits._ import org.apache.spark.sql.functions._ val fraction = udf((approxcount: double, totalcount: double) => 100 * approxcount/totalcount) val fields = seq("cola", "colb", "colc") val dfs = fields.map(field => { tempdftable .select(min(field) "min_value", max(field) "max_value", approx_count_distinct(field) "unique_value_count", count(field) "total_count") .withcolumn("table_name", lit("test")) .withcolumn("column_name", lit(field)) .withcolumn("perc", fraction('unique_value_count, 'total_count)) .select('table_name, 'column_name, 'min_value, 'max_value, 'unique_value_count, 'perc) }) val df = dfs.reduce(_ union _)
on test example this:
val tempdftable = spark.sparkcontext.parallelize(list((3.0, 7.0, 2.0), (1.0, 4.0, 10.0), (3.0, 7.0, 2.0), (5.0, 0.0, 2.0))).todf("cola", "colb", "colc") tempdftable.show +----+----+----+ |cola|colb|colc| +----+----+----+ | 3.0| 7.0| 2.0| | 1.0| 4.0|10.0| | 3.0| 7.0| 2.0| | 5.0| 0.0| 2.0| +----+----+----+
we get
df.show +----------+-----------+---------+---------+------------------+----+ |table_name|column_name|min_value|max_value|unique_value_count|perc| +----------+-----------+---------+---------+------------------+----+ | test| cola| 1.0| 5.0| 3|75.0| | test| colb| 0.0| 7.0| 3|75.0| | test| colc| 2.0| 10.0| 2|50.0| +----------+-----------+---------+---------+------------------+----+
Comments
Post a Comment