scala - Break big spark sql query into smaller queries and merge it -


I have a big Spark SQL statement that I am trying to break into smaller chunks for better code readability. I do not want to use a join to merge the results.

The current working SQL statement:

val dfs = x.map(field => spark.sql(s"""
  select 'test' table_name,
         '$field' column_name,
         min($field) min_value,
         max($field) max_value,
         approx_count_distinct($field) unique_value_count,
         (select 100 * approx_count_distinct($field) / count(1) from tempdftable) perc
  from tempdftable"""))

I am trying to take the query below out of the SQL above:

(select 100 * approx_count_distinct($field)/count(1) from tempdftable) perc

with this logic:

val perce = x.map(field => spark.sql(s"select 100 * approx_count_distinct($field)/count(1) from parquetdftable"))

and later merge val perce back into the first big SQL statement, but the statement below is not working:

val dfs = x.map(field => spark.sql(s"select 'test' table_name, '$field' column_name, min($field) min_value, max($field) max_value, approx_count_distinct($field) unique_value_count, '" + perce + "' perc from tempdftable"))

How do I write this?
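One way to make the extraction itself work (a sketch, not the only option): run the percentage sub-query eagerly, bring its single value back to the driver, and interpolate that scalar, not the DataFrame, into the outer SQL string. The sample data and the temp-view registration below are hypothetical stand-ins for the question's `tempdftable`.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("split-query").getOrCreate()
import spark.implicits._

// Hypothetical stand-in data registered under the table name from the question.
Seq((3.0, 7.0), (1.0, 4.0), (3.0, 7.0)).toDF("cola", "colb")
  .createOrReplaceTempView("tempdftable")

val x = Seq("cola", "colb")

val dfs = x.map { field =>
  // Evaluate the extracted sub-query first; first() brings the single
  // aggregated row back to the driver as a plain Scala Double.
  val perc = spark
    .sql(s"select 100 * approx_count_distinct($field) / count(1) from tempdftable")
    .first()
    .getDouble(0)

  // Interpolate the scalar value (not the DataFrame) into the outer statement.
  spark.sql(s"""
    select 'test' table_name,
           '$field' column_name,
           min($field) min_value,
           max($field) max_value,
           approx_count_distinct($field) unique_value_count,
           $perc perc
    from tempdftable""")
}
```

Interpolating `perce` directly fails because it is a `Seq[DataFrame]`: string concatenation only pastes its `toString` into the SQL text, producing invalid syntax.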

Why not go the other way around and convert the entire expression into Spark code?

import spark.implicits._
import org.apache.spark.sql.functions._

val fraction = udf((approxCount: Double, totalCount: Double) => 100 * approxCount / totalCount)

val fields = Seq("cola", "colb", "colc")

val dfs = fields.map(field => {
  tempDFTable
    .select(min(field) as "min_value", max(field) as "max_value", approx_count_distinct(field) as "unique_value_count", count(field) as "total_count")
    .withColumn("table_name", lit("test"))
    .withColumn("column_name", lit(field))
    .withColumn("perc", fraction('unique_value_count, 'total_count))
    .select('table_name, 'column_name, 'min_value, 'max_value, 'unique_value_count, 'perc)
})

val df = dfs.reduce(_ union _)
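A small variant of this approach (a sketch, with made-up sample data): the ratio does not need a UDF at all. Plain column arithmetic expresses it directly and, unlike an opaque UDF, stays visible to Catalyst's optimizer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.master("local[*]").appName("no-udf").getOrCreate()
import spark.implicits._

// Made-up sample data standing in for the real table.
val tempDFTable = Seq((3.0, 7.0), (1.0, 4.0), (3.0, 7.0), (5.0, 0.0)).toDF("cola", "colb")

val fields = Seq("cola", "colb")

val dfs = fields.map { field =>
  tempDFTable
    .select(
      min(field) as "min_value",
      max(field) as "max_value",
      approx_count_distinct(field) as "unique_value_count",
      count(field) as "total_count")
    .withColumn("table_name", lit("test"))
    .withColumn("column_name", lit(field))
    // Column arithmetic instead of a UDF; `/` on columns yields a double.
    .withColumn("perc", lit(100) * col("unique_value_count") / col("total_count"))
    .select("table_name", "column_name", "min_value", "max_value", "unique_value_count", "perc")
}

val df = dfs.reduce(_ union _)
```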

On a test example like this:

val tempDFTable = spark.sparkContext.parallelize(List((3.0, 7.0, 2.0), (1.0, 4.0, 10.0), (3.0, 7.0, 2.0), (5.0, 0.0, 2.0))).toDF("cola", "colb", "colc")

tempDFTable.show

+----+----+----+
|cola|colb|colc|
+----+----+----+
| 3.0| 7.0| 2.0|
| 1.0| 4.0|10.0|
| 3.0| 7.0| 2.0|
| 5.0| 0.0| 2.0|
+----+----+----+

we get:

df.show

+----------+-----------+---------+---------+------------------+----+
|table_name|column_name|min_value|max_value|unique_value_count|perc|
+----------+-----------+---------+---------+------------------+----+
|      test|       cola|      1.0|      5.0|                 3|75.0|
|      test|       colb|      0.0|      7.0|                 3|75.0|
|      test|       colc|      2.0|     10.0|                 2|50.0|
+----------+-----------+---------+---------+------------------+----+
