java - Relational transformations in Spark -

August 15, 2012

i'm trying using spark dataset load quite big data of (let's say) persons subset data looks follows.

|age|maritalstatus|    name|sex| +---+-------------+--------+---+ | 35|            m|  joanna|  f| | 25|            s|isabelle|  f| | 19|            s|    andy|  m| | 70|            m|  robert|  m| +---+-------------+--------+---+

my need have relational transformations 1 column derives value other column(s). example, based on "age" & "sex" of each person record need put mr or ms/mrs in front of each "name" attribute. example, person "age" on 60, need mark him or senior citizen (derived column "seniorcitizen" y).

my final need transformed data follows:

+---+-------------+---------------------------+---+ |age|maritalstatus|         name|seniorcitizen|sex| +---+-------------+---------------------------+---+ | 35|            m|  mrs. joanna|            n|  f| | 25|            s| ms. isabelle|            n|  f| | 19|            s|     mr. andy|            n|  m| | 70|            m|   mr. robert|            y|  m| +---+-------------+--------+------------------+---+

most of transformations spark provides quite static , not dyanmic. example, defined in examples here , here.

i'm using spark datasets because i'm loading relational data source if may suggest better way of doing using plain rdds, please do.

you can use withcolumn add new column, seniorcitizen using where clause , updating name can use user defined function (udf) below

import spark.implicits._  import org.apache.spark.sql.functions._ //create dummy data  val df = seq((35, "m", "joanna", "f"),     (25, "s", "isabelle", "f"),     (19, "s", "andy", "m"),     (70, "m", "robert", "m")   ).todf("age", "maritalstatus", "name", "sex")  // create udf update name according age , sex val append = udf((name: string, maritalstatus:string, sex: string) => {   if (sex.equalsignorecase("f") &&  maritalstatus.equalsignorecase("m")) s"mrs. ${name}"   else if (sex.equalsignorecase("f")) s"ms. ${name}"   else s"mr. ${name}" })  //add 2 new columns using withcolumn   df.withcolumn("name", append($"name", $"maritalstatus", $"sex"))   .withcolumn("seniorcitizen", when($"age" < 60, "n").otherwise("y")).show

output:

+---+-------------+------------+---+-------------+ |age|maritalstatus|        name|sex|seniorcitizen| +---+-------------+------------+---+-------------+ | 35|            m| mrs. joanna|  f|            n| | 25|            s|ms. isabelle|  f|            n| | 19|            s|    mr. andy|  m|            n| | 70|            m|  mr. robert|  m|            y| +---+-------------+------------+---+-------------+

edit:

here the output without using udf

df.withcolumn("name",   when($"sex" === "f", when($"maritalstatus" === "m",  concat(lit("ms. "), df("name"))).otherwise(concat(lit("ms. "), df("name"))))     .otherwise(concat(lit("ms. "), df("name"))))   .withcolumn("seniorcitizen", when($"age" < 60, "n").otherwise("y"))

hope helps!

Search This Blog

RT

java - Relational transformations in Spark -

Comments

Post a Comment

Popular posts from this blog

Ansible warning on jinja2 braces on when -

Parsing a protocol message from Go by Java -

html - How to custom Bootstrap grid height? -