java - Relational transformations in Spark


I'm trying to use a Spark Dataset to load a fairly large set of data about (let's say) persons. A subset of the data looks as follows.

    +---+-------------+--------+---+
    |age|maritalstatus|    name|sex|
    +---+-------------+--------+---+
    | 35|            m|  joanna|  f|
    | 25|            s|isabelle|  f|
    | 19|            s|    andy|  m|
    | 70|            m|  robert|  m|
    +---+-------------+--------+---+

I need relational transformations in which one column derives its value from other column(s). For example, based on the "age" & "sex" of each person record, I need to put Mr. or Ms./Mrs. in front of each "name" attribute. As another example, for a person whose "age" is over 60, I need to mark him or her as a senior citizen (derived column "seniorcitizen" set to y).

The transformed data I ultimately need looks as follows:

    +---+-------------+-------------+-------------+---+
    |age|maritalstatus|         name|seniorcitizen|sex|
    +---+-------------+-------------+-------------+---+
    | 35|            m|  mrs. joanna|            n|  f|
    | 25|            s| ms. isabelle|            n|  f|
    | 19|            s|     mr. andy|            n|  m|
    | 70|            m|   mr. robert|            y|  m|
    +---+-------------+-------------+-------------+---+

Most of the transformations Spark provides are quite static and not dynamic, for example those defined in the examples here and here.

I'm using Spark Datasets because I'm loading from a relational data source, but if you can suggest a better way of doing this using plain RDDs, please do.

You can use withColumn to add a new column. For seniorcitizen, use a when clause; for updating name, you can use a user-defined function (UDF), as below.

    import spark.implicits._
    import org.apache.spark.sql.functions._

    // Create dummy data
    val df = Seq(
      (35, "m", "joanna", "f"),
      (25, "s", "isabelle", "f"),
      (19, "s", "andy", "m"),
      (70, "m", "robert", "m")
    ).toDF("age", "maritalstatus", "name", "sex")

    // Create a UDF that prefixes the name with a title based on marital status and sex
    val append = udf((name: String, maritalstatus: String, sex: String) => {
      if (sex.equalsIgnoreCase("f") && maritalstatus.equalsIgnoreCase("m")) s"mrs. ${name}"
      else if (sex.equalsIgnoreCase("f")) s"ms. ${name}"
      else s"mr. ${name}"
    })

    // Add the two derived columns using withColumn
    df.withColumn("name", append($"name", $"maritalstatus", $"sex"))
      .withColumn("seniorcitizen", when($"age" < 60, "n").otherwise("y"))
      .show

Output:

    +---+-------------+------------+---+-------------+
    |age|maritalstatus|        name|sex|seniorcitizen|
    +---+-------------+------------+---+-------------+
    | 35|            m| mrs. joanna|  f|            n|
    | 25|            s|ms. isabelle|  f|            n|
    | 19|            s|    mr. andy|  m|            n|
    | 70|            m|  mr. robert|  m|            y|
    +---+-------------+------------+---+-------------+

Edit:

Here is how to get the same output without using a UDF:

    // Nested when/otherwise: married women get "mrs.", other women "ms.", everyone else "mr."
    df.withColumn("name",
        when($"sex" === "f",
          when($"maritalstatus" === "m", concat(lit("mrs. "), df("name")))
            .otherwise(concat(lit("ms. "), df("name"))))
          .otherwise(concat(lit("mr. "), df("name"))))
      .withColumn("seniorcitizen", when($"age" < 60, "n").otherwise("y"))
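Since the question is about Datasets specifically, a typed Dataset with map is another option when the rules get more involved than a few when clauses. The following is only a minimal sketch, not a definitive implementation: the case classes Person and TitledPerson, the object TypedTransform, and the local SparkSession settings are all hypothetical names chosen for illustration, and the senior-citizen cutoff (60 and above is y) simply mirrors the when clause used above.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical case classes mirroring the columns in the question.
case class Person(age: Int, maritalstatus: String, name: String, sex: String)
case class TitledPerson(age: Int, maritalstatus: String, name: String,
                        sex: String, seniorcitizen: String)

object TypedTransform {
  // Pure function holding the derivation rules; it does not touch Spark,
  // so it can be unit tested on its own.
  def transform(p: Person): TitledPerson = {
    val title =
      if (p.sex.equalsIgnoreCase("f") && p.maritalstatus.equalsIgnoreCase("m")) "mrs."
      else if (p.sex.equalsIgnoreCase("f")) "ms."
      else "mr."
    // Assumption: 60 and above counts as senior, matching when($"age" < 60, "n").
    val senior = if (p.age >= 60) "y" else "n"
    TitledPerson(p.age, p.maritalstatus, s"$title ${p.name}", p.sex, senior)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session for experimentation only.
    val spark = SparkSession.builder()
      .appName("typed-transform")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val people = Seq(
      Person(35, "m", "joanna", "f"),
      Person(25, "s", "isabelle", "f"),
      Person(19, "s", "andy", "m"),
      Person(70, "m", "robert", "m")
    ).toDS()

    // Apply the rules row by row; the result is a Dataset[TitledPerson].
    people.map(p => transform(p)).show()

    spark.stop()
  }
}
```

The trade-off is that a map over case classes is opaque to Spark's optimizer in the same way a UDF is, so the when/otherwise version is usually preferable when the rules can be expressed with built-in column functions.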

Hope this helps!

