java - Relational transformations in Spark -
i'm trying using spark dataset load quite big data of (let's say) persons subset data looks follows.
|age|maritalstatus| name|sex| +---+-------------+--------+---+ | 35| m| joanna| f| | 25| s|isabelle| f| | 19| s| andy| m| | 70| m| robert| m| +---+-------------+--------+---+
my need have relational transformations 1 column derives value other column(s). example, based on "age" & "sex" of each person record need put mr or ms/mrs in front of each "name" attribute. example, person "age" on 60, need mark him or senior citizen (derived column "seniorcitizen" y).
my final need transformed data follows:
+---+-------------+---------------------------+---+ |age|maritalstatus| name|seniorcitizen|sex| +---+-------------+---------------------------+---+ | 35| m| mrs. joanna| n| f| | 25| s| ms. isabelle| n| f| | 19| s| mr. andy| n| m| | 70| m| mr. robert| y| m| +---+-------------+--------+------------------+---+
most of transformations spark provides quite static , not dyanmic. example, defined in examples here , here.
i'm using spark datasets because i'm loading relational data source if may suggest better way of doing using plain rdds, please do.
you can use withcolumn
add new column, seniorcitizen
using where
clause , updating name
can use user defined function (udf)
below
import spark.implicits._ import org.apache.spark.sql.functions._ //create dummy data val df = seq((35, "m", "joanna", "f"), (25, "s", "isabelle", "f"), (19, "s", "andy", "m"), (70, "m", "robert", "m") ).todf("age", "maritalstatus", "name", "sex") // create udf update name according age , sex val append = udf((name: string, maritalstatus:string, sex: string) => { if (sex.equalsignorecase("f") && maritalstatus.equalsignorecase("m")) s"mrs. ${name}" else if (sex.equalsignorecase("f")) s"ms. ${name}" else s"mr. ${name}" }) //add 2 new columns using withcolumn df.withcolumn("name", append($"name", $"maritalstatus", $"sex")) .withcolumn("seniorcitizen", when($"age" < 60, "n").otherwise("y")).show
output:
+---+-------------+------------+---+-------------+ |age|maritalstatus| name|sex|seniorcitizen| +---+-------------+------------+---+-------------+ | 35| m| mrs. joanna| f| n| | 25| s|ms. isabelle| f| n| | 19| s| mr. andy| m| n| | 70| m| mr. robert| m| y| +---+-------------+------------+---+-------------+
edit:
here the output without using udf
df.withcolumn("name", when($"sex" === "f", when($"maritalstatus" === "m", concat(lit("ms. "), df("name"))).otherwise(concat(lit("ms. "), df("name")))) .otherwise(concat(lit("ms. "), df("name")))) .withcolumn("seniorcitizen", when($"age" < 60, "n").otherwise("y"))
hope helps!
Comments
Post a Comment