How to improve Spark performance when joining DataFrames on a known key


I have two big Parquet DataFrames and want to join them on userid.

What should I do to get high performance?

Should I modify the code that writes the files so that it writes them in one of these ways:

  • partitionBy on userid (the values are very sparse).
  • partitionBy on the first n characters of userid (AFAIK, if the data is partitioned on the same key, the join happens without a shuffle); a rough write-side sketch follows this list.
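For the second option, writing the data partitioned by a userid prefix could look roughly like this; the 2-character prefix length, the userid_prefix column name, and the output path are placeholders, not anything prescribed:

    import org.apache.spark.sql.functions.{col, substring}

    // Assumption: use the first 2 characters of userid as a coarser partition key.
    val withPrefix = df1.withColumn("userid_prefix", substring(col("userid"), 1, 2))

    withPrefix.write
      .mode("overwrite")
      .partitionBy("userid_prefix")
      .parquet("/path/to/large_table_1")   // placeholder output path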

On the read side, is it better to use an RDD or a DataFrame?

You can perform a bucketBy operation before saving the Parquet files:

    import org.apache.spark.sql.SaveMode

    val numBuckets = 20

    // Write both tables bucketed on the join key, with the same number of buckets.
    df1.write
      .mode(SaveMode.Overwrite)
      .bucketBy(numBuckets, "userid")
      .saveAsTable("bucketed_large_table_1")

    df2.write
      .mode(SaveMode.Overwrite)
      .bucketBy(numBuckets, "userid")
      .saveAsTable("bucketed_large_table_2")

    spark.sql(
      """SELECT *
        |FROM bucketed_large_table_1 a
        |JOIN bucketed_large_table_2 b ON a.userid = b.userid""".stripMargin
    ).collect()

Writing the data this way prevents a shuffle when the join is performed, because both tables are bucketed on userid into the same number of buckets.
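To check that the shuffle really is gone, you can inspect the physical plan; a minimal sketch, using the bucketed table names from the example above:

    val joined = spark.table("bucketed_large_table_1").as("a")
      .join(spark.table("bucketed_large_table_2").as("b"), "userid")

    // With matching bucketing on userid, the plan should show a SortMergeJoin
    // without an Exchange (shuffle) step on either side.
    joined.explain()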

Keep in mind that for this to work you need to enable Hive support (call .enableHiveSupport() when building the SparkSession), because the plain save-to-Parquet path doesn't support the bucketBy method; bucketed output has to go through saveAsTable.
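For reference, a minimal sketch of building the session with Hive support enabled (the app name is a placeholder):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("bucketed-join")   // placeholder app name
      .enableHiveSupport()        // needed so saveAsTable can persist the bucketed tables
      .getOrCreate()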

