How to improve Spark performance when joining DataFrames on a known key


I have 2 big Parquet DataFrames and I want to join them on userid.

What would give the best performance?

Should I modify the code that writes the files in order to either:

  • partitionBy on userid (which is very sparse), or
  • partitionBy on the first n characters of userid (as far as I know, if the data is partitioned on the same key, the join happens without a shuffle) — see the sketch after this list.
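For concreteness, a minimal sketch of what the second option could look like on the write side, assuming df1 and df2 are the two DataFrames; the prefix length and output paths are purely illustrative assumptions:

import org.apache.spark.sql.functions.{col, substring}

val n = 2  // assumed prefix length; tune to your key distribution

// derive a coarser key from the first n characters of userid and
// partition the written files by it
df1.withColumn("userid_prefix", substring(col("userid"), 1, n))
  .write
  .partitionBy("userid_prefix")
  .parquet("/tmp/df1_by_prefix")  // hypothetical output path

df2.withColumn("userid_prefix", substring(col("userid"), 1, n))
  .write
  .partitionBy("userid_prefix")
  .parquet("/tmp/df2_by_prefix")  // hypothetical output path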

On the read side, is it better to use an RDD or a DataFrame?

You can perform a bucketBy operation before saving the Parquet data:

import org.apache.spark.sql.SaveMode

val num_buckets = 20

df1.write
  .mode(SaveMode.Overwrite)
  .bucketBy(num_buckets, "userid")
  .saveAsTable("bucketed_large_table_1")

df2.write
  .mode(SaveMode.Overwrite)
  .bucketBy(num_buckets, "userid")
  .saveAsTable("bucketed_large_table_2")

spark.sql("select * from bucketed_large_table_1 a join bucketed_large_table_2 b on a.userid = b.userid").collect()

Doing it this way prevents a shuffle when the join operation is performed.
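To sanity-check that the bucketing is actually picked up, here is a sketch (assuming the two tables written above) that reads them back and inspects the physical plan; with matching bucketing on the join key there should be no Exchange step feeding the join:

val a = spark.table("bucketed_large_table_1")
val b = spark.table("bucketed_large_table_2")

// look for a SortMergeJoin with no Exchange (shuffle) on either side
a.join(b, Seq("userid")).explain()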

Keep in mind that in order to do this you need to enable Hive support with .enableHiveSupport(), since the plain save-to-Parquet operation doesn't support the bucketBy method.
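A minimal sketch of such a session, assuming a default local setup and an arbitrary application name:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("bucketed-join")  // app name is arbitrary
  .enableHiveSupport()       // needed so saveAsTable persists the bucketing metadata
  .getOrCreate()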

