How to improve Spark performance when joining DataFrames on a known key
I have two big Parquet DataFrames and I want to join them on userid.
Which approach would give the best performance? Should I modify the code that writes the files so that it either:

1. calls partitionBy on userid (which is very sparse), or
2. calls partitionBy on the first N characters of userid (AFAIK, if both datasets are partitioned on the same key, the join can happen without a shuffle)?

A sketch of the second option is shown below.
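As a rough sketch of what I mean by the second option (the prefix length of 2, the column name userid_prefix, and the output path are placeholders, not something I have settled on):

import org.apache.spark.sql.functions.{col, substring}

// Derive a prefix column from userid and partition the output by it.
// substring uses 1-based positions, so (1, 2) takes the first two characters.
val withPrefix = df1.withColumn("userid_prefix", substring(col("userid"), 1, 2))

withPrefix.write
  .mode("overwrite")
  .partitionBy("userid_prefix")
  .parquet("/path/to/output")  // placeholder path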
On the read side, is it better to use the RDD API or DataFrames?
You can perform a bucketBy operation before saving the Parquet files.
import org.apache.spark.sql.SaveMode

val num_buckets = 20

// Write both tables bucketed on the join key so the buckets line up.
df1.write
  .mode(SaveMode.Overwrite)
  .bucketBy(num_buckets, "userid")
  .saveAsTable("bucketed_large_table_1")

df2.write
  .mode(SaveMode.Overwrite)
  .bucketBy(num_buckets, "userid")
  .saveAsTable("bucketed_large_table_2")

// Join the two bucketed tables on userid.
spark.sql(
  """SELECT *
    |FROM bucketed_large_table_1 a
    |JOIN bucketed_large_table_2 b ON a.userid = b.userid""".stripMargin
).collect()
Doing it this way prevents a shuffle when the join operation is performed.
Keep in mind that you need to enable Hive support (.enableHiveSupport on the SparkSession builder) and save the data with saveAsTable, because the plain save-to-Parquet operation doesn't support the bucketBy method.
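A minimal sketch of the read side, assuming the two bucketed tables written above and an app name that is just a placeholder; explain() lets you check the physical plan to confirm the shuffle is gone:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("bucketed-join")  // placeholder app name
  .enableHiveSupport()       // required so bucketBy + saveAsTable go through the Hive catalog
  .getOrCreate()

// Read the bucketed tables back from the catalog.
val a = spark.table("bucketed_large_table_1")
val b = spark.table("bucketed_large_table_2")

val joined = a.join(b, "userid")

// If bucketing is picked up, the plan should show no Exchange (shuffle)
// on either side of the SortMergeJoin.
joined.explain()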