How to improve Spark join performance when joining DataFrames on a known key
I have two big Parquet DataFrames and I want to join them on userid.
What would give the best performance? Should I modify the code that writes the files so that it:
1. partitionBy on userid (which is very sparse), or
2. partitionBy on the first n characters of userid (a rough sketch of this option follows below)? AFAIK, if the data is partitioned on the same key, the join can happen without a shuffle.
On the read side, is it better to use an RDD or a DataFrame?
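For reference, a minimal sketch of what write-side option 2 could look like (the DataFrame name df1, the output path, and the 2-character prefix length are purely illustrative assumptions):

import org.apache.spark.sql.functions.{col, substring}

// Option 2 from the question: partition the written files by a short,
// low-cardinality prefix of userid instead of the very sparse userid itself.
val withPrefix = df1.withColumn("userid_prefix", substring(col("userid"), 1, 2))

withPrefix.write
  .mode("overwrite")
  .partitionBy("userid_prefix")   // one output directory per prefix value
  .parquet("/data/large_table_1_by_prefix")

Note that partitionBy only controls the directory layout of the written files; on its own it does not guarantee a shuffle-free join, which is what the bucketBy answer below addresses.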
You can perform a bucketBy operation before saving the Parquet files:

import org.apache.spark.sql.SaveMode

val num_buckets = 20

df1.write
  .mode(SaveMode.Overwrite)
  .bucketBy(num_buckets, "userid")
  .saveAsTable("bucketed_large_table_1")

df2.write
  .mode(SaveMode.Overwrite)
  .bucketBy(num_buckets, "userid")
  .saveAsTable("bucketed_large_table_2")

spark.sql(
  """select *
     from bucketed_large_table_1 a
     join bucketed_large_table_2 b
     on a.userid = b.userid""").collect()

Done this way, no shuffle is needed when the join is performed, because both tables are bucketed on the join key with the same number of buckets.
Keep in mind that to do this you need to enable Hive support (.enableHiveSupport() when building the SparkSession), because the plain save-to-Parquet operation does not support the bucketBy method; you have to use saveAsTable.
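Putting it together, a minimal sketch of the read and verification side (assuming the two bucketed tables created above; the app name is illustrative):

import org.apache.spark.sql.SparkSession

// bucketBy + saveAsTable needs a Hive-enabled session, as noted above
val spark = SparkSession.builder()
  .appName("bucketed-join")
  .enableHiveSupport()
  .getOrCreate()

// Read the bucketed tables back through the catalog (not spark.read.parquet,
// which would lose the bucketing metadata)
val a = spark.table("bucketed_large_table_1")
val b = spark.table("bucketed_large_table_2")

// Joining on the bucketed column: the physical plan should show no Exchange
// (shuffle) on either side, since both inputs use the same bucketing
a.join(b, Seq("userid")).explain()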