Font Size: a A A

Optimization Scheme And Implementation Of Join Operation In Spark Computing Engine

Posted on:2021-05-13Degree:MasterType:Thesis
Country:ChinaCandidate:L M ZhaoFull Text:PDF
GTID:2428330623981445Subject:Software engineering
Abstract/Summary:PDF Full Text Request
Spark is a distributed system based on the Map/Reduce computing model.For large-scale data processing,each task will be split into many Map processes and Reduce processes,which are executed in parallel on each node.Shuffle operation is a bridge for connecting Map process and Reduce process,which will generate a lot of network transmission overhead and I/O overhead.Spark defines broadcast threshold parameters.For scenarios where the size of one of joining tables is less than the threshold,Spark uses Broadcast Hash Join algorithm,which avoids the Shuffle operation of the two tables.However,when performing outer join,Spark does not make full use of the relationship between the size of the effective matching tuples in the two joining tables and the broadcast threshold,which limits the use of Broadcast Hash Join.During the Join operation between two large tables,if the Join column of the two tables do not match exactly,the Sort Merge Join algorithm in Spark will perform a Shuffle operation on a large amount of data,which seriously affects the execution efficiency.In response to the above problems,this thesis optimizes the above two algorithms based on the Semi Join.The main work is as follows:(1)Improve the applicability of Broadcast Hash Join algorithm in outer join queries.Based on Semi Join,this thesis proposes a join implementation algorithm called Semi Broadcast Hash Join.Taking left outer join as an example,define the left table as base table and the right table as outer table.If the data size of the outer table is greater than the broadcast threshold,but the data size of the effective matching tuples of the outer table is smaller than the broadcast threshold,then use Semi Join algorithm to optimize Join operation.First,optimization plan uses Semi Join to construct a HashMap on the Join column data of the base table.Then,it filters the outer table by the HashMap.Finally,it performs Broadcast Hash Join between the base table and the filtered data set of the outer table.We define the above optimization plan as Semi Broadcast Hash Join.In this scenario,you can effectively use the advantages of the Broadcast Hash Join algorithm to avoid Shuffle operations.(2)Based on Semi Join,this thesis proposes a join implementation algorithm called Semi Sort Merge Join.By filtering the data in the right table through the HashMap constructed from the join data of the left table,it can effectively reduce the amount of data needed to be transferred during the shuffle operation.At the same time,the parallelism of Join operation,which is the number of partitions of Shuffle,is dynamically set according to the cluster configuration.(3)Use TPC-H data set to perform performance test on the optimization scheme of the above two algorithms.In the case where the amount of outer data is greater than the broadcast threshold,but the size of tuples in outer table that matches the base table is less than the broadcast threshold,Semi Broadcast Hash Join has better performance than Sort Merge Join.Experimental results show that the maximum performance improvement after optimization is about 25%.For Join operations between large tables where the Join column data does not match exactly,the Semi Sort Merge Join algorithm can effectively reduce the cost of the Shuffle operation.The smaller the amount of matching data between the right table and the left table,the more obvious the optimization effect of the algorithm.Experiments results show that the maximum performance improvement after optimization is about 20%.
Keywords/Search Tags:Spark, Semi Join, Shuffle
PDF Full Text Request
Related items