
Implementation And Optimization For Join Operation In Spark

Posted on: 2017-09-30
Degree: Master
Type: Thesis
Country: China
Candidate: W H Zhang
Full Text: PDF
GTID: 2348330536467690
Subject: Software engineering
Abstract/Summary:
The emergence of smartphones, tablets, wearables, and IoT devices is generating data on a large scale and has brought people into the big data era. How to process such large and complex data efficiently has become a pressing problem. A platform for processing big data usually contains several important components, such as storage, a cluster scheduler, fault tolerance, a data processing engine, and calculation algorithms. Among them, the data processing engine is the core component. Spark originated at AMPLab in 2009. Compared with Hadoop, which is based on disk I/O, Spark's memory-based calculation model has a great performance advantage, especially in iterative calculations. Built on Spark Core, components such as Spark Streaming, Spark SQL, MLlib, GraphX, and SparkR make up the Spark ecosystem. Spark SQL was introduced to support structured data processing. It translates SQL statements into Spark tasks and is compatible with Hive.

Based on extensive source code reading and testing, this thesis analyzes the implementation of Spark SQL in depth and selects JOIN algorithms as its main research subject. After analyzing the three classes of JOIN in Spark SQL (Inner Join, Semi Join, and Outer Join), this thesis proposes an algorithm called SelectedBroadcastHashOuterJoin, which is better suited to JOIN operations involving a small table, and an algorithm called SortMergeOuterJoin, which supports non-equi outer joins. This thesis also proposes a hash-based idea for removing duplicate data from a broadcast table and a Bloom-filter-based idea for filtering out invalid data in two big tables for equi-joins. We design these algorithms and optimization ideas and then implement them in Spark SQL.

Using a Spark cluster set up on Aliyun, we test the performance of the above algorithms and optimization ideas with test data. The results show that SelectedBroadcastHashOuterJoin outperforms HashOuterJoin by about 20% in outer join operations, and that SortMergeOuterJoin effectively supports non-equi outer joins while maintaining performance similar to that of the existing platform. The hash-based optimization for removing duplicate data from a broadcast table significantly improves efficiency when there are many duplicates, and the Bloom-filter-based idea for filtering invalid data in two big tables for equi-joins improves efficiency by about 10%. Finally, we summarize related work and directions for future work.
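The abstract does not give implementation details, so the following is only a minimal sketch of the general Bloom-filter pre-filtering idea for an equi-join, written in plain Python rather than Spark. Everything in it is an assumption made for illustration: the names `BloomFilter` and `bloom_filtered_inner_join`, the bit-array size, the hash count, and the choice to filter only one side of the join. It is not the thesis's implementation and not part of Spark SQL.

```python
import hashlib


class BloomFilter:
    """Tiny illustrative Bloom filter: no false negatives, some false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        # Derive num_hashes bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key!r}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


def bloom_filtered_inner_join(left, right, num_bits=1024, num_hashes=3):
    """Equi-join two lists of (key, value) rows.

    Hypothetical helper: a Bloom filter built over the keys of `left`
    discards rows of `right` that cannot match before the join runs.
    Returns the joined rows and the number of `right` rows that survived.
    """
    bloom = BloomFilter(num_bits, num_hashes)
    for key, _ in left:
        bloom.add(key)

    # Rows whose key is definitely absent from `left` are dropped early.
    survivors = [(k, v) for k, v in right if bloom.might_contain(k)]

    # Ordinary hash join on the survivors; the exact key lookup removes
    # any false positives that slipped through the Bloom filter.
    table = {}
    for k, v in left:
        table.setdefault(k, []).append(v)
    joined = [(k, lv, rv) for k, rv in survivors for lv in table.get(k, [])]
    return joined, len(survivors)
```

In a distributed setting the benefit of such a filter would come from shrinking the data that has to be shuffled before the join; this single-process sketch only shows the filtering logic, and the performance figures quoted in the abstract come from the thesis's own experiments, not from this code.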
Keywords/Search Tags: big data, Spark, Spark SQL, JOIN