Implementation And Optimization For Join Operation In Spark

Posted on:2017-09-30

Degree:Master

Type:Thesis

Country:China

Candidate:W H Zhang

Full Text:PDF

GTID:2348330536467690

Subject:Software engineering

Abstract/Summary:


The spread of smartphones, tablets, wearables, and IoT devices is generating data at very large scale and has brought about the big data era. Processing such large, complex data efficiently has become a pressing problem. A big data processing platform usually contains several important components, such as storage, a cluster scheduler, fault tolerance, a data processing engine, and computation algorithms; among them, the data processing engine is the core. Spark originated at AMPLab in 2009. Compared with Hadoop, which relies heavily on disk I/O, Spark's memory-based computation model has a large performance advantage, especially for iterative computations. Built on Spark Core, components such as Spark Streaming, Spark SQL, MLlib, GraphX, and SparkR make up the Spark ecosystem. Spark SQL was introduced to support structured data processing: it translates SQL statements into Spark tasks and is compatible with Hive.

Through extensive source code reading and testing, this thesis analyzes the implementation of Spark SQL in depth and takes JOIN algorithms as its main research subject. After analyzing the three classes of JOIN in Spark SQL (Inner Join, Semi Join, and Outer Join), it proposes two algorithms: SelectedBroadcastHashOuterJoin, which is better suited to JOIN operations involving a small table, and SortMergeOuterJoin, which supports non-equi outer joins. It also proposes two optimizations: a hash-based method for removing duplicate data from a broadcast table, and a Bloom-filter-based method for filtering out invalid data in two large tables before an equi-join.

We design these algorithms and optimizations and implement them in Spark SQL. On a Spark cluster deployed on Aliyun, we evaluate them with test data. The results show that SelectedBroadcastHashOuterJoin outperforms HashOuterJoin by about 20% on outer joins, and that SortMergeOuterJoin effectively supports non-equi outer joins while maintaining performance similar to that of the existing platform. Hash-based duplicate removal in the broadcast table improves efficiency significantly when there are many duplicates, and Bloom-filter-based filtering of invalid data in two large tables for equi-joins improves efficiency by about 10%. Finally, we summarize related work and outline directions for future work.
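The Bloom-filter idea mentioned in the abstract can be illustrated with a small single-machine sketch. It is not the thesis's Spark SQL implementation: it is plain Python, it filters only one side of the join (the abstract describes filtering both large tables), and every name and parameter in it (`BloomFilter`, `bloom_filtered_equi_join`, `num_bits`, `num_hashes`) is hypothetical and chosen only for illustration. The point it shows is that a Bloom filter never produces false negatives, so rows it rejects can be dropped safely before the join, while any false positives are discarded by the join itself.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, some false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        # Sizes are illustrative assumptions, not values from the thesis.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting one hash with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key!r}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


def bloom_filtered_equi_join(left, right, left_key, right_key):
    """Inner equi-join of two lists of dicts, pre-filtering `right`.

    Returns the joined rows and the number of right rows that survived
    the Bloom filter.
    """
    bloom = BloomFilter()
    for row in left:
        bloom.add(row[left_key])

    # Drop right rows whose key is definitely absent from the left side.
    candidates = [row for row in right if bloom.might_contain(row[right_key])]

    # Ordinary hash join on the survivors; false positives are rejected here.
    index = {}
    for row in left:
        index.setdefault(row[left_key], []).append(row)

    joined = []
    for r in candidates:
        for l in index.get(r[right_key], []):
            joined.append({**l, **r})
    return joined, len(candidates)
```

In a distributed setting the benefit would come from discarding non-matching rows before they are shuffled across the network; this sketch only demonstrates the correctness of the filtering step, not the performance gain reported in the abstract.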

Keywords/Search Tags:

big data, Spark, Spark SQL, JOIN


Related items

1	Research On Query Analysis And Optimization Based On Spark System
2	Optimizing Big Data Equi-join In Spark And Its Application In Analysis Of Network Traffic Data
3	Optimization Scheme And Implementation Of Join Operation In Spark Computing Engine
4	Research On Cardinalities Estimation Of Two Table For Join Operator Based On Spark SQL Platform
5	Research On Optimizing Top-k Join Queries Based On Spark
6	Research On Equi-Join Optimization Algorithms On Spark SQL
7	Research And Implementation Of Data Hybrid Computing Platform Based On Spark
8	Structured Data Processing And Performance Optimization Of Spark SQL
9	Research And Implementation Of Economic Dynamic Management System Based On Spark Technology
10	Spark-based Massive Data Analysis And Performance Optimization