
An Intermediate Data Placement Algorithm For Load Balancing In Spark Computing Environment

Posted on: 2017-07-27    Degree: Master    Type: Thesis
Country: China    Candidate: X S Zhang    Full Text: PDF
GTID: 2428330488471859    Subject: Computer Science and Technology
Abstract/Summary:
With the continuous development of network information technology, the number of Internet users worldwide keeps growing, and the Internet has been put into practice in many industries, producing a massive increase in data. Analyzing and processing this vast amount of Internet data is a vital practical problem, and it also offers new opportunities for the development of distributed computing. MapReduce, proposed by Google, is a distributed parallel programming model for processing big data; it is characterized by high reliability, simple programming, and automatic parallel execution. Spark is a parallel computing framework based on distributed in-memory computing. By introducing the RDD data model and a memory-based operation mode, Spark adapts well to big-data mining scenarios and outperforms Hadoop in iterative computation, and it has quickly become a focus of attention for many enterprises and scholars. Many research institutes and companies have begun to apply Spark to the processing and study of massive data.

Although MapReduce has become an efficient and popular programming framework for parallel data processing, skew in the key distribution of intermediate data is an important bottleneck for system performance. When the data processed by MapReduce is not uniformly distributed, some tasks run slower than others, and the execution time of the whole job is determined by its slowest task. Data skew therefore produces "short-leg" (straggler) tasks, increases the completion time of the whole job, and degrades system performance. The data skew problem in MapReduce can be addressed by collecting key statistics and formulating a distribution plan in advance.

To solve the load imbalance of bucket containers in the shuffle phase of the Spark computing framework, this paper presents SCID (segmentation and combination algorithm for skewed intermediate data). Because the key distribution cannot be counted until the input data has been processed by the map tasks, this paper uses a reservoir sampling algorithm to estimate the key distribution of the intermediate data. In contrast to the original bucket-loading mechanism, SCID sorts the key/value clusters of each map task by data size and fills them into the relevant buckets in order; a cluster that exceeds the capacity of the current bucket is split, and after the bucket is filled the remaining clusters go to the next iteration. In this way, the total size of the data in each bucket is approximately equal. Since, for each map task, each reduce task fetches its intermediate results from a specific bucket, balancing the buckets balances the load across the reduce tasks.

We implement the SCID algorithm on Spark 1.1.0 and evaluate its performance extensively with standard benchmarks such as Text Search, Word Count, and Sort. Experimental results show that our algorithm not only achieves better overall load balance, but also reduces job execution time under varying degrees of data skew.
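As a rough illustration of the sampling step, the Scala sketch below shows classic reservoir sampling (Algorithm R), which keeps a uniform sample of k records from a stream of unknown length; the thesis does not publish its code, so the function name reservoirSample and the key-frequency estimation at the end are illustrative assumptions, not the author's implementation.

```scala
import scala.util.Random
import scala.collection.mutable.ArrayBuffer

// Classic reservoir sampling (Algorithm R): maintain a uniform random
// sample of k items from a stream whose total length is unknown.
def reservoirSample[T](stream: Iterator[T], k: Int, rng: Random = new Random()): Vector[T] = {
  val reservoir = ArrayBuffer.empty[T]
  var seen = 0L
  for (item <- stream) {
    seen += 1
    if (reservoir.size < k) {
      reservoir += item                       // fill the reservoir first
    } else {
      val j = (rng.nextDouble() * seen).toLong
      if (j < k) reservoir(j.toInt) = item    // replace with probability k / seen
    }
  }
  reservoir.toVector
}

// Hypothetical usage: estimate the key distribution of skewed
// intermediate data from a 1,000-record sample.
val rng = new Random(42)
val keys = Iterator.fill(100000)(if (rng.nextDouble() < 0.5) "hot" else s"key-${rng.nextInt(100)}")
val keyFrequency = reservoirSample(keys, 1000, rng)
  .groupBy(identity)
  .map { case (key, hits) => key -> hits.size }
```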
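And here is a minimal sketch of the fill-and-split placement the abstract describes, assuming cluster sizes are already known from sampling; the names KeyCluster and scidPlace, and details such as letting the last bucket absorb rounding leftovers, are invented for illustration and are not taken from the thesis.

```scala
import scala.collection.mutable.{ArrayBuffer, Queue}

// A key cluster: all key/value tuples sharing one key, with their total size.
case class KeyCluster(key: String, size: Long)

// Sort clusters by size, fill buckets in order, split any cluster that
// overflows the current bucket, and carry the remainder into a later
// round, so every bucket holds roughly total/numBuckets units of data.
def scidPlace(clusters: Seq[KeyCluster], numBuckets: Int): Vector[Vector[KeyCluster]] = {
  val total = clusters.map(_.size).sum
  val capacity = math.ceil(total.toDouble / numBuckets).toLong
  val queue = Queue(clusters.sortBy(-_.size): _*)   // largest clusters first
  Vector.tabulate(numBuckets) { b =>
    val bucket = ArrayBuffer.empty[KeyCluster]
    // The last bucket absorbs whatever the ceiling division leaves over.
    var room = if (b == numBuckets - 1) Long.MaxValue else capacity
    while (queue.nonEmpty && room > 0) {
      val c = queue.dequeue()
      if (c.size <= room) {
        bucket += c
        room -= c.size
      } else {
        bucket += c.copy(size = room)               // first segment fills this bucket
        queue.enqueue(c.copy(size = c.size - room)) // remainder goes to a later bucket
        room = 0
      }
    }
    bucket.toVector
  }
}

// Hypothetical example: 150 units of data over 3 buckets (~50 each).
// The 90-unit cluster "a" is split 50/40 across buckets 0 and 2.
val placement = scidPlace(
  Seq(KeyCluster("a", 90), KeyCluster("b", 30), KeyCluster("c", 20), KeyCluster("d", 10)),
  numBuckets = 3)
```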
Keywords/Search Tags:Spark, data skew, data sampling, MapReduce, load balancing