
An Intermediate Data Placement Algorithm For Load Balancing In Spark Computing Environment

Posted on: 2017-07-27    Degree: Master    Type: Thesis
Country: China    Candidate: X S Zhang    Full Text: PDF
GTID: 2428330488471859    Subject: Computer Science and Technology
Abstract/Summary:
With the continuous development of network information technology, the number of Internet users worldwide keeps growing, and the Internet has been put into practice in many industries, producing a massive increase in data. Analyzing and processing this vast amount of Internet data is a vital practical problem, and it also offers new opportunities for the development of distributed computing. MapReduce, proposed by Google, is a distributed parallel programming model for processing big data; it is characterized by high reliability, simple programming, and automatic parallel execution. Spark is a parallel computing framework based on distributed in-memory computing. By introducing the RDD data model and a memory-based operation mode, Spark adapts well to big-data mining scenarios and outperforms Hadoop in iterative computation, and it has quickly become a focus of attention for many enterprises and scholars. Many research institutes and companies have begun to apply Spark to the processing and study of massive data.

Although MapReduce has become an efficient and popular programming framework for parallel data processing, skew in the key distribution of intermediate data is an important bottleneck for system performance. When the data processed by MapReduce is not uniformly distributed, some tasks run slower than others, and the execution time of the whole job is determined by its slowest task. Data skew therefore produces "short-leg" (straggler) tasks, increases the completion time of the whole job, and degrades system performance. The data skew problem in MapReduce can be addressed by collecting key statistics and formulating a distribution plan in advance.

To solve the load imbalance of bucket containers in the shuffle phase of the Spark computing framework, this paper presents SCID (segmentation and combination algorithm for skewed intermediate data). Because the key distribution cannot be counted until the input data has been processed by the map tasks, this paper uses a reservoir sampling algorithm to estimate the key distribution of the intermediate data. In contrast to the original bucket-loading mechanism, SCID sorts the key/value clusters of each map task by data size and fills them into the relevant buckets in order; a cluster that exceeds the capacity of the current bucket is split, and after the bucket is filled the remaining clusters go to the next iteration. In this way, the total size of the data in each bucket is approximately equal. Since, for each map task, each reduce task fetches its intermediate results from a specific bucket, balancing the buckets balances the load across the reduce tasks.

We implement the SCID algorithm on Spark 1.1.0 and evaluate its performance extensively with standard benchmarks such as Text Search, Word Count, and Sort. Experimental results show that our algorithm not only achieves better overall load balance, but also reduces job execution time under varying degrees of data skew.
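As a rough illustration of the sampling step, the Scala sketch below shows classic reservoir sampling (Algorithm R), which keeps a uniform sample of k records from a stream of unknown length; the thesis does not publish its code, so the function name reservoirSample and the key-frequency estimation at the end are illustrative assumptions, not the author's implementation.

```scala
import scala.util.Random
import scala.collection.mutable.ArrayBuffer

// Classic reservoir sampling (Algorithm R): maintain a uniform random
// sample of k items from a stream whose total length is unknown.
def reservoirSample[T](stream: Iterator[T], k: Int, rng: Random = new Random()): Vector[T] = {
  val reservoir = ArrayBuffer.empty[T]
  var seen = 0L
  for (item <- stream) {
    seen += 1
    if (reservoir.size < k) {
      reservoir += item                       // fill the reservoir first
    } else {
      val j = (rng.nextDouble() * seen).toLong
      if (j < k) reservoir(j.toInt) = item    // replace with probability k / seen
    }
  }
  reservoir.toVector
}

// Hypothetical usage: estimate the key distribution of skewed
// intermediate data from a 1,000-record sample.
val rng = new Random(42)
val keys = Iterator.fill(100000)(if (rng.nextDouble() < 0.5) "hot" else s"key-${rng.nextInt(100)}")
val keyFrequency = reservoirSample(keys, 1000, rng)
  .groupBy(identity)
  .map { case (key, hits) => key -> hits.size }
```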
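And here is a minimal sketch of the fill-and-split placement the abstract describes, assuming cluster sizes are already known from sampling; the names KeyCluster and scidPlace, and details such as letting the last bucket absorb rounding leftovers, are invented for illustration and are not taken from the thesis.

```scala
import scala.collection.mutable.{ArrayBuffer, Queue}

// A key cluster: all key/value tuples sharing one key, with their total size.
case class KeyCluster(key: String, size: Long)

// Sort clusters by size, fill buckets in order, split any cluster that
// overflows the current bucket, and carry the remainder into a later
// round, so every bucket holds roughly total/numBuckets units of data.
def scidPlace(clusters: Seq[KeyCluster], numBuckets: Int): Vector[Vector[KeyCluster]] = {
  val total = clusters.map(_.size).sum
  val capacity = math.ceil(total.toDouble / numBuckets).toLong
  val queue = Queue(clusters.sortBy(-_.size): _*)   // largest clusters first
  Vector.tabulate(numBuckets) { b =>
    val bucket = ArrayBuffer.empty[KeyCluster]
    // The last bucket absorbs whatever the ceiling division leaves over.
    var room = if (b == numBuckets - 1) Long.MaxValue else capacity
    while (queue.nonEmpty && room > 0) {
      val c = queue.dequeue()
      if (c.size <= room) {
        bucket += c
        room -= c.size
      } else {
        bucket += c.copy(size = room)               // first segment fills this bucket
        queue.enqueue(c.copy(size = c.size - room)) // remainder goes to a later bucket
        room = 0
      }
    }
    bucket.toVector
  }
}

// Hypothetical example: 150 units of data over 3 buckets (~50 each).
// The 90-unit cluster "a" is split 50/40 across buckets 0 and 2.
val placement = scidPlace(
  Seq(KeyCluster("a", 90), KeyCluster("b", 30), KeyCluster("c", 20), KeyCluster("d", 10)),
  numBuckets = 3)
```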
Keywords/Search Tags:Spark, data skew, data sampling, MapReduce, load balancing