Research And Strategy On Data Skew Problem Based On MapReduce

Posted on:2018-11-24

Degree:Master

Type:Thesis

Country:China

Candidate:B Tong

Full Text:PDF

GTID:2348330542992623

Subject:Computer application technology

Abstract/Summary:

Cloud computing and big data have been among the most popular topics in recent years. With the continuous development of information technology, they have brought a technological revolution to all sectors of society, and the spread of the Internet has drawn many researchers into the field. Against this background, a large amount of information is generated every day, so data volumes are growing exponentially. The value hidden in this data is worth mining, and many well-known IT companies in China and abroad treat cloud computing and big data as a primary strategic priority. As an efficient and reliable parallel computing model, MapReduce has been widely applied in various fields. However, MapReduce also has limitations. When it processes unevenly distributed data, the tasks assigned to the Reducer nodes after the Map stage are unbalanced, and a bucket effect occurs: nodes with light tasks finish and become idle while nodes with heavy tasks are still computing, so the job must wait for the slowest node and the performance of the cluster is reduced. This thesis proposes a two-stage strategy to address this problem. In the first stage, a sampling preprocessing job applies a reservoir sampling algorithm to the original data, so that the distribution of keys and values in the sample can be used to estimate the distribution of the whole data set. In this preprocessing job, a consistent hashing algorithm replaces the default hash partitioning algorithm to balance the Reducer load. In the second stage, an improved Partitioner algorithm derives a reasonable partition scheme from the intermediate results of the preprocessing job. The experiments measure the overall running time and the load balance across Reducer nodes, and compare the proposed strategy with the traditional sampling method and the default hash partition function. The results show that the proposed strategy balances load better when processing skewed data in MapReduce.
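
The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not specify the improved Partitioner, so the greedy heaviest-key-first assignment below is an assumed stand-in, the consistent hashing step is omitted, and the names `reservoir_sample`, `plan_partitions`, and `partition` are hypothetical.

```python
import random
import zlib
from collections import Counter


def reservoir_sample(stream, k, rng=None):
    """Uniformly sample up to k items from a stream of unknown length (Algorithm R)."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep item i with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


def plan_partitions(sample_keys, num_reducers):
    """Assumed scheme (not from the thesis): assign keys, heaviest first,
    to the reducer with the smallest estimated load so far."""
    freq = Counter(sample_keys)
    loads = [0] * num_reducers
    assignment = {}
    # Sort by descending sampled frequency; break ties by key for determinism.
    for key, count in sorted(freq.items(), key=lambda kv: (-kv[1], str(kv[0]))):
        target = min(range(num_reducers), key=loads.__getitem__)
        assignment[key] = target
        loads[target] += count
    return assignment, loads


def partition(key, assignment, num_reducers):
    """Route a key using the planned assignment; keys that were never
    sampled fall back to plain hash partitioning."""
    if key in assignment:
        return assignment[key]
    return zlib.crc32(str(key).encode("utf-8")) % num_reducers


if __name__ == "__main__":
    # Skewed toy input: key "a" dominates.
    keys = ["a"] * 600 + ["b"] * 300 + ["c"] * 200 + ["d"] * 100
    sample = reservoir_sample(iter(keys), 120, random.Random(42))
    assignment, loads = plan_partitions(sample, 2)
    print(assignment, loads)
```

The sample stands in for the full key distribution, so the quality of the partition plan depends on the sample size; a single key that outweighs every other reducer's total load cannot be balanced by key-level assignment alone.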

Keywords/Search Tags:

MapReduce, Skewed data, Reservoir sampling, Load balancing

Related items

1	The Research Of Load Balancing In Mapreduce Based On Sampling Estimation
2	Sampling-based Partitioning In Mapreduce For Skewed Data
3	Research On Lightweight Load Balancing Under Mapreduce
4	An Intermediate Data Placement Algorithm For Load Balancing In Spark Computing Environment
5	The Research Of Skew With Sampling Technique In MapReduce
6	The Research Of Load Balancing In Mapreduce Based On Data Locality
7	Research On Optimization Of Data Load Balancing In Hadoop Clusters And Application Of Hadoop Platform
8	Load Balancing Algorithm Based On Data Skew Of MapReduce
9	Research And Implementation Of Local Priority Scheduling Algorithm Based On Mapreduce For Massive Data
10	Optimizing Data Placement Of MapReduce On Ceph-based Framework Under Load-balancing Constraint