Research And Strategy On Data Skew Problem Based On MapReduce

Posted on: 2018-11-24, Degree: Master, Type: Thesis
Country: China, Candidate: B Tong, Full Text: PDF
GTID: 2348330542992623, Subject: Computer application technology
Abstract/Summary:
Cloud computing and big data have been among the most popular topics in recent years. With the continuous development of information technology, they have brought a technological revolution to all sectors of society, and with the spread of the Internet, many researchers have taken up their study. Large amounts of information are generated every day, so data volumes are growing exponentially, and the value hidden in this data is worth mining. Many well-known domestic and foreign IT companies treat cloud computing and big data as a primary strategic priority.

MapReduce is an efficient and reliable parallel computing model and has been widely applied in various fields. It also has limitations, however. When it processes unevenly distributed data, the tasks assigned to the Reducer nodes after the Map stage are unbalanced, and a bucket effect occurs: nodes with light tasks finish and become idle while nodes with heavy tasks are still computing, so the slowest node determines the completion time of the job and the performance of the cluster is reduced.

This thesis proposes a two-stage strategy to address this problem. In the first stage, a sampling pre-processing job is started in which a reservoir sampling algorithm samples the original data, so that the distribution of keys and values in the sample can be used to estimate the distribution of the whole data set. In this sampling job, a consistent hashing algorithm replaces the default hash partitioning algorithm in order to balance the Reducer load. In the second stage, an improved Partitioner algorithm derives a reasonable partition scheme from the intermediate results of the pre-processing job. The experiments measure the overall running time and the load balance of the Reducer nodes, and compare the proposed strategy with the traditional sampling method and the default hash partition function. The results show that the proposed strategy balances load better when processing skewed data in MapReduce.
Keywords/Search Tags: MapReduce, Skewed data, Reservoir sampling, Load balancing
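
The sample-then-partition idea described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not specify the improved Partitioner, so a simple greedy heuristic (heaviest key first, assigned to the currently lightest reducer) is swapped in as an assumption, and the consistent hashing step is left out. The function names reservoir_sample, estimate_key_counts and greedy_partition are hypothetical.

```python
import random
from collections import Counter


def reservoir_sample(stream, k, rng=None):
    """Uniformly sample up to k items from a stream of unknown length (Algorithm R)."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


def estimate_key_counts(sample, total_records):
    """Scale the key frequencies seen in the sample up to the full data size."""
    if not sample:
        return {}
    scale = total_records / len(sample)
    return {key: count * scale for key, count in Counter(sample).items()}


def greedy_partition(key_counts, num_reducers):
    """Assumed heuristic, not the thesis's Partitioner: place the heaviest
    remaining key on the reducer with the smallest estimated load."""
    loads = [0.0] * num_reducers
    assignment = {}
    ordered = sorted(key_counts.items(), key=lambda kv: (-kv[1], str(kv[0])))
    for key, count in ordered:
        target = min(range(num_reducers), key=loads.__getitem__)
        assignment[key] = target
        loads[target] += count
    return assignment, loads
```

In use, the keys emitted by the Map stage would be passed through reservoir_sample, the estimated counts fed to greedy_partition, and the resulting key-to-reducer table consulted by a custom partitioner. A key-level table like this cannot split a single very frequent key across reducers, so it only helps when the skew is spread over several heavy keys.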