
Research on MapReduce Data Skew and Task Scheduling in Heterogeneous Environments

Posted on: 2021-03-07
Degree: Master
Type: Thesis
Country: China
Candidate: M Y Jia
Full Text: PDF
GTID: 2428330629450532
Subject: Software engineering
Abstract/Summary:
With the development of the Internet, the Internet of Things, and artificial intelligence, cloud computing and big data technology have made great progress. The Hadoop platform is a key technology for cloud computing and big data, and MapReduce is an effective and widely used framework for processing large data sets in parallel on a Hadoop cluster. However, the MapReduce framework is inefficient when processing skewed data in a heterogeneous environment. This thesis analyzes the root causes of that inefficiency and finds that data skew, cluster heterogeneity, and network traffic are the three main problems that significantly affect the performance of MapReduce applications.

To address the load imbalance that arises when MapReduce processes skewed data, a two-stage partitioning method that combines parallel similar random sampling with a greedy algorithm is proposed to replace the default hash partitioning strategy. The sampling stage builds on Hadoop random sampling to predict the key distribution quickly and with low error, within the error range set by a given sampling ratio or confidence level. It uses the parallelism of the MapReduce framework to lower the cost of sampling, and it uses an argMin() function to determine an appropriate sampling rate, so that sampling cost falls while sampling accuracy improves. The intermediate data is then partitioned with a greedy algorithm so that the load on the reduce nodes tends toward the average, which achieves load balancing. In a homogeneous environment, WordCount experiments compare the proposed method with the original hash algorithm and the cluster splitting algorithm under different degrees of data skew. The experimental results show that the proposed random sampling and greedy partitioning method achieves good sampling similarity and accuracy at minimal sampling cost. Compared with the baseline algorithms, it reduces the skew of the reduce-side data, achieves load balancing, and shortens job execution time.

To address the performance degradation of MapReduce in a heterogeneous environment, a dynamic smooth weighted polling scheduling algorithm for reduce nodes is proposed. The algorithm takes each node's computing power and its local data transfer in the shuffle stage as the node's weight, so that the performance of the reduce nodes is considered comprehensively, and it selects reduce nodes evenly through smooth weighted polling. In a heterogeneous environment, the algorithm is compared with FIFO and the SARS adaptive algorithm in WordCount and image processing experiments under different degrees of data skew. The experimental results show that, in the heterogeneous environment, the algorithm not only balances the load across reduce nodes but also reduces network traffic in the shuffle stage and task execution time.
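The sampling and greedy partitioning idea described in the abstract can be illustrated with a small Python sketch. The abstract does not give the thesis's actual algorithm, so everything here is an assumption made for illustration: the function names `estimate_key_counts` and `greedy_partition` are hypothetical, the sampling is a plain single-process Bernoulli sample rather than a parallel MapReduce job, and the greedy rule shown is the common "largest key first, least-loaded reducer next" heuristic, which may differ from the author's method.

```python
import heapq
import random
from collections import Counter


def estimate_key_counts(keys, sample_rate, seed=0):
    """Estimate per-key record counts from a random sample (hypothetical helper)."""
    rng = random.Random(seed)
    # Keep each record with probability sample_rate.
    sampled = Counter(k for k in keys if rng.random() < sample_rate)
    # Scale the sampled counts back up to the size of the full data set.
    return {k: c / sample_rate for k, c in sampled.items()}


def greedy_partition(key_counts, num_reducers):
    """Assign keys to reducers, largest key first, always to the lightest reducer.

    Assumption: this is a generic greedy heuristic, not the thesis's algorithm.
    """
    heap = [(0, r) for r in range(num_reducers)]  # (current load, reducer id)
    heapq.heapify(heap)
    assignment = {}
    # Sort by descending count; break ties by key for a deterministic result.
    for key, count in sorted(key_counts.items(), key=lambda kv: (-kv[1], kv[0])):
        load, reducer = heapq.heappop(heap)
        assignment[key] = reducer
        heapq.heappush(heap, (load + count, reducer))
    loads = [0] * num_reducers
    for key, reducer in assignment.items():
        loads[reducer] += key_counts[key]
    return assignment, loads


# Skewed key distribution: key "a" alone holds half of the records.
counts = {"a": 50, "b": 20, "c": 15, "d": 10, "e": 5}
assignment, loads = greedy_partition(counts, num_reducers=2)
print(loads)  # [50, 50]
```

In this toy input the heavy key gets a reducer to itself and the remaining keys share the other one, so both reducers end up with the same load. A hash partitioner offers no such guarantee, because it places keys without regard to their sizes.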
Keywords/Search Tags: Hadoop, MapReduce, Greedy algorithm partition, Parallel sampling, Smooth weighted polling algorithm
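The smooth weighted polling (round-robin) selection named in the abstract and keywords can be sketched as follows. This is a minimal illustration of the widely known smooth weighted round-robin rule, not the thesis's implementation. The class name `SmoothWeightedScheduler`, the method `set_weight`, and the weighting formula in `node_weight` (a linear mix of computing power and shuffle-stage data locality controlled by `alpha`) are all hypothetical, since the abstract does not say how the two factors are combined.

```python
def node_weight(compute_power, local_fraction, alpha=0.5):
    """Hypothetical weight: linear mix of computing power and data locality."""
    return alpha * compute_power + (1 - alpha) * local_fraction


class SmoothWeightedScheduler:
    """Smooth weighted round-robin over reduce nodes (illustrative sketch)."""

    def __init__(self, weights):
        self.weights = dict(weights)                 # node -> effective weight
        self.current = {n: 0 for n in self.weights}  # node -> running score

    def set_weight(self, node, weight):
        """Update a node's weight at run time (the 'dynamic' part)."""
        self.weights[node] = weight
        self.current.setdefault(node, 0)

    def pick(self):
        total = sum(self.weights.values())
        # Every node earns its weight on each round.
        for node, weight in self.weights.items():
            self.current[node] += weight
        # The node with the highest running score is chosen (first one on ties)...
        chosen = max(self.weights, key=lambda n: self.current[n])
        # ...and pays back the total weight, which spreads its picks out.
        self.current[chosen] -= total
        return chosen


scheduler = SmoothWeightedScheduler({"a": 5, "b": 1, "c": 1})
print([scheduler.pick() for _ in range(7)])
# ['a', 'a', 'b', 'a', 'c', 'a', 'a']
```

Over seven picks, node "a" is chosen five times and "b" and "c" once each, in proportion to the weights, but the picks of "a" are interleaved with the others instead of arriving as one burst. That interleaving is the property that lets a scheduler spread reduce tasks evenly while still favoring stronger nodes.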