Research On Performance Optimization Of MapReduce Model

Posted on: 2018-10-01
Degree: Master
Type: Thesis
Country: China
Candidate: L D Ding
GTID: 2348330515464656
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of the Internet, cloud computing, and the Internet of Things, new applications such as e-commerce, e-government, and social networking have brought great convenience to people's daily lives and work. They have also made the ways in which data is generated more diverse, and the volume of data has grown explosively. The simplicity, scalability, and high efficiency of MapReduce have made it a popular model for massive data processing in the big data era. However, the existing data distribution mechanism of MapReduce easily leads to input data skew: most of the data is assigned to a few nodes, so the load differs from node to node. Most of the massive data sets that must be processed in practice are skewed, i.e., they follow a Zipf distribution, so the amount of data associated with each key is uneven. At the same time, data from the same partition can easily converge on a low-performance node, so execution time differs from node to node. For data-intensive tasks, pulling data causes heavy disk access and contention for network bandwidth, among other bottlenecks. Data skew is therefore one of the key issues in MapReduce performance optimization.

To address MapReduce data skew, this thesis proposes a load-balancing optimization mechanism based on online sampling partitioning. Before a task begins, the source data is sampled and analyzed to predict the characteristics of its distribution. Based on these characteristics, a suitable data partitioning strategy is selected dynamically. During task execution, the load on each node is monitored in real time, and the partitioning strategy is adjusted accordingly.

To improve MapReduce performance in heterogeneous environments, this thesis presents DTHE, a dynamic MapReduce scheduling method based on time-awareness of node jobs in heterogeneous environments. Before job processing, a portion of the tasks is selected as node sample tasks and issued first. While the other tasks are processed, DTHE analyzes the sample tasks, predicts node performance and the distribution of data, and dynamically adopts an appropriate scheduling strategy. It monitors the status of tasks on each node while the job runs and pulls the data for a node's next task into local memory in advance. Experimental results show that, in heterogeneous environments, DTHE reduces job execution time by 5.1% and decreases the number of I/O operations, effectively improving MapReduce performance.
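
The sampling-then-partitioning idea described above can be illustrated with a minimal Python sketch. It is not the thesis's actual mechanism: the function names `build_partition_plan` and `partition`, the `sample_rate` parameter, and the greedy "heaviest key to least-loaded reducer" heuristic are all assumptions chosen for illustration, and the runtime monitoring step is left out.

```python
import heapq
import random
import zlib
from collections import Counter


def build_partition_plan(keys, num_reducers, sample_rate=0.1, seed=0):
    """Sample the input keys and assign each sampled key to a reducer.

    Hypothetical helper: keys are placed greedily, heaviest first, on the
    reducer with the smallest estimated load so far.
    """
    rng = random.Random(seed)
    # Keep each key with probability sample_rate to estimate the distribution.
    sample = [k for k in keys if rng.random() < sample_rate]
    freq = Counter(sample)

    # Min-heap of (estimated load, reducer id).
    heap = [(0, r) for r in range(num_reducers)]
    heapq.heapify(heap)

    plan = {}
    for key, count in freq.most_common():
        load, reducer = heapq.heappop(heap)
        plan[key] = reducer
        heapq.heappush(heap, (load + count, reducer))
    return plan


def partition(key, plan, num_reducers):
    """Return the reducer for a key, falling back to a stable hash
    for keys that never appeared in the sample."""
    if key in plan:
        return plan[key]
    return zlib.crc32(str(key).encode("utf-8")) % num_reducers


if __name__ == "__main__":
    # Skewed toy input: one heavy key and two lighter ones.
    keys = ["a"] * 50 + ["b"] * 30 + ["c"] * 20
    plan = build_partition_plan(keys, num_reducers=2, sample_rate=1.0)
    print(plan)
```

With plain hash partitioning, a heavy key and several lighter keys can land on the same reducer by chance. The sketch uses the sampled frequencies to spread the estimated load before the job starts.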
Keywords/Search Tags: MapReduce, dynamic scheduling, data skew, sampling partition, performance optimization