Research On Performance Optimization Of MapReduce Model

Posted on:2018-10-01

Degree:Master

Type:Thesis

Country:China

Candidate:L D Ding

Full Text:PDF

GTID:2348330515464656

Subject:Computer Science and Technology

Abstract/Summary:

With the rapid development of the Internet, cloud computing and the Internet of Things, new applications such as e-commerce, e-government and social networking have brought great convenience to people's daily lives and work. They have also made the ways in which data is generated more diverse, and the amount of data has grown explosively. The simplicity, scalability and high efficiency of MapReduce have made it a popular model for processing massive data in the big data era. However, the existing data distribution mechanism of MapReduce easily leads to input data skew, in which a few nodes receive most of the data, so that the load differs from node to node. Most of the massive data that needs to be processed in real life is skewed, that is, it follows a Zipf distribution, so some keys are associated with far more records than others. At the same time, data belonging to the same partition can easily end up concentrated on a low-performance node, which causes the execution time of each node to differ. For data-intensive tasks, pulling data causes heavy disk access and contention for network bandwidth, among other bottlenecks. Data skew is therefore one of the key issues in MapReduce performance optimization.

To address MapReduce data skew, this thesis proposes a load-balancing optimization mechanism for MapReduce based on online sampling partitioning. Before the task begins, the source data is sampled and analyzed to predict the characteristics of its distribution. According to these characteristics, different data partitioning optimization strategies are invoked dynamically. During the execution of the task, the load of each node is monitored in real time, and the corresponding data partitioning strategy is adjusted dynamically.

To improve the performance of MapReduce in heterogeneous environments, the thesis also presents DTHE, a dynamic MapReduce scheduling method for heterogeneous environments based on time awareness of node jobs. Before job processing, some of the tasks are selected as node sample tasks and issued first. While the other tasks are being processed, DTHE analyzes the sample tasks, predicts node performance and the distribution of the data, and dynamically chooses an appropriate scheduling strategy. It also monitors the task status of each node while the job runs and pulls the data for a node's next task into local memory in advance. Experimental results show that, in heterogeneous environments, DTHE reduces job execution time by 5.1% and decreases the number of I/O operations, effectively improving the performance of MapReduce.
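The sampling-then-partitioning idea described in the abstract can be illustrated with a small Python sketch. The abstract does not give the thesis's actual partitioning strategies, so everything here is an assumption made for illustration: the function names are hypothetical, the sample is a simple every-n-th-record sample, and the assignment rule is a generic greedy one (heaviest sampled key goes to the currently lightest reducer). Modulo partitioning on integer keys stands in for a default hash partitioner.

```python
import heapq
from collections import Counter


def make_skewed_records(num_keys, scale):
    """Build (key, value) records where key k appears about scale / k times,
    a rough Zipf-like shape. Purely synthetic test data."""
    records = []
    for k in range(1, num_keys + 1):
        records.extend((k, None) for _ in range(max(1, scale // k)))
    return records


def sample_key_counts(records, step):
    """Count keys in every `step`-th record (a systematic sample)."""
    return Counter(key for i, (key, _) in enumerate(records) if i % step == 0)


def build_partition_plan(sampled_counts, num_reducers):
    """Greedy plan (an illustrative assumption, not the thesis's method):
    visit keys from heaviest to lightest sampled count and give each one
    to the reducer with the smallest estimated load so far."""
    heap = [(0, r) for r in range(num_reducers)]  # (estimated load, reducer id)
    heapq.heapify(heap)
    plan = {}
    ordered = sorted(sampled_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for key, count in ordered:
        load, reducer = heapq.heappop(heap)
        plan[key] = reducer
        heapq.heappush(heap, (load + count, reducer))
    return plan


def reducer_loads(records, num_reducers, assign):
    """Number of records each reducer receives under the `assign` function."""
    loads = [0] * num_reducers
    for key, _ in records:
        loads[assign(key)] += 1
    return loads


if __name__ == "__main__":
    num_reducers = 4
    records = make_skewed_records(num_keys=20, scale=1000)
    plan = build_partition_plan(sample_key_counts(records, step=10), num_reducers)

    # Keys missing from the sample fall back to modulo partitioning.
    balanced = reducer_loads(records, num_reducers,
                             lambda k: plan.get(k, k % num_reducers))
    hashed = reducer_loads(records, num_reducers, lambda k: k % num_reducers)
    print("modulo partitioning loads: ", hashed)
    print("sampled greedy plan loads: ", balanced)
```

Because all records with the same key must still reach the same reducer, a plan of this kind cannot push the largest load below the size of the single heaviest key. It only avoids stacking further keys on top of that reducer.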

Keywords/Search Tags:

MapReduce, dynamic scheduling, data skew, sampling partition, performance optimization

Related items

1	Research Of MapReduce Data Skew And Task Scheduling In Heterogeneous Environments
2	Research On Partition Loading Balance Based On Spark Data Skew
3	The Research Of Scheduling Algorithms For Performance And Energy Consumption Under The Condition Of Data Skew
4	Load Balancing Algorithm Based On Data Skew Of MapReduce
5	The Research Of Skew With Sampling Technique In MapReduce
6	Research On Optimization Methods Of Dynamic Equilibrium Partition Method For Data Skew In Spark Shuffle
7	Research On Resource-aware Skew Mitigation For Mapreduce
8	MapReduce-based Resource Scheduling Model And Algorithm Research In Cloud Environment
9	Research On MapReduce Performance Optimization Based On Hadoop
10	Research And Optimization Of Join Algorithm Based On MapReduce