
Research on Scheduling Algorithms for the Hadoop Distributed System

Posted on: 2017-03-22
Degree: Master
Type: Thesis
Country: China
Candidate: C Zhang
GTID: 2308330485460360
Subject: Electronic and communication engineering
Abstract/Summary:
In recent years, with the rapid development of information technology, the volume of data in the world has grown explosively. Traditional data processing techniques struggle to meet current demands, so big data processing technologies have become a key research topic. Hadoop has become the de facto standard for big data processing thanks to its low cost, mature ecosystem, and ability to solve problems quickly. Hadoop consists of the HDFS distributed file system and the MapReduce programming model. In a distributed computing environment, task scheduling is one of Hadoop's key functions and has a direct impact on cluster performance. Task scheduling algorithms for Hadoop are therefore an active research topic in big data.

In heterogeneous clusters running data-intensive workloads, Hadoop's existing task scheduling algorithms suffer from poor data locality and imbalanced cluster load. This thesis studies these problems by analyzing HDFS and the MapReduce scheduling algorithms, and proposes two methods to improve cluster performance in a heterogeneous, data-intensive environment. The main contributions are:

(1) A heterogeneous data placement strategy. To account for the heterogeneity of a Hadoop cluster, the performance of each node is first evaluated from its physical configuration using the EHP (heterogeneous node performance evaluation) algorithm, and data is then placed on the nodes in proportion to their performance.

(2) A dynamic delay scheduling algorithm based on load balancing (LBDDS). LBDDS is proposed after analyzing the deficiencies of the existing delay scheduling algorithm. Building on the earlier DDS, it replaces the statically configured delay with a dynamic delay time calculated from a mathematical model. When a node requests a task, its load level is analyzed and calculated from the node's performance parameters at that moment. The system load degree is calculated with the entropy-weighted TOPSIS method, the cluster nodes are classified as low-, medium-, or high-load nodes, and tasks are assigned according to load degree. Medium- and low-load nodes are preferentially assigned tasks whose data is local; if no node holding the local data becomes available within the waiting time, a task is assigned to the node with the lowest current load. A high-load node receives no new task and is evaluated again the next time it becomes free, after its running tasks complete.

Experiments show that the data placement strategy based on heterogeneous node performance improves task data locality and reduces total task completion time. With the default data placement strategy, LBDDS is slightly inferior to the delay scheduling algorithm in data locality, but superior to it in total task completion time and system load balance. Finally, experiments show that the combined strategy achieves data locality similar to that of the delay scheduling algorithm running on heterogeneous data placement, and outperforms the delay scheduling algorithm in total task completion time and system load balance.
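The proportional placement idea in contribution (1) can be illustrated with a minimal sketch. The abstract does not give the EHP formula, so the per-node performance scores are taken as given inputs here; the function name `allocate_blocks`, the node names, and the largest-remainder rounding are assumptions made for illustration, not the thesis's implementation.

```python
from fractions import Fraction


def allocate_blocks(total_blocks, perf_scores):
    """Split total_blocks across nodes in proportion to their performance scores.

    perf_scores maps a (hypothetical) node name to a non-negative score,
    standing in for the output of a node performance evaluation step.
    """
    total_score = sum(perf_scores.values())
    if total_score <= 0:
        raise ValueError("performance scores must sum to a positive value")

    # Exact fractional share of the blocks for each node.
    shares = {
        node: Fraction(score) * total_blocks / total_score
        for node, score in perf_scores.items()
    }
    # Start from the integer part of each share.
    alloc = {node: int(share) for node, share in shares.items()}

    # Hand the blocks lost to rounding to the largest fractional remainders.
    leftover = total_blocks - sum(alloc.values())
    by_remainder = sorted(
        shares, key=lambda node: shares[node] - alloc[node], reverse=True
    )
    for node in by_remainder[:leftover]:
        alloc[node] += 1
    return alloc


if __name__ == "__main__":
    # A node with three times the score receives three times the blocks.
    print(allocate_blocks(60, {"node-a": 1, "node-b": 2, "node-c": 3}))
```

Largest-remainder rounding is used only so that the per-node counts always add up to the requested total; any other rounding rule with that property would serve equally well.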
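The load classification step in contribution (2) can be sketched with the standard entropy weight method followed by TOPSIS. The abstract names the method but not its inputs, so everything specific below is an assumption: each row is one node, each column is a load indicator where a larger value means a heavier load (for example CPU or memory utilization), the load degree is taken as closeness to the most heavily loaded profile, and the cut-offs 1/3 and 2/3 are hypothetical.

```python
import math


def entropy_weights(matrix):
    """Entropy weight method: indicators that vary more across nodes weigh more."""
    n, m = len(matrix), len(matrix[0])
    divergences = []
    for j in range(m):
        column = [row[j] for row in matrix]
        total = sum(column)
        if total == 0:
            divergences.append(0.0)
            continue
        probs = [x / total for x in column]
        entropy = -sum(p * math.log(p) for p in probs if p > 0) / math.log(n)
        divergences.append(max(0.0, 1.0 - entropy))
    spread = sum(divergences)
    if spread <= 1e-12:
        # No indicator distinguishes the nodes; fall back to equal weights.
        return [1.0 / m] * m
    return [d / spread for d in divergences]


def load_degrees(matrix):
    """TOPSIS closeness of each node to the most heavily loaded profile (0..1)."""
    weights = entropy_weights(matrix)
    m = len(weights)
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) or 1.0 for j in range(m)]
    weighted = [[weights[j] * row[j] / norms[j] for j in range(m)] for row in matrix]
    heaviest = [max(r[j] for r in weighted) for j in range(m)]
    lightest = [min(r[j] for r in weighted) for j in range(m)]
    degrees = []
    for r in weighted:
        d_heavy = math.dist(r, heaviest)
        d_light = math.dist(r, lightest)
        total = d_heavy + d_light
        degrees.append(d_light / total if total > 0 else 0.0)
    return degrees


def classify(degree, low_cut=1 / 3, high_cut=2 / 3):
    """Map a load degree to a class; the cut-offs are hypothetical."""
    if degree < low_cut:
        return "low"
    if degree < high_cut:
        return "medium"
    return "high"


if __name__ == "__main__":
    # Columns: CPU utilization, memory utilization (illustrative values, at least two nodes).
    samples = [[0.1, 0.2], [0.5, 0.5], [0.9, 0.8]]
    for degree in load_degrees(samples):
        print(round(degree, 3), classify(degree))
```

A scheduler following the abstract's description would then prefer low- and medium-load nodes for data-local tasks and skip high-load nodes until they are re-evaluated.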
Keywords/Search Tags: Heterogeneous, MapReduce, Performance Evaluation, Load Balancing, Delay Scheduling