
Hadoop Task Scheduling Algorithm Optimization About Data Locality

Posted on: 2017-05-04
Degree: Master
Type: Thesis
Country: China
Candidate: X Chen
Full Text: PDF
GTID: 2348330491954808
Subject: Computer technology
Abstract/Summary:
With the growth of the Internet and the mobile Internet, the number of connected devices and users has exploded and data volumes have grown exponentially, so traditional data processing methods face great difficulties. In traditional data processing, data is usually processed on a single machine, which is a centralized approach. For big data this takes an unreasonably long time. Compared with traditional methods, cloud computing offers many advantages, such as very large scale, virtualization, high reliability, versatility, scalability, on-demand service, and very low cost, and it has therefore developed at an unprecedented pace. Against this background, Doug Cutting, the founder of Apache Lucene, created Hadoop, based on three papers published by Google: MapReduce [2], the Google File System [3], and Bigtable [4]. Hadoop processes data in a fully distributed way, in contrast to the centralized approach. In Hadoop 2.x, YARN (Yet Another Resource Negotiator) replaced the original MapReduce runtime as Hadoop's resource management layer. As cloud computing has developed rapidly, the open-source Hadoop framework has gone through a number of versions, its resource management system has moved to YARN, and job scheduling has steadily matured. Research on Hadoop scheduling algorithms covers many directions, but data locality has remained the most important. In addition, in enterprise deployments most Hadoop workloads consist of small jobs, and these small jobs tend to have poor locality: their data is frequently transferred across the cluster, which can cause network congestion and delay, so improving data locality is particularly necessary. Building on related research, we introduce a locality algorithm into Hadoop. This algorithm has been shown to improve data locality while still guaranteeing basic response time and execution time.
The core idea of the algorithm is to weigh a task's waiting time against its data transmission time in order to decide whether the task should be executed locally. To this end, this thesis describes in detail how the YARN scheduler works, covering its basic architecture, scheduling strategy, and preemption model. It also improves the data locality algorithm by adding a matching mechanism: when the label match for the requesting node fails, the scheduler skips the evaluation of waiting time and transmission time for that scheduling round and assigns the task to the node directly. The improved data locality algorithm is added to the Fair Scheduler. For the experiments, a recent version of Hadoop is deployed on a homogeneous cluster, and the improved algorithm is compared with the Fair Scheduler in terms of data locality and job execution time. For completeness of the experimental verification, data along four different dimensions is used to verify the results. The experimental results show that, when the cluster runs many small jobs, the improved algorithm raises data locality by an average of 10% relative to the Fair Scheduler.
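The trade-off between waiting time and transmission time described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The names `Task`, `decide`, `expected_wait_s`, `bandwidth_bytes_per_s`, and `label_matched` are hypothetical. The transfer-time estimate (input size divided by bandwidth) and the reading of the matching mechanism (a failed label match skips the evaluation and assigns the task directly) are assumptions drawn only from the abstract's wording.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical task description; not a YARN class."""
    input_bytes: int          # size of the task's input split
    replica_nodes: frozenset  # nodes that hold a replica of that split


def decide(task, node, expected_wait_s, bandwidth_bytes_per_s, label_matched=True):
    """Decide what to do when `node` offers a free container for `task`.

    Returns "run_local", "run_remote", or "wait".
    """
    # The offering node already holds the data: run there, nothing to transfer.
    if node in task.replica_nodes:
        return "run_local"

    # Assumed reading of the matching mechanism: on a failed label match,
    # skip the time comparison for this round and assign the task directly.
    if not label_matched:
        return "run_remote"

    # Assumed cost model: transfer time = input size / available bandwidth.
    transfer_s = task.input_bytes / bandwidth_bytes_per_s

    # Wait for a local slot only if that is expected to be quicker than
    # moving the data to this node.
    return "wait" if expected_wait_s < transfer_s else "run_remote"


MIB = 1024 * 1024
task = Task(input_bytes=128 * MIB, replica_nodes=frozenset({"node1"}))
```

With a 128 MiB split and 100 MiB/s of bandwidth, the estimated transfer takes 1.28 s, so `decide(task, "node2", 0.5, 100 * MIB)` chooses to wait for a local slot, while an expected wait of 5 s makes remote execution the cheaper option.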
Keywords/Search Tags:YARN, scheduler, Data Locality, Algorithm