Research On Job Scheduling Strategy Based On Hadoop

Posted on: 2016-08-20
Degree: Master
Type: Thesis
Country: China
Candidate: Y X Dai
Full Text: PDF
GTID: 2308330473955933
Subject: Computer application technology
Abstract/Summary:
As the Internet keeps growing, more and more data needs to be stored and processed. Traditional server clusters cannot meet this demand, and cloud computing, which has attracted increasing attention from academia and industry, has become a leading solution. Many companies have launched their own cloud computing platforms, and most of them are built on Hadoop. Hadoop is an open-source distributed framework for large-scale data storage and parallel computing on large clusters. Application developers only need to follow the interfaces for distributed processing and do not have to attend to the underlying details. The performance of a Hadoop platform is closely tied to its job scheduling algorithm, and an appropriate scheduling algorithm can greatly improve resource utilization and system throughput. However, the existing job scheduling algorithms have many shortcomings, so optimizing and improving them is of considerable significance.
In view of the above, this thesis studies job scheduling algorithms. The main work is as follows:
1. This thesis introduces the overall architecture of Hadoop in detail, covering the Hadoop Distributed File System and the MapReduce parallel programming framework.
2. This thesis analyzes the job scheduling process on the Hadoop platform and focuses on several existing job scheduling algorithms (the FIFO, Capacity, and Fair scheduling algorithms), examining their underlying ideas and their main advantages and disadvantages.
3. Because existing job scheduling algorithms give insufficient consideration to data locality, this thesis proposes an improved scheduling algorithm that introduces a data prefetching technique. The algorithm loads a task in advance onto the node that will be assigned that task, which improves the efficiency of task execution on the node and the overall efficiency of the system.
4. Because existing job scheduling algorithms identify slow tasks inaccurately, this thesis proposes an improved scheduling algorithm that introduces the K-means algorithm. The algorithm borrows from SAMR the practice of recording the historical weights of each node, and uses K-means to divide these data into different clusters for different nodes. The algorithm can effectively predict the remaining execution time of a task and thereby improve the recognition rate of slow tasks.
5. To verify the effectiveness of the above algorithms, this thesis sets up a small-scale Hadoop cluster. The experiments analyze the algorithms from two aspects: the log information and the task execution phase weights. The experimental results demonstrate the feasibility and effectiveness of the improved algorithms.
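As a rough illustration of the idea in item 4, the following Python sketch clusters a node's historical phase weights with K-means and uses the resulting weights to estimate a task's remaining time. It is not the thesis's implementation: the function names, the three-phase weight tuples, the choice of the most populated cluster as the node's representative weights, and the progress-rate formula for remaining time are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Weights = Tuple[float, ...]  # assumed: one time fraction per task phase


def dist2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two weight vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points: Sequence[Weights]) -> Weights:
    """Component-wise mean of a non-empty list of weight vectors."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def kmeans(points: Sequence[Weights], k: int, max_iters: int = 100):
    """Plain Lloyd's K-means; the first k points seed the centroids
    (a simplification chosen to keep the sketch deterministic)."""
    if len(points) < k:
        raise ValueError("need at least k points")
    centroids: List[Weights] = list(points[:k])
    clusters: List[List[Weights]] = []
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        updated = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters


def representative_weights(history: Sequence[Weights], k: int) -> Weights:
    """Assumption: use the centroid of the most populated cluster."""
    centroids, clusters = kmeans(history, k)
    biggest = max(range(k), key=lambda i: len(clusters[i]))
    return centroids[biggest]


def progress_score(weights: Weights, phase: int, phase_fraction: float) -> float:
    """Weights of finished phases plus the completed part of the current one."""
    return sum(weights[:phase]) + weights[phase] * phase_fraction


def remaining_time(elapsed: float, score: float) -> float:
    """Progress-rate estimate: time left = (1 - score) / (score / elapsed)."""
    if not 0.0 < score <= 1.0:
        raise ValueError("score must be in (0, 1]")
    return elapsed * (1.0 - score) / score


if __name__ == "__main__":
    # Hypothetical history of (phase 1, phase 2, phase 3) weights on one node.
    history = [(0.30, 0.30, 0.40), (0.32, 0.28, 0.40), (0.28, 0.32, 0.40),
               (0.60, 0.20, 0.20), (0.62, 0.18, 0.20)]
    weights = representative_weights(history, k=2)
    score = progress_score(weights, phase=1, phase_fraction=0.5)
    print(weights, score, remaining_time(90.0, score))
```

A scheduler could compare such remaining-time estimates across running tasks to decide which ones to treat as slow; how the thesis makes that comparison is not stated in the abstract.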
Keywords/Search Tags: Hadoop, Job scheduling, Data prefetching, K-means