Research On Job Scheduling Strategy Based On Hadoop

Posted on:2016-08-20

Degree:Master

Type:Thesis

Country:China

Candidate:Y X Dai

Full Text:PDF

GTID:2308330473955933

Subject:Computer application technology

Abstract/Summary:

As the scale of the Internet continues to grow, more and more data needs to be stored and processed. Traditional server clusters cannot meet this demand, and cloud computing, which has attracted increasing attention from academia and industry, has become a leading solution. Many companies have launched their own cloud computing platforms, and most of them are built on Hadoop. Hadoop is an open-source distributed framework for large-scale data storage and parallel computing on large clusters. Application developers only need to follow the interface requirements for distributed processing and do not need to attend to the underlying details. The performance of a Hadoop platform is closely tied to its job scheduling algorithm, and an appropriate scheduling algorithm can greatly improve resource utilization and system throughput. However, the existing job scheduling algorithms have many shortcomings, so optimizing and improving them is of real significance.

In view of this situation, this thesis studies job scheduling algorithms. The main work is as follows:

1. This thesis introduces the overall architecture of Hadoop in detail, covering the Hadoop distributed file system and the MapReduce parallel programming framework.
2. This thesis analyzes the job scheduling process on the Hadoop platform and focuses on several existing job scheduling algorithms: the FIFO scheduling algorithm, the Capacity scheduling algorithm, and the Fair scheduling algorithm. It analyzes the ideas behind each algorithm and their main advantages and disadvantages.
3. Because existing job scheduling algorithms give insufficient consideration to data locality, this thesis proposes an improved scheduling algorithm that introduces a data prefetching technique. The algorithm loads a task in advance for the node that will be assigned the task, which improves the efficiency of task execution on the node and the overall efficiency of the system.
4. Because existing job scheduling algorithms do not identify slow tasks accurately, this thesis proposes an improved scheduling algorithm that introduces the K-means algorithm. The algorithm borrows from SAMR the practice of recording the historical weights of each node, and uses K-means to divide these data into different clusters for different nodes. The algorithm can effectively predict the remaining execution time of a task and thereby improve the recognition rate of slow tasks.
5. To verify the effectiveness of the above algorithms, this thesis sets up a small-scale Hadoop cluster. The experiments evaluate the algorithms from two aspects: analysis of the log information and the weights of the task execution phases. The experimental results demonstrate the feasibility and effectiveness of the improved algorithms.
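
The idea in item 4 can be illustrated with a minimal Python sketch. It clusters historical stage-weight records with a plain K-means and then uses one set of weights to estimate a task's remaining time from its progress rate. The function names, the three-stage weight vectors, the sample data, and the remaining-time formula are all illustrative assumptions; this is not the thesis's implementation.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def kmeans(points, k, iters=20, seed=0):
    """Plain K-means: returns k centroids for the given weight vectors."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k distinct starting points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

def estimate_remaining(weights, stage, frac, elapsed):
    """Estimate remaining time from stage weights (assumed to sum to 1).

    stage   -- index of the stage the task is currently in
    frac    -- fraction of the current stage already completed (0..1)
    elapsed -- time the task has been running so far
    """
    progress = sum(weights[:stage]) + weights[stage] * frac
    if progress <= 0:
        raise ValueError("no progress yet; rate cannot be estimated")
    rate = progress / elapsed          # progress per unit time
    return (1.0 - progress) / rate     # time to finish at that rate

# Hypothetical historical stage weights recorded on four nodes.
history = [
    (0.10, 0.20, 0.70),
    (0.12, 0.18, 0.70),
    (0.60, 0.30, 0.10),
    (0.58, 0.32, 0.10),
]
centroids = kmeans(history, k=2)
```

In such a sketch, a scheduler would pick the centroid closest to a node's recorded history, pass it to `estimate_remaining`, and treat tasks with unusually long estimates as candidates for speculative re-execution.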

Keywords/Search Tags:

Hadoop, Job scheduling, Data prefetching, K-means

Related items

1	Research On Optimization And Improvement Of MapReduce Job Scheduling Algorithm
2	K-Means Algorithm Design And Implementation Based On Hadoop And Mahout
3	A Research And Implementation With Improved K-Means Clustering Algorithm To Image Retrieval System Based On Hadoop Platform
4	Research On Machine Learning Clustering Algorithms In The Hadoop Development Environment
5	Research Of Hadoop Job Scheduling Algorithm In Big Data
6	Research And Improvement Of Task Scheduling Algorithm In Hadoop
7	Research On Scheduling Strategy Based On Hadoop
8	Research Of Job Scheduling Algorithms In Hadoop Platform
9	Research On Web Log Data Analysis System Based On Hadoop
10	The Research On Distributed Task Scheduling Algorithms Based On Hadoop Platform