| With the rapid development of the Internet, the number of Internet users and the volume of digital information have grown explosively, making big data analysis and processing an active research area. After Google introduced its big data computation framework MapReduce and its distributed file system GFS, the open-source software Hadoop, which was designed on these ideas, developed rapidly and became the most popular platform for big data processing. Hadoop provides a simple interface that lets developers focus only on the map and reduce functions, and it arranges the execution of jobs and tasks through job scheduling without user intervention. The job scheduler is one of the core modules in Hadoop; its goal is to maximize the use of cluster resources by ordering the execution of jobs and selecting tasks sensibly. Hadoop currently offers three job schedulers: the FIFO Scheduler, the Capacity Scheduler and the Fair Scheduler. The FIFO Scheduler is simple and easy to implement, but it does not support resource sharing among multiple users and multiple jobs. The Capacity Scheduler and the Fair Scheduler support sharing of cluster resources, increase throughput and decrease response time, but they require complicated configuration and an administrator who fully understands the cluster's resources and the types of users and jobs. Based on domestic and foreign research on Hadoop, this paper analyzes the core ideas and scheduling policies of the existing scheduling algorithms and improves the slot allocation algorithm in the Fair Scheduler. It then analyzes their advantages and disadvantages and proposes a scheduling algorithm based on Bayesian classification to overcome the complicated configuration of the existing algorithms. Through Bayesian learning and classification, the algorithm aims to ensure that jobs run on nodes without overloading them. 
The paper then pre-processes jobs, classifying them as CPU-intensive or I/O-intensive according to their resource requirements, so that computing resources are used more effectively. The contents of the paper are as follows. First, it analyzes and compares in depth the FIFO Scheduler, Capacity Scheduler and Fair Scheduler in Hadoop, covering their core ideas, configuration, pseudo-code and flowcharts with complexity descriptions, features, advantages and disadvantages. It then improves the slot allocation algorithm in the Fair Scheduler so that remaining slots are allocated as fairly as possible. Second, the paper proposes a scheduling algorithm that reduces the complicated configuration required by the existing scheduling algorithms. The algorithm classifies jobs as schedulable or not schedulable using a Bayesian classifier, which learns from the job scheduling and execution history according to the features of jobs and nodes. It thus schedules jobs so that, as far as possible, nodes are not overloaded, improving scheduling accuracy and node resource usage. Third, the paper proposes a pre-processing step that classifies jobs as CPU-intensive or I/O-intensive and schedules them separately to improve resource usage. Fourth, the paper selects typical jobs of different types for experiments and gives assessment methods for the algorithm. It then reports results for scheduling accuracy, response time and cluster resource usage ratio, and analyzes them in comparison with the existing scheduling algorithms. |
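
The idea of classifying jobs as schedulable or not schedulable from execution history can be illustrated with a minimal naive Bayes sketch. This is an assumption-laden illustration, not the paper's implementation: the class name `NaiveBayesJobClassifier`, the two categorical features (job type and node load level), the label strings, and the Laplace smoothing constant `alpha` are all hypothetical choices made for this example.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesJobClassifier:
    """Hypothetical naive Bayes classifier over categorical job/node features."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                          # Laplace smoothing (assumed)
        self.class_counts = Counter()               # label -> number of observations
        self.feature_counts = defaultdict(Counter)  # (label, index) -> value counts
        self.feature_values = defaultdict(set)      # index -> values seen so far

    def observe(self, features, label):
        """Record one entry of scheduling history (features and its outcome)."""
        self.class_counts[label] += 1
        for i, value in enumerate(features):
            self.feature_counts[(label, i)][value] += 1
            self.feature_values[i].add(value)

    def log_score(self, features, label):
        """Log of prior times smoothed per-feature likelihoods for one label."""
        total = sum(self.class_counts.values())
        n_classes = len(self.class_counts)
        n_label = self.class_counts[label]
        score = math.log((n_label + self.alpha) / (total + self.alpha * n_classes))
        for i, value in enumerate(features):
            count = self.feature_counts[(label, i)][value]
            n_values = len(self.feature_values[i] | {value})
            score += math.log((count + self.alpha) / (n_label + self.alpha * n_values))
        return score

    def classify(self, features, default="schedulable"):
        """Return the most probable label; fall back to a default with no history."""
        if not self.class_counts:
            return default
        return max(self.class_counts, key=lambda lbl: self.log_score(features, lbl))


# Toy history: (job type, node load level) -> whether the node stayed healthy.
history = [
    (("cpu", "low"), "schedulable"),
    (("cpu", "low"), "schedulable"),
    (("io", "low"), "schedulable"),
    (("io", "high"), "schedulable"),
    (("cpu", "high"), "not_schedulable"),
    (("cpu", "high"), "not_schedulable"),
]

clf = NaiveBayesJobClassifier()
for features, label in history:
    clf.observe(features, label)

decision_busy_node = clf.classify(("cpu", "high"))
decision_idle_node = clf.classify(("io", "low"))
```

In a scheduler built along these lines, each completed task would feed one `observe` call, with the label derived from whether the node became overloaded, so the classifier adapts without manual configuration. The features used in the actual paper may differ from the two shown here.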