
The Research Of Hadoop Scheduling Algorithm And Improvement Strategy

Posted on: 2014-02-17
Degree: Master
Type: Thesis
Country: China
Candidate: P Li
GTID: 2248330398471571
Subject: Electronic and communication engineering
Abstract/Summary:
We now live in a networked era in which people's lives are inextricably linked with the Internet. People make friends, shop, and search for whatever information they need online. This boom in online life has produced a large amount of data. For example, Facebook stores billions of its users' pictures on its servers, and the spiders of Google and Baidu collect terabytes of web pages every day. Traditional technologies cannot meet the demands of this mass of information, and the concept of "cloud computing" arose against this background.

Hadoop is an open-source distributed computing platform born in the context of "cloud computing" and "big data". It draws on Google's GFS and MapReduce technology. Developers can easily develop and run applications that handle massive data without considering the details of distributed computing. Because Hadoop is open source, has strong distributed computing capability, and is easy for developers to use, it has become the most famous distributed computing platform in just a few years.

This thesis makes a thorough study of Hadoop. First, we introduce the concept and technical architecture of "cloud computing". Then we study Hadoop itself, analyzing in depth the architecture model, working mechanism, and reliability of its key technologies, HDFS and MapReduce. Finally, we investigate Hadoop's job scheduling mechanism. After studying the three most commonly used schedulers (the FIFO Scheduler, the Capacity Scheduler, and the Fair Scheduler), we propose a series of improvement ideas, namely job matching, job combination, and a priority strategy, and then implement a new scheduling algorithm based on those ideas.
The new scheduling algorithm is called the Dynamic Priority Based Compose Scheduler, or DPBC Scheduler. The DPBC Scheduler uses the principle of job matching to improve scheduling performance. During job matching analysis, it uses a dynamic priority strategy to update the degree of match in real time. A job combination strategy is also added, so that the priority strategy is applied within a job group rather than across the entire job queue, which reduces the scheduling burden. After final coding and testing, the improved algorithm achieves the desired goals and substantially improves the performance of the system.
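The abstract does not give the DPBC Scheduler's actual rules, so the following Python sketch is only a toy illustration of how the three ideas (job combination, job matching, dynamic priority) could fit together. Everything in it is an assumption made for illustration: the `Job` fields, the `kind` label used for grouping and matching, the aging rule in `priority()`, and the function names `group_jobs` and `pick_job` are hypothetical and are not taken from the thesis or from Hadoop's scheduler API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    kind: str            # hypothetical label used for grouping, e.g. "cpu" or "io"
    base_priority: int
    waited: int = 0      # scheduling rounds this job has been passed over

    def priority(self) -> int:
        # Assumed dynamic-priority rule: base value plus an aging bonus.
        return self.base_priority + self.waited


def group_jobs(jobs):
    """Job combination (assumed form): bucket jobs by kind."""
    groups = defaultdict(list)
    for job in jobs:
        groups[job.kind].append(job)
    return groups


def pick_job(groups, node_kind):
    """Job matching (assumed form): rank only the group matching the node."""
    candidates = groups.get(node_kind) or []
    if not candidates:
        return None
    # Priorities are compared inside one group, not over the whole queue.
    chosen = max(candidates, key=lambda j: j.priority())
    candidates.remove(chosen)
    # Jobs left waiting in this group age, which raises their priority.
    for job in candidates:
        job.waited += 1
    return chosen


if __name__ == "__main__":
    jobs = [Job("a", "cpu", 1), Job("b", "cpu", 3), Job("c", "io", 2)]
    groups = group_jobs(jobs)
    print(pick_job(groups, "cpu").name)
    print(pick_job(groups, "io").name)
```

In this sketch, ranking cost per scheduling decision depends on the size of one group rather than the full queue, which is one plausible reading of how grouping reduces the scheduling burden.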
Keywords/Search Tags: Cloud Computing, Distributed, Hadoop, MapReduce, Scheduling algorithm