
Research & Improvement of a Hadoop-Based Model for Processing Massive Log Data

Posted on: 2014-01-24
Degree: Master
Type: Thesis
Country: China
Candidate: X Cao
Full Text: PDF
GTID: 2248330398995281
Subject: Computer application technology
Abstract/Summary:
As the Internet keeps growing, many Internet service providers need to process enormous volumes of data. MapReduce is a framework originally designed by Google to exploit large clusters for parallel computation. It is based on an implicit parallel programming model that provides an easy and convenient way to express certain kinds of distributed computations, particularly those that process large data sets. MapReduce targets large commodity clusters consisting of thousands of nodes built from commodity hardware. The programming model hides from the programmer all the complexity involved in managing parallelization: the programmer only has to implement the computational function that executes on each node and the function that combines the intermediate results into the final result of the application.

Hadoop is an implementation of MapReduce. It is an open-source cluster platform that runs on clusters of commodity PCs and has already been adopted by many well-known IT companies; it can be regarded as the most popular open-source cloud computing software. However, Hadoop is still a young platform, and many key areas remain to be improved. The research and contributions of this thesis are as follows:

1) First, this paper introduces the background knowledge of the research topic, such as the significance of the research, the Hadoop platform, and its related technologies.
Besides that, this paper also covers the overall framework of the open-source cloud computing platform. It mainly covers two parts of Hadoop, MapReduce and HDFS, and analyzes the source code and execution procedure of MapReduce.

2) Building on a detailed analysis of the three default scheduling algorithms (FIFO, the Capacity scheduler, and the Fair scheduler), including each algorithm's design idea, features, and implementation, a hierarchical scheduling algorithm based on a red-black tree is proposed, and the required data structures, design idea, and algorithm flow are described in detail.

3) To process massive log data, a Hadoop cluster system was built. This system helps programmers develop distributed parallel programs without understanding the low-level details of Hadoop.

4) Finally, to verify the rationality and effectiveness of the proposed job scheduling algorithm, two groups of tests were designed and the system performance was recorded. Compared with the existing scheduling algorithms, the results show that the proposed algorithm improves CPU utilization and average response time over the traditional scheduling algorithms.

The thesis closes with a summary and an outlook on future work.
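The abstract does not reproduce the scheduler's code, but the core idea (keeping pending jobs ordered by priority in a balanced search tree so the highest-priority job can be dispatched in logarithmic time) can be sketched as follows. This is an illustrative sketch, not the thesis's implementation: Python's standard library has no red-black tree, so `bisect.insort` over a sorted list stands in for the tree's ordering (same dequeue semantics, O(n) insert instead of O(log n)), and the names `Job` and `HierarchicalScheduler` are hypothetical.

```python
import bisect
from dataclasses import dataclass, field

@dataclass(order=True)
class Job:
    # Jobs compare by (priority, submit_time), so the scheduler always
    # dequeues the highest-priority (lowest number), oldest job first.
    priority: int
    submit_time: int
    name: str = field(compare=False)

class HierarchicalScheduler:
    """Keeps pending jobs in sorted order.

    A red-black tree, as in the thesis, gives O(log n) insert and
    delete-min; bisect.insort on a list preserves the same ordering
    semantics for this sketch at O(n) insert cost.
    """
    def __init__(self):
        self._queue = []

    def submit(self, job: Job) -> None:
        bisect.insort(self._queue, job)  # insert while keeping the queue sorted

    def next_job(self) -> Job:
        return self._queue.pop(0)        # smallest key = next job to run

sched = HierarchicalScheduler()
sched.submit(Job(priority=2, submit_time=0, name="log-etl"))
sched.submit(Job(priority=1, submit_time=1, name="report"))
sched.submit(Job(priority=1, submit_time=0, name="index"))
print(sched.next_job().name)  # prints "index": priority 1, earliest submission
```

Replacing the sorted list with a true red-black tree keeps submissions cheap even with thousands of queued jobs, which is the scenario the thesis's cluster experiments target.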
Keywords/Search Tags: Hadoop, Job scheduling, Red-black tree