
Research & Improvement of a Hadoop-Based Model for Processing Massive Log Data

Posted on: 2014-01-24
Degree: Master
Type: Thesis
Country: China
Candidate: X Cao
Full Text: PDF
GTID: 2248330398995281
Subject: Computer application technology
Abstract/Summary:
As the Internet keeps growing, many Internet service providers need to process enormous volumes of data. MapReduce is a framework originally designed by Google to exploit large clusters for parallel computation. It is based on an implicit parallel programming model that provides an easy and convenient way to express certain kinds of distributed computations, particularly those that process large data sets. MapReduce targets large commodity clusters consisting of thousands of nodes built from commodity hardware. The programming model hides from the programmer all the complexity involved in managing parallelization: the programmer only has to implement the computational function that executes on each node and the function that combines the intermediate results into the final result of the application.

Hadoop is an implementation of MapReduce. It is an open-source cluster platform that runs on clusters of commodity PCs and has already been adopted by many well-known IT companies; it can be regarded as the most popular open-source cloud computing software. However, Hadoop is still a young platform, and many key areas remain to be improved. The research and contributions of this thesis are as follows:

1) First, this paper introduces the background knowledge of the research topic, such as the significance of the research, the Hadoop platform, and its related technologies.
Besides that, this paper also covers the overall framework of the open-source cloud computing platform. It mainly covers two parts of Hadoop, MapReduce and HDFS, and analyzes the source code and execution procedure of MapReduce.

2) Building on a detailed analysis of the three default scheduling algorithms (FIFO, the Capacity scheduler, and the Fair scheduler), including each algorithm's design idea, features, and implementation, a hierarchical scheduling algorithm based on a red-black tree is proposed, and the required data structures, design idea, and algorithm flow are described in detail.

3) To process massive log data, a Hadoop cluster system was built. This system helps programmers develop distributed parallel programs without understanding the low-level details of Hadoop.

4) Finally, to verify the rationality and effectiveness of the proposed job scheduling algorithm, two groups of tests were designed and the system performance was recorded. Compared with the existing scheduling algorithms, the results show that the proposed algorithm improves CPU utilization and average response time over the traditional scheduling algorithms.

The thesis closes with a summary and an outlook on future work.
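The abstract does not reproduce the scheduler's code, but the core idea (keeping pending jobs ordered by priority in a balanced search tree so the highest-priority job can be dispatched in logarithmic time) can be sketched as follows. This is an illustrative sketch, not the thesis's implementation: Python's standard library has no red-black tree, so `bisect.insort` over a sorted list stands in for the tree's ordering (same dequeue semantics, O(n) insert instead of O(log n)), and the names `Job` and `HierarchicalScheduler` are hypothetical.

```python
import bisect
from dataclasses import dataclass, field

@dataclass(order=True)
class Job:
    # Jobs compare by (priority, submit_time), so the scheduler always
    # dequeues the highest-priority (lowest number), oldest job first.
    priority: int
    submit_time: int
    name: str = field(compare=False)

class HierarchicalScheduler:
    """Keeps pending jobs in sorted order.

    A red-black tree, as in the thesis, gives O(log n) insert and
    delete-min; bisect.insort on a list preserves the same ordering
    semantics for this sketch at O(n) insert cost.
    """
    def __init__(self):
        self._queue = []

    def submit(self, job: Job) -> None:
        bisect.insort(self._queue, job)  # insert while keeping the queue sorted

    def next_job(self) -> Job:
        return self._queue.pop(0)        # smallest key = next job to run

sched = HierarchicalScheduler()
sched.submit(Job(priority=2, submit_time=0, name="log-etl"))
sched.submit(Job(priority=1, submit_time=1, name="report"))
sched.submit(Job(priority=1, submit_time=0, name="index"))
print(sched.next_job().name)  # prints "index": priority 1, earliest submission
```

Replacing the sorted list with a true red-black tree keeps submissions cheap even with thousands of queued jobs, which is the scenario the thesis's cluster experiments target.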
Keywords/Search Tags: Hadoop, Job scheduling, Red-black tree