The Design And Implementation Of Log Statistics Analysis System Based On Hadoop

Posted on:2014-07-01

Degree:Master

Type:Thesis

Country:China

Candidate:B Zhu

Full Text:PDF

GTID:2298330452461030

Subject:Software engineering

Abstract/Summary:

With the development of the Internet, network data is growing exponentially. IDC data shows that global enterprise data has been increasing at a rate of 55%. Big data contains enormous commercial value and has attracted widespread attention from enterprises. However, big data also makes data synchronization, storage, and statistical analysis more difficult, and existing tools cannot deal with these issues effectively. Google was the first to publicize MapReduce, a system it had used to scale its data processing. Hadoop is an open source implementation of Google's MapReduce and is gradually becoming a core part of the computing infrastructure of many web companies. This paper aims to implement a Hadoop-based log statistics analysis system.

Based on a requirements analysis of the system, this paper designs an architecture that is built on a Hadoop cluster and integrates the data source layer, the storage layer, and the computation layer. On top of the cluster, it designs and implements four functions: log synchronization, customized statistical analysis jobs, task scheduling, and data query.

Log synchronization collects, aggregates, and moves log data from various sources to the Hadoop cluster so that it can use distributed storage. To meet diverse statistical analysis needs, customized statistical analysis supports three types of jobs: MapReduce, Streaming, and Hive. Task scheduling provides unified management and scheduling of all jobs that users submit. Data query provides a variety of ways to search the data stored in the cluster.

This paper uses a variety of open source technologies in the Hadoop ecosystem, including Flume NG, Sqoop, HDFS, MapReduce, Hive, and HBase. They span the collection and synchronization of data, the computational analysis, and the final query of results, covering the typical flow and technologies of log statistical analysis with Hadoop.

The system is developed in Java and shell, and the development tools include the Eclipse IDE, VIM, and the Hadoop eclipse-plugin. The Hadoop cluster is built on multiple CentOS machines for storage and computation. Users use the system to synchronize log data, submit analysis jobs, schedule tasks, and query results.
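
As an illustration of the kind of statistical job the abstract mentions, the map and reduce phases of a simple log count can be sketched in plain Java without a cluster. This is only a minimal sketch and not the thesis's implementation: the class name LogStatsSketch, the method names, and the three-field log format "<ip> <path> <status>" are all assumptions made for the example.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: counts log lines per status code, mimicking the
// map and reduce phases of a MapReduce job in a single process.
class LogStatsSketch {

    // "Map" step: derive the key to count from one log line.
    // Assumed line format (not from the thesis): "<ip> <path> <status>".
    public static String mapLine(String line) {
        String[] fields = line.trim().split("\\s+");
        // Lines that do not have exactly three fields share one key.
        return fields.length == 3 ? fields[2] : "malformed";
    }

    // "Reduce" step: sum the number of occurrences of each key.
    public static Map<String, Integer> countByStatus(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            counts.merge(mapLine(line), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
            "10.0.0.1 /index.html 200",
            "10.0.0.2 /missing 404",
            "10.0.0.1 /about.html 200");
        // Prints one count per status code found in the sample.
        System.out.println(countByStatus(sample));
    }
}
```

In a real Hadoop job the two steps would run as separate Mapper and Reducer tasks over data stored in HDFS; here they are ordinary methods so that the logic can be tried locally.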

Keywords/Search Tags:

Big data, Hadoop, MapReduce, Log statistics analysis

Related items

1	Design And Implementation Of The Weibo Statistical System Based On Hadoop
2	Research And Implementation Of Real-time Banking Statistics Report Based On Hadoop
3	The Design And Implementation Of A Set Of Mathematical Statistics Functions Based On Hadoop
4	The Design And Implementation Of Online Retailers Data Analysis System Based On Hadoop
5	The Mapreduce Model In The Hadoop Implementation Of Performance Analysis And Optimization Improvements
6	The Research And Implementation For College Student Behavior Analysis System Based On Hadoop Technology
7	The Research Of MapReduce Job Scheduling Algorithm Based On The Hadoop Platform
8	Research On Big Data Text Analysis Based On Hadoop Architecture
9	Design And Implementation Of Mass Data Analysis System Based On Hadoop
10	Design And Implementation Of The Data Analysis System Based On Hadoop