The Design And Implementation Of Log Statistics Analysis System Based On Hadoop

Posted on: 2014-07-01
Degree: Master
Type: Thesis
Country: China
Candidate: B Zhu
Full Text: PDF
GTID: 2298330452461030
Subject: Software engineering
Abstract/Summary:
With the development of the Internet, network data is growing exponentially. IDC data shows that global enterprise data has been increasing at a rate of 55%. Big data holds enormous commercial value and has attracted widespread attention from enterprises. However, it also makes data synchronization, storage, and statistical analysis more difficult, and existing tools cannot deal with these problems effectively. Google was the first to publicize MapReduce, a system it had used to scale its data processing. Hadoop is an open source implementation of Google's MapReduce and is gradually becoming a core part of the computing infrastructure of many web companies. This paper aims to implement a Hadoop-based log statistics analysis system.

Based on a requirements analysis of the system, this paper designs an architecture that is built on a Hadoop cluster and integrates a data source layer, a storage layer, and a computation layer. On top of the cluster, it designs and implements four functions: log synchronization, customized statistical analysis jobs, task scheduling, and data query.

Log synchronization collects, aggregates, and moves log data from various sources to the Hadoop cluster so that the data can use distributed storage. To meet diverse statistical analysis needs, customized statistical analysis supports three types of jobs: MapReduce, Streaming, and Hive. Task scheduling provides unified management and scheduling of all jobs that users submit. Data query provides a variety of ways to search the data stored in the cluster.

This paper uses a range of open source technologies from the Hadoop ecosystem, including Flume NG, Sqoop, HDFS, MapReduce, Hive, and HBase. They span the collection and synchronization of data, the computational analysis, and the final querying of results, covering the typical workflow and technologies of log statistical analysis with Hadoop.

The system is written in Java and shell, and the development tools include the Eclipse IDE, VIM, and the Hadoop Eclipse plugin. The system builds a Hadoop cluster on multiple CentOS machines for storage and computation. Users use the system to synchronize log data, submit analysis jobs, schedule tasks, and query results.
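
As an illustration of the kind of statistical job the abstract describes, the following is a minimal in-memory sketch of the map and reduce phases of a Streaming-style job that counts requests per HTTP status code. It is not taken from the thesis: the log layout (status code as the second-to-last space-separated field) and the names `map_line`, `reduce_pairs`, and `run_job` are assumptions made for this example, and the sketch runs locally without Hadoop.

```python
from itertools import groupby
from operator import itemgetter


def map_line(line):
    """Map phase: emit (status_code, 1) for one access-log line.

    Assumes the status code is the second-to-last space-separated
    field; this layout is hypothetical, not the thesis's log format.
    """
    fields = line.split()
    if len(fields) < 2:
        return []  # skip malformed lines
    return [(fields[-2], 1)]


def reduce_pairs(pairs):
    """Reduce phase: sum the counts for each key.

    Sorting first mimics the shuffle step, which delivers all
    values for one key to the reducer together.
    """
    ordered = sorted(pairs, key=itemgetter(0))
    return {
        key: sum(count for _, count in group)
        for key, group in groupby(ordered, key=itemgetter(0))
    }


def run_job(lines):
    """Run the map phase over every line, then reduce the emitted pairs."""
    pairs = [pair for line in lines for pair in map_line(line)]
    return reduce_pairs(pairs)
```

On a real cluster the two phases would typically be separate scripts that read from standard input and write tab-separated key/value pairs, with Hadoop performing the sort between them; here they are combined so the logic can be checked on a few sample lines.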
Keywords/Search Tags: Big data, Hadoop, MapReduce, Log statistics analysis