Research And Application Of Distributed System For Processing Massive Logs

Posted on:2015-01-23

Degree:Master

Type:Thesis

Country:China

Candidate:H F Jiang

Full Text:PDF

GTID:2268330425988864

Subject:Computer Science and Technology

Abstract/Summary:

With social progress and the continued development of information technology, the volume of generated data is growing exponentially. Traditional database technology can no longer meet the requirements of large-scale data storage and computation, which led to the emergence of Hadoop. Campus network devices accumulate large amounts of log data that cannot be fully utilized and have become a burden, so a distributed processing and analysis system is urgently needed. The author completed the following tasks, either independently or as a participant.

To improve the efficiency of the entire system, experimental analysis was conducted on three aspects: data import, data analysis/processing, and clustering efficiency. Corresponding optimization strategies are then provided. For data import efficiency, the strategy is to replace some of the ACK feedback in the pipeline with self-testing; to maintain the integrity and reliability of the data, the data retransmission method is updated by maintaining, on each datanode, a queue of received packets and a table storing the ids of received packets. For data processing efficiency, efficiency was tested and compared under three groups of related parameters, and a suitable value range for each parameter was obtained from these test cases. For clustering efficiency, the strategy is to add a Mapper input buffer and to strengthen the locality of the Task scheduler, that is, the map tasks assigned to each node in every iteration are the same as those assigned in the first iteration.

In this thesis, the Hadoop cloud computing framework is applied to the analysis of campus network logs. An analysis system for the access logs of campus network users, based on the improved Hadoop, is designed and developed. Among the various kinds of logs, the billing log is selected, which is closely related to user behavior in the campus network. From the log attribute access time, the twelve-dimensional time feature vector of Internet users is extracted. For clustering, the simple K-MEANS algorithm, which is widely used in practice, is chosen, and its Hadoop implementation in the Mahout library is used. In the conclusion, the clustering results are presented after statistical analysis, and the results of the various optimization strategies are compared.
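
The feature extraction and clustering steps described above can be illustrated with a minimal single-machine sketch. The abstract does not say how the twelve dimensions are defined, so the sketch assumes twelve two-hour slots of the day, each holding the fraction of a user's accesses in that slot. The function names (time_feature_vector, kmeans), the sample timestamps, and the choice of the first k points as initial centroids are all hypothetical and chosen only for reproducibility; the thesis itself uses the Hadoop implementation of K-MEANS in Mahout, which this plain-Python version does not reproduce.

```python
from datetime import datetime

BINS = 12  # assumption: twelve 2-hour slots per day (not specified in the abstract)


def time_feature_vector(access_times):
    """Return the fraction of a user's accesses that fall in each 2-hour slot."""
    counts = [0] * BINS
    for ts in access_times:
        counts[ts.hour // 2] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * BINS


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain K-means (Lloyd's algorithm); returns (centroids, labels).

    Initial centroids are simply the first k points (illustrative choice).
    """
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


if __name__ == "__main__":
    # Hypothetical access times for four users (two late-night, two morning).
    users = [
        [datetime(2014, 3, 1, 23, 10), datetime(2014, 3, 2, 22, 45)],
        [datetime(2014, 3, 1, 9, 0), datetime(2014, 3, 1, 10, 30)],
        [datetime(2014, 3, 3, 23, 5), datetime(2014, 3, 4, 23, 40), datetime(2014, 3, 5, 21, 15)],
        [datetime(2014, 3, 3, 8, 20), datetime(2014, 3, 4, 9, 50), datetime(2014, 3, 5, 11, 0)],
    ]
    vectors = [time_feature_vector(u) for u in users]
    _, labels = kmeans(vectors, k=2)
    print(labels)
```

Normalizing the counts to fractions keeps heavy and light users comparable, so the clusters reflect when users are online rather than how much; whether the thesis normalizes in this way is not stated.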

Keywords/Search Tags:

Hadoop, Billing Log, Data Import, Data Process, K-MEANS Iteration, Efficiency Optimization

Related items

1	Research On Import Business Process Optimization Of H Processing Trade Enterprises
2	Research On Hadoop Based Iterative Data Processing And Data Placement Strategy
3	Study On The Robust Optimization Of HADOOP Under The Restriction Of Cluster Computing Efficiency
4	Research On Web Log Data Analysis System Based On Hadoop
5	Design And Implementation Of Data Import And Preprocessing System
6	Research And Application Of Hadoop Distributed Clustering Mining Method Based On Virtual Machine
7	K-Means Algorithm Design And Implementation Based On Hadoop And Mahout
8	Research On The Cleaning Method Of Industrial Big Data Based On Hadoop
9	Research On The Application Mode Of Cloud Computing In The Billing System Of Telecom Operators
10	Design And Implementation Of Video Logs Analysis System Based On Hadoop