
Research And Application Of Distributed System For Processing Massive Logs

Posted on: 2015-01-23 | Degree: Master | Type: Thesis
Country: China | Candidate: H F Jiang | Full Text: PDF
GTID: 2268330425988864 | Subject: Computer Science and Technology
Abstract/Summary:
Along with social progress and the continued development of information technology, the volume of data being generated is growing exponentially. Traditional database technology can no longer meet the requirements of large-scale data storage and computation, which led to the emergence of Hadoop. Campus network devices accumulate large amounts of log data that cannot be fully utilized and so become a burden. A distributed processing and analysis system is therefore urgently needed.

The author completed the following tasks, either independently or as a participant. To improve the efficiency of the entire system, experimental analysis was conducted on three aspects: data import, data analysis and processing, and clustering efficiency. Corresponding optimization strategies are then proposed. For data import, the strategy is to replace some of the pipeline's ACK feedback with a self-test; to maintain data integrity and reliability, the data retransmission method is updated so that each datanode maintains a queue of received packets and a table of received packet IDs. For data processing, efficiency is tested and compared under three groups of related parameters, and a suitable value range for each parameter is obtained from these test cases. For clustering, the strategy is to add a Mapper input buffer and strengthen the locality of the Task scheduler, that is, in every iteration the map tasks are allocated to the nodes in the same way as in the first iteration.

In this thesis, the Hadoop cloud computing framework is applied to the analysis of campus network logs. An analysis system for the access logs of campus network users, based on the improved Hadoop, is designed and developed. Among the various kinds of logs, the billing log is selected because it is closely related to user behavior in the campus network. From the log attribute access time, a twelve-dimensional time feature vector of Internet users is extracted. For clustering, the simple K-MEANS algorithm, which is widely used in practice, is chosen, using its Hadoop implementation in the Mahout library. In the conclusion, the clustering results are presented after statistical analysis, and the results of the various optimization strategies are compared.
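The feature extraction and clustering steps described in the abstract can be illustrated with a small single-machine sketch. It is not the thesis implementation: the abstract does not say how the twelve dimensions are defined, so the sketch assumes twelve two-hour buckets of the day, each holding the fraction of a user's accesses that fall in it. The record layout `(user_id, timestamp)`, the timestamp format, and the function names are hypothetical as well, and the plain in-memory K-means loop only stands in for the Mahout/Hadoop implementation the thesis actually uses.

```python
from collections import defaultdict
from datetime import datetime


def time_feature_vectors(records):
    """Build one 12-dimensional vector per user from access times.

    ASSUMPTION: dimension i is the share of the user's accesses whose hour
    falls in [2*i, 2*i + 2). The abstract does not define the dimensions.
    `records` is an iterable of (user_id, "YYYY-MM-DD HH:MM:SS") pairs,
    a hypothetical layout for billing-log entries.
    """
    counts = defaultdict(lambda: [0] * 12)
    for user, timestamp in records:
        hour = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").hour
        counts[user][hour // 2] += 1
    vectors = {}
    for user, buckets in counts.items():
        total = sum(buckets)
        vectors[user] = [n / total for n in buckets]
    return vectors


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, initial_centroids, iterations=10):
    """Plain K-means: assign each point to its nearest centroid, then
    move each centroid to the mean of its members, for a fixed number
    of iterations. Empty clusters keep their previous centroid."""
    centroids = [list(c) for c in initial_centroids]
    assignment = []
    for _ in range(iterations):
        assignment = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                centroids[k] = [sum(col) / len(members)
                                for col in zip(*members)]
    return assignment, centroids
```

In a run over real data, the vectors returned by `time_feature_vectors` would be passed to `kmeans` as `points`; the number of clusters and the initial centroids are choices the abstract does not specify.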
Keywords/Search Tags:Hadoop, Billing Log, Data Import, Data Process, K-MEANS Iteration, Efficiency Optimization