
A Log Based Augmented Distributed Clustering Algorithm

Posted on: 2018-12-06
Degree: Master
Type: Thesis
Country: China
Candidate: X W Chen
GTID: 2348330512983116
Subject: Software engineering
Abstract/Summary:
With the spread of cloud computing, large-scale online service systems have drawn growing public attention, and many such systems have entered the market one after another. As these systems grow more complex, however, the log data they generate reaches the TB level, while the systems themselves are required to provide 99.99% availability around the clock. Verifying such systems correctly and efficiently is therefore very important, because it directly affects their stability in real production environments. Log data plays an important role in analyzing large-scale online service systems, and mature log analysis technology will contribute greatly to their stable operation and efficient maintenance. Efficient log analysis has consequently become both a difficulty and a research hot spot in the verification and debugging of large-scale online service systems. Log clustering analysis aims to reduce the human cost of developing and maintaining these systems, and to help engineers and researchers solve problems and ensure system robustness.

The research topic of this thesis is the clustering analysis of logs and the optimization of its efficiency, focusing on the distribution and analysis of log data, clustering analysis techniques, and big data processing and analysis techniques. The thesis proposes the Cascading-clustering algorithm, a distributed clustering algorithm for processing massive log data that follows a long-tail distribution. The algorithm is implemented using MapReduce. Its feasibility and validity are verified by experiments, and it is integrated into a log analysis system.

The main contents and contributions of this thesis are as follows:
1) Study and analysis of the characteristics of log data, in particular its long-tail distribution. The large number of mutually similar log entries belong to normal system operation, while the entries that reflect system anomalies and errors are distributed in the "long tail". Methods for handling massive log data are proposed, including log data sampling, analysis, and conversion.
2) Based on a study of various clustering algorithms and their advantages and disadvantages, and taking into account the long-tail distribution of log data, the thesis proposes the distributed Cascading-clustering algorithm.
3) Experiments on log data show the feasibility of the Cascading-clustering algorithm in terms of running time and space, demonstrate its advantages on log data with a long-tail distribution, and illustrate the reduction in workload that it achieves.
4) A log analysis system is built, comprising a historical sample library, a user interface, the core algorithm, and functional modules. The Cascading-clustering algorithm is integrated into this system.

I declare that this thesis is my own work, completed during my exchange at Microsoft Research Asia; the corresponding intellectual property rights belong to Microsoft.
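The abstract does not spell out the steps of the Cascading-clustering algorithm, so the following is only a minimal single-machine sketch of one plausible reading: repeatedly sample the remaining logs, derive patterns from the sample, and set aside every log that matches a pattern, so that the frequent "head" is removed early and the rare "long tail" is reached in later rounds. The function names (`signature`, `cascading_cluster`), the digit-masking conversion, and the `sample_size` parameter are all hypothetical and are not taken from the thesis; the MapReduce distribution is also not shown.

```python
import random
from collections import Counter


def signature(line):
    """Hypothetical log conversion: mask every token that contains a digit."""
    return " ".join(
        "<*>" if any(ch.isdigit() for ch in tok) else tok
        for tok in line.split()
    )


def cascading_cluster(lines, sample_size=100, seed=0):
    """Assumed sample -> cluster -> match loop (illustrative only).

    Each round draws a small random sample, treats the signatures found in
    it as cluster patterns, and removes all remaining lines that match one
    of those patterns. Unmatched lines are carried into the next round.
    """
    rng = random.Random(seed)
    remaining = list(lines)
    clusters = Counter()
    while remaining:
        sample = rng.sample(remaining, min(sample_size, len(remaining)))
        patterns = {signature(line) for line in sample}
        unmatched = []
        for line in remaining:
            sig = signature(line)
            if sig in patterns:
                clusters[sig] += 1
            else:
                unmatched.append(line)
        # Sampled lines always match their own pattern, so the set shrinks.
        remaining = unmatched
    return clusters


# Toy long-tail data: one dominant "normal" pattern and a few rare ones.
logs = [f"request {i} served in {i % 50}ms" for i in range(1000)]
logs += ["disk failure on node7"] * 3 + ["kernel panic"]
clusters = cascading_cluster(logs, sample_size=20)
for pattern, count in clusters.most_common():
    print(count, pattern)
```

In this sketch the matching pass over the remaining lines is independent per line, which is the part that would naturally be spread across workers in a MapReduce-style implementation.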
Keywords/Search Tags: Clustering analysis, log analysis, distributed computing, data mining