
The Design And Implementation Of A Log Real-time Analysis System Based On ELK Stack And Spark

Posted on: 2022-10-12
Degree: Master
Type: Thesis
Country: China
Candidate: H H Ma
Full Text: PDF
GTID: 2518306740983199
Subject: Software engineering

Abstract/Summary:
The continuous development of social informatization and Internet technology has brought the world into the era of big data. The variety and scale of information are exploding, and services within communication enterprises are becoming more complex, so the resulting volume of logs challenges traditional log-processing architectures. Existing solutions access logs stored on different machines for offline batch processing, with limited storage capacity and processing speed. Therefore, to manage the massive service-access and SDWAN controller logs from the production environment in real time, it is urgent to design and implement an effective log analysis system.

This thesis designs and implements a real-time log analysis system based on the ELK Stack and Spark. First, a log collection module based on Logstash gathers various logs from the server cluster. Next, a storage module based on Elasticsearch persists the massive log data while providing search services. Then, a log analysis module based on Spark performs statistical computation and log clustering. Finally, a visualization module provides end users with a friendly visual interface via a Web service.

The work of this thesis mainly includes the following four points:
1) The PKM++ clustering model is designed for business logs. The PKM++ algorithm, proposed for the SDWAN controller's API logs, combines PCA with K-Means++ and introduces the Knuth shuffle algorithm. The Knuth shuffle increases the randomness of the initial centroid selection, improving the quality of log clustering in the log analysis module.
2) A randomized optimization scheme for RDD keys is designed and verified. Based on the Spark RDD partitioning mechanism, a randomized prefix-and-suffix strategy for RDD keys is proposed, and its effect in alleviating data skew is verified experimentally.
3) The functional and performance requirements of the log analysis system are analyzed and implemented. The system is divided into four functional modules according to these requirements. The top-level design and overall architecture of the system are introduced, and the detailed implementation of each functional module is elaborated. The system provides a full chain of log processing functions, from log collection and storage to analysis and visualization.
4) The functionality and performance of the system are tested. The results show that a full-chain operation of the system takes less than 2 seconds, meeting the real-time constraint with moderate memory consumption; the improved PKM++ model achieves a silhouette coefficient of 0.5232 on the same dataset, a 12% improvement over native K-Means++; and the randomized prefix-and-suffix strategy reduces the execution time of Spark applications by 18%, indicating that it alleviates the performance degradation of Spark applications under data skew.
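The abstract does not give the algorithmic details of PKM++; a minimal sketch of the assumed pipeline (PCA for dimensionality reduction, then K-Means++ seeding whose candidate order is randomized by a Knuth shuffle) might look as follows. All function names and the choice of a pure-NumPy implementation are illustrative, not the thesis's actual code.

```python
import numpy as np

def knuth_shuffle(items, rng):
    # Fisher-Yates / Knuth shuffle: uniform random permutation in place.
    for i in range(len(items) - 1, 0, -1):
        j = rng.integers(0, i + 1)
        items[i], items[j] = items[j], items[i]
    return items

def pca_reduce(X, n_components):
    # Center the data and project onto the top principal components.
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:n_components].T

def kmeanspp_init(Z, k, rng):
    # K-Means++ seeding; the candidate order is randomized with a Knuth
    # shuffle before the first centroid is drawn (assumed PKM++ variant).
    idx = knuth_shuffle(list(range(len(Z))), rng)
    centers = [Z[idx[0]]]
    for _ in range(1, k):
        # Each point is sampled with probability proportional to its
        # squared distance to the nearest chosen centroid.
        d2 = np.min([((Z - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(Z[rng.choice(len(Z), p=d2 / d2.sum())])
    return np.array(centers)

def pkm_cluster(X, k, n_components=2, n_iter=20, seed=0):
    # PCA projection followed by Lloyd iterations from the seeded centroids.
    rng = np.random.default_rng(seed)
    Z = pca_reduce(np.asarray(X, dtype=float), n_components)
    centers = kmeanspp_init(Z, k, rng)
    for _ in range(n_iter):
        labels = np.argmin(((Z[:, None] - centers) ** 2).sum(axis=2), axis=1)
        centers = np.array([Z[labels == c].mean(axis=0)
                            if np.any(labels == c) else centers[c]
                            for c in range(k)])
    return labels, centers
```

In practice the API log lines would first be parsed into numeric feature vectors (the thesis does not specify the feature scheme here) before being passed to `pkm_cluster`.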
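The randomized prefix-and-suffix strategy is a form of key salting: a hot key is split across several salted keys so its records spread over multiple partitions, then the partial results are merged after the salt is stripped. Independent of Spark, the core two-stage transformation can be sketched in plain Python (in Spark these stages would correspond to a `map` followed by two `reduceByKey` passes; the salt count `N_SALTS` and all names below are illustrative assumptions, not values from the thesis).

```python
import random
from collections import defaultdict

N_SALTS = 4  # number of random prefixes (assumed; the thesis does not fix a value)

def add_salt(pairs, rng):
    # Stage 1: prepend a random numeric prefix to each key, so records
    # sharing a hot key are spread across N_SALTS distinct salted keys
    # (and hence across partitions in Spark).
    return [(f"{rng.randrange(N_SALTS)}#{k}", v) for k, v in pairs]

def partial_reduce(salted_pairs):
    # Stage 2: reduce by salted key; each hot key's load is now split
    # N_SALTS ways instead of landing on a single partition.
    acc = defaultdict(int)
    for k, v in salted_pairs:
        acc[k] += v
    return acc

def final_reduce(partial):
    # Stage 3: strip the salt and reduce again to recover per-key totals.
    acc = defaultdict(int)
    for salted_key, v in partial.items():
        _, key = salted_key.split("#", 1)
        acc[key] += v
    return dict(acc)

# Usage: a skewed workload where one key dominates.
rng = random.Random(42)
pairs = [("hot", 1)] * 1000 + [("cold", 1)] * 10
totals = final_reduce(partial_reduce(add_salt(pairs, rng)))
```

The final totals are identical to an unsalted reduce; only the intermediate distribution of work changes, which is what alleviates the straggler partitions caused by data skew.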
Keywords/Search Tags: Spark, Log analysis, Distributed processing, Data skew