
The Design And Implementation Of A Log Real-time Analysis System Based On ELK Stack And Spark

Posted on: 2022-10-12
Degree: Master
Type: Thesis
Country: China
Candidate: H H Ma
Full Text: PDF
GTID: 2518306740983199
Subject: Software engineering

Abstract/Summary:
The continuous development of social informatization and Internet technology has brought the world into the era of big data. The variety and scale of information are exploding, and services within communication enterprises are becoming more complex, so the resulting volume of logs challenges traditional log-processing architectures. Existing solutions access logs stored on different machines for offline batch processing, with limited storage capacity and processing speed. Therefore, to manage the massive service-access and SDWAN controller logs from the production environment in real time, it is urgent to design and implement an effective log analysis system.

This thesis designs and implements a real-time log analysis system based on the ELK Stack and Spark. First, a log collection module based on Logstash gathers various logs from the server cluster. Next, a storage module based on Elasticsearch persists the massive log data while providing search services. Then, a log analysis module based on Spark performs statistical computation and log clustering. Finally, a visualization module provides end users with a friendly visual interface via a Web service.

The work of this thesis mainly includes the following four points:
1) The PKM++ clustering model is designed for business logs. The PKM++ algorithm, proposed for the SDWAN controller's API logs, combines PCA with K-Means++ and introduces the Knuth shuffle algorithm. The Knuth shuffle increases the randomness of the initial centroid selection, improving the quality of log clustering in the log analysis module.
2) A randomized optimization scheme for RDD keys is designed and verified. Based on the Spark RDD partitioning mechanism, a randomized prefix-and-suffix strategy for RDD keys is proposed, and its effect in alleviating data skew is verified experimentally.
3) The functional and performance requirements of the log analysis system are analyzed and implemented. The system is divided into four functional modules according to these requirements. The top-level design and overall architecture of the system are introduced, and the detailed implementation of each functional module is elaborated. The system provides a full chain of log processing functions, from log collection and storage to analysis and visualization.
4) The functionality and performance of the system are tested. The results show that a full-chain operation of the system takes less than 2 seconds, meeting the real-time constraint with moderate memory consumption; the improved PKM++ model achieves a silhouette coefficient of 0.5232 on the same dataset, a 12% improvement over native K-Means++; and the randomized prefix-and-suffix strategy reduces the execution time of Spark applications by 18%, indicating that it alleviates the performance degradation of Spark applications under data skew.
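The abstract does not give the algorithmic details of PKM++; a minimal sketch of the assumed pipeline (PCA for dimensionality reduction, then K-Means++ seeding whose candidate order is randomized by a Knuth shuffle) might look as follows. All function names and the choice of a pure-NumPy implementation are illustrative, not the thesis's actual code.

```python
import numpy as np

def knuth_shuffle(items, rng):
    # Fisher-Yates / Knuth shuffle: uniform random permutation in place.
    for i in range(len(items) - 1, 0, -1):
        j = rng.integers(0, i + 1)
        items[i], items[j] = items[j], items[i]
    return items

def pca_reduce(X, n_components):
    # Center the data and project onto the top principal components.
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:n_components].T

def kmeanspp_init(Z, k, rng):
    # K-Means++ seeding; the candidate order is randomized with a Knuth
    # shuffle before the first centroid is drawn (assumed PKM++ variant).
    idx = knuth_shuffle(list(range(len(Z))), rng)
    centers = [Z[idx[0]]]
    for _ in range(1, k):
        # Each point is sampled with probability proportional to its
        # squared distance to the nearest chosen centroid.
        d2 = np.min([((Z - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(Z[rng.choice(len(Z), p=d2 / d2.sum())])
    return np.array(centers)

def pkm_cluster(X, k, n_components=2, n_iter=20, seed=0):
    # PCA projection followed by Lloyd iterations from the seeded centroids.
    rng = np.random.default_rng(seed)
    Z = pca_reduce(np.asarray(X, dtype=float), n_components)
    centers = kmeanspp_init(Z, k, rng)
    for _ in range(n_iter):
        labels = np.argmin(((Z[:, None] - centers) ** 2).sum(axis=2), axis=1)
        centers = np.array([Z[labels == c].mean(axis=0)
                            if np.any(labels == c) else centers[c]
                            for c in range(k)])
    return labels, centers
```

In practice the API log lines would first be parsed into numeric feature vectors (the thesis does not specify the feature scheme here) before being passed to `pkm_cluster`.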
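The randomized prefix-and-suffix strategy is a form of key salting: a hot key is split across several salted keys so its records spread over multiple partitions, then the partial results are merged after the salt is stripped. Independent of Spark, the core two-stage transformation can be sketched in plain Python (in Spark these stages would correspond to a `map` followed by two `reduceByKey` passes; the salt count `N_SALTS` and all names below are illustrative assumptions, not values from the thesis).

```python
import random
from collections import defaultdict

N_SALTS = 4  # number of random prefixes (assumed; the thesis does not fix a value)

def add_salt(pairs, rng):
    # Stage 1: prepend a random numeric prefix to each key, so records
    # sharing a hot key are spread across N_SALTS distinct salted keys
    # (and hence across partitions in Spark).
    return [(f"{rng.randrange(N_SALTS)}#{k}", v) for k, v in pairs]

def partial_reduce(salted_pairs):
    # Stage 2: reduce by salted key; each hot key's load is now split
    # N_SALTS ways instead of landing on a single partition.
    acc = defaultdict(int)
    for k, v in salted_pairs:
        acc[k] += v
    return acc

def final_reduce(partial):
    # Stage 3: strip the salt and reduce again to recover per-key totals.
    acc = defaultdict(int)
    for salted_key, v in partial.items():
        _, key = salted_key.split("#", 1)
        acc[key] += v
    return dict(acc)

# Usage: a skewed workload where one key dominates.
rng = random.Random(42)
pairs = [("hot", 1)] * 1000 + [("cold", 1)] * 10
totals = final_reduce(partial_reduce(add_salt(pairs, rng)))
```

The final totals are identical to an unsalted reduce; only the intermediate distribution of work changes, which is what alleviates the straggler partitions caused by data skew.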
Keywords/Search Tags: Spark, Log analysis, Distributed processing, Data skew