Research On System Log Anomaly Detection Based On Sentence Embedding And Anti-noise

Posted on:2023-10-30

Degree:Master

Type:Thesis

Country:China

Candidate:Y J Qian

GTID:2558307118999389

Subject:Software engineering

Abstract/Summary:

Anomaly detection is an essential tool for large-scale system event management; its purpose is to detect abnormal system behavior in time. It enables system developers (or operators) to identify and resolve problems promptly and thus reduces system downtime. Logging systems are deployed in almost all modern computer systems. System logs record detailed runtime information to support various system management and diagnostic tasks, such as ensuring application security, identifying performance anomalies, and diagnosing errors and crashes. Abnormal system logs often follow fixed patterns that can be used to diagnose the source of system problems and to predict potential ones. As a result, system logs are widely used as a data source for anomaly detection during the development and maintenance of many systems, and log-based anomaly detection has become a research topic of practical importance in academia and industry. However, existing log anomaly detection methods cannot fully utilize the contextual semantic information and parameter information in system logs, which limits their recognition accuracy. Moreover, system logs evolve as the logging system is updated: old log statements change and new ones emerge. These changes introduce noise relative to the original log data and degrade the performance of anomaly detection models trained on that data. This thesis studies these two problems in depth, and the main research work is as follows.

1) To address the insufficient recognition accuracy caused by log representations that do not fully consider contextual semantic information and ignore log parameters, this thesis investigates an anomaly detection scheme that combines sentence embedding and log parameters. To preserve the semantic information of parameter values, the proposed scheme replaces log parameters, of both text and numeric types, with parameter labels. It then uses sentence embedding to extract sentence-level semantic information from the system logs and generate log representations. In addition, the scheme uses long short-term memory (LSTM) networks to build a sequence-modeling anomaly detection network that combines sentence embeddings and numerical parameters to detect log anomalies. The scheme is evaluated on the HDFS (Hadoop Distributed File System) and BGL (IBM Blue Gene/L) log datasets. Comparison with baseline approaches shows that the scheme achieves a competitive F1 score relative to current log-based anomaly detection approaches. Ablation experiments confirm the effectiveness of sentence embedding and parameter values in log-based anomaly detection.

2) To address the degradation of model performance caused by noise that emerges as logs evolve, this thesis investigates an anti-noise log anomaly detection scheme based on contrastive loss and text data augmentation. The scheme uses a sentence embedding method based on contrastive loss to optimize the log representations. It generates positive pairs using a Transformer encoder based on pre-trained language models, and optimizes the log sentence embeddings by minimizing the contrastive loss. After training, the members of each positive pair are closer together, and the semantic vectors are more uniformly distributed overall. When noise is introduced into the log dataset, the log representation stays close to its original semantic representation, which reduces the performance degradation of the anomaly detection model. The scheme also uses a rule-based text data augmentation method and an augmentation method based on auxiliary log datasets to expand the data available for training. The rule-based method simulates in advance the changes that occur in log data during system log updates, which helps the model cope with log noise when it appears. The auxiliary log datasets add system-log domain knowledge to the log representation and anomaly detection models, which improves the robustness of the model and helps it resist noise. The scheme is evaluated for its improvement over the original model on the HDFS and BGL datasets and for its resistance to noise on a noisy log dataset.

In summary, this thesis improves on two shortcomings of existing deep learning-based log anomaly detection schemes: the lack of recognition accuracy caused by log representations that do not fully consider contextual semantic information and ignore log parameters, and the degradation of model performance caused by noise introduced by log data updates. The experimental results show that the proposed log anomaly detection scheme achieves high performance on mainstream log datasets and has practical value.
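
Two of the steps summarized in the abstract, replacing log parameters with parameter labels and optimizing sentence embeddings with a contrastive loss, can be illustrated with a minimal sketch. The abstract gives neither the label set nor the loss formula, so everything here is an assumption for illustration: the labels `<BLK>`, `<IP>`, and `<NUM>`, the regular expressions, the function names `label_parameters` and `contrastive_loss`, the InfoNCE-style form of the loss with in-batch negatives, and the temperature value. It is not the thesis's implementation.

```python
import math
import re

# Hypothetical label set: the abstract does not list the thesis's actual labels.
# Order matters: specific patterns run before the generic number pattern.
PARAM_PATTERNS = [
    (re.compile(r"\bblk_-?\d+\b"), "<BLK>"),                        # HDFS-style block IDs
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?\b"), "<IP>"),  # IPv4 address, optional port
    (re.compile(r"\b\d+\b"), "<NUM>"),                              # remaining standalone integers
]


def label_parameters(message):
    """Replace parameter values with type labels.

    Returns the labelled message plus the standalone numeric values, so that
    a downstream model could still use them as numerical features.
    """
    numeric_values = []
    for pattern, label in PARAM_PATTERNS:
        if label == "<NUM>":
            # Collect the numbers that survive the more specific patterns.
            numeric_values = [float(m) for m in pattern.findall(message)]
        message = pattern.sub(label, message)
    return message, numeric_values


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def contrastive_loss(anchors, positives, temperature=0.05):
    """InfoNCE-style loss with in-batch negatives (an assumed formulation).

    positives[i] is the positive for anchors[i]; every other row of
    `positives` serves as a negative for anchors[i].
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        peak = max(logits)  # subtract the maximum for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


if __name__ == "__main__":
    line = "Received block blk_-1608999687919862906 of size 91178 from 10.250.10.6"
    print(label_parameters(line))
```

In this sketch the loss is small when each anchor is most similar to its own positive and large when it is more similar to another row, which matches the abstract's description of positive pairs being pulled closer while the embeddings spread out overall. In practice the vectors would come from a sentence encoder rather than hand-written lists.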

Keywords/Search Tags:

Log-based anomaly detection, Sentence embedding, Contrastive learning, Data augment

Related items

1	Research On Sentence Representation Based On Contrastive Learning And Deep Neural Network
2	Research On Anomaly Detection Of Log Data Based On Contrastive Learning And Word Embedding
3	Open Intent Detection Based On Prototype Contrastive Learning
4	Detection Method Of Poisoning Attack In Recommender Systems Based On Graph Embedding And Anomaly Detection
5	Research Of Log Anomaly Detection Method Based On Sentence-BERT
6	Unsupervised Sentence Embedding With Prompt Learning And Sample Filter
7	Sentence-embedding And Similarity Via Hybrid Bidirectional-LSTM And CNN Utilizing Weighted-pooling Attention
8	A Research Of Unsupervised Image Anomaly Detection Method With Deep Learning
9	Anomaly Detection Of Multivariate Time Series Data Based On Representation Learning
10	Sentence Representation Learning Based On Redundant Information Filtering Of Pre-training Data