High-dimensional Data Anomaly Detection Based On Generative Model

Posted on: 2022-02-07
Degree: Master
Type: Thesis
Country: China
Candidate: P Lv
Full Text: PDF
GTID: 2518306488466594
Subject: Engineering
Abstract/Summary:
Anomaly detection is a fundamental and therefore well-studied problem in data mining and machine learning. Its task is to identify the objects that differ significantly from the majority of objects in the data space. Anomaly detection is a core issue in many fields, such as network security, industrial manufacturing, system management, medical analysis, and marine environment monitoring. In practical applications, anomaly detection often involves large amounts of high-dimensional data. Whether the data is high-dimensional or multi-dimensional, the core of anomaly detection is density estimation. In general, normal data is abundant and follows a certain distribution, while abnormal data is scarce and scattered, so anomalies reside in low-density regions. Anomaly detection based on deep generative models has made significant progress in both academia and industry. To detect abnormal objects in massive high-dimensional data more accurately and efficiently, this thesis explores two completely different generative models and proposes an anomaly detection method based on each. The research content is as follows:

In recent years, with the widespread popularity of various types of smart mobile devices, applications such as social networks, online shopping, mobile payment, and location services have continued to emerge. Various types of big data have been collected and processed, and mining and analysis services for this data have rapidly become a distinct emerging industry. As one of the most important tasks of data mining, anomaly detection is of vital importance in applications such as network monitoring and credit card fraud detection. In addition, real-world data distributions tend to be skewed, and the local outlier factor effectively addresses outlier detection in skewed datasets, showing remarkable detection performance in a variety of applications. Therefore, local outlier detection has received more and more attention in both academia and industry. To detect abnormal objects in large amounts of data more efficiently and quickly, two density-based local outlier detection algorithms are proposed in this thesis. The main research content is as follows:

(1) The thesis proposes a layer-constrained variational autoencoding kernel density estimation model (LAKE) for anomaly detection in high-dimensional data. LAKE consists of two main parts: the compression network and the KDE model. First, the compression network uses a layer-constrained variational autoencoder to obtain a low-dimensional representation that retains the key features. Second, the KDE model takes the low-dimensional representation and the reconstruction error features as inputs and learns a probability density distribution of the training samples. Finally, the density of each test sample is estimated by the KDE model trained on the training samples, and the objects with the lowest KDE values are reported as anomalies. Experimental results on six public benchmark datasets show that the proposed LAKE significantly outperforms state-of-the-art methods, improving the standard F1 score by up to 37%.

(2) The thesis proposes an effective Anomaly Detection model based on Autoregressive Flow (ADAF). The key idea is to unify the distribution mapping capability of flow-based models with the neural density estimation power of autoregressive models. First, an autoregressive flow-based model is designed to infer the latent variables of the input data by minimizing a combination of latent error and neural density. Second, ADAF estimates the neural density of the input data naturally, together with the latent variable inference, rather than through an additional stitched density estimation network. Finally, unlike stitched, decoupled models, ADAF optimizes a single set of network parameters by balancing latent error and neural density estimation in a unified training procedure, which effectively separates out the anomalies. Experimental results on six public benchmark datasets show that ADAF outperforms state-of-the-art anomaly detection techniques, improving the standard F1 score by up to 20%.
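
The density-scoring step described for LAKE can be illustrated with a minimal sketch. This is not the thesis implementation: the compression network is omitted, the feature vectors (one latent coordinate plus one reconstruction error) are synthetic stand-ins, and the Gaussian kernel, the fixed bandwidth of 0.5, and the function names are all assumptions made for illustration. The sketch only shows how a KDE fitted on training features assigns a density to each test sample and how the lowest-density samples are reported.

```python
import math
import random


def gaussian_kde_log_density(train, x, bandwidth):
    """Log-density at point x under a Gaussian KDE fitted on `train`."""
    d = len(x)
    # Normalising constant of an isotropic Gaussian kernel, averaged over samples.
    log_norm = -0.5 * d * math.log(2 * math.pi * bandwidth ** 2) - math.log(len(train))
    # Exponent of each kernel; combined with log-sum-exp for numerical stability.
    exps = [
        -sum((a - b) ** 2 for a, b in zip(x, t)) / (2 * bandwidth ** 2)
        for t in train
    ]
    m = max(exps)
    return log_norm + m + math.log(sum(math.exp(e - m) for e in exps))


def lowest_density_indices(train, test, bandwidth, k):
    """Return the indices of the k test points with the lowest KDE density."""
    scores = [gaussian_kde_log_density(train, x, bandwidth) for x in test]
    order = sorted(range(len(test)), key=lambda i: scores[i])
    return order[:k], scores


# Synthetic stand-ins for [latent coordinate, reconstruction error] features.
# In the described method these would come from the trained compression network.
rng = random.Random(0)
train_features = [[rng.gauss(0, 1), abs(rng.gauss(0, 0.1))] for _ in range(200)]
test_features = [[0.1, 0.05], [-0.4, 0.12], [6.0, 2.5]]

# Bandwidth 0.5 is an arbitrary illustrative choice, not a value from the thesis.
flagged, scores = lowest_density_indices(train_features, test_features, bandwidth=0.5, k=1)
print("flagged test indices:", flagged)
```

In this toy setup the third test point lies far from every training feature in both coordinates, so it receives the lowest density and is the one flagged. In practice the bandwidth and the number of reported anomalies would need to be chosen for the dataset at hand.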
Keywords/Search Tags:Outlier detection, High-dimensional data, Generative model, Variational autoencoder, Kernel density estimation, Flow model