
Research On Outliers Detection In Data Stream Based On Unsupervised Learning

Posted on: 2019-05-09
Degree: Master
Type: Thesis
Country: China
Candidate: Z R Li
GTID: 2428330611993321
Subject: Computer software and theory
Abstract/Summary:
In recent years, with the rapid development of Internet technology and the continuing spread of mobile devices, we have entered the era of big data. In more and more fields of production and daily life, data that used to be static and fixed is now generated in the form of data streams. However, the characteristics of stream data itself (such as missing labels, continuous arrival, and dynamic change) mean that methods designed for static data sets do not apply well to anomaly detection in a data stream environment, so anomaly detection over massive data based on unsupervised learning has great research significance and application value. Although many unsupervised data stream anomaly detection methods exist, each has its own defects and struggles to handle all data types: they often perform well only on specific data sets and can hardly meet the practical needs of diverse data. After examining a large number of data sets and comparing the performance of current mainstream methods on them, this paper finds that data can be roughly divided into two classes: dimensionally weakly correlated data, with little correlation between dimensions and few "noise dimensions", and dimensionally strongly correlated data, with large correlations between dimensions, complex distributions, and possibly many "noise dimensions". Based on this, this paper designs a reasonable and effective data type recognition algorithm. According to the differences between the data types and their processing requirements, two anomaly detection methods suited to different types of data sets are proposed, and both show significant advantages over comparable methods. In addition, this paper designs a general multi-type data stream anomaly detection system that integrates these two methods with an automatic selection strategy to achieve fast and accurate outlier detection for all types of data.

Because each current data stream anomaly detection technique based on unsupervised learning has its own defects and struggles to handle all data types, this paper designs ADCS, an automatic data classification and method selection strategy based on a correlation matrix. ADCS first analyzes the correlation between the dimensions of the data and builds the correlation matrix C. It then performs singular value decomposition on C and, following the idea of dimensionality reduction, finds the K-order approximate matrix C_K of C that meets the requirements. Finally, it decides whether the data is dimensionally strongly correlated or dimensionally weakly correlated by checking whether the ratio of the dimension of C_K to the number of dimensions of C exceeds a given threshold.

For online anomaly detection on strongly correlated data, where correlations between dimensions are large and noise is abundant, an efficient and accurate model is needed. An ideal anomaly detection model should meet three requirements: (1) the model must accurately approximate the data distribution using only a small amount of storage; (2) the model should have a credible and effective anomaly detection strategy; (3) the model should be able to update itself by reading the data only once, so that it can follow the changing distribution of the data stream. Although data stream anomaly detection has been studied extensively, none of the existing solutions fully satisfies all three requirements. This paper therefore proposes NODF-MaS, a novel matrix-based anomaly detection framework, which greatly improves both the accuracy and the efficiency of anomaly detection. Specifically, NODF-MaS uses data-dependent multi-view segmentation to map the data distribution accurately, and designs a distributed detection system and an integration method to ensure prediction accuracy. In addition, NODF-MaS uses matrix sketching techniques to reduce computational cost and improve response speed. Experiments show that NODF-MaS not only responds faster than the algorithms it is compared against, but also maintains high-precision outlier detection. Compared with existing algorithms, the running speed of NODF-MaS increases by 32% to 60%, and its detection accuracy stays at a consistently high level; on some data sets with more complex distributions it even reaches 3 times that of other algorithms, a level that current mainstream algorithms find difficult to achieve.

For dimensionally weakly correlated data, with low correlation between dimensions and little noise, distance-based methods perform very well in both accuracy and response speed, and they are simple and intuitive. Combined with a suitable pre-processing step, they can also be extended to many application areas. However, current distance-based algorithms still have a fatal flaw: most of them rely on a "sliding window" to update the model incrementally, which makes it hard for the anomaly detection model to adapt to the dynamically changing distribution of a data stream. To solve this problem, this paper proposes FROD, an anomaly detection method based on an active interior point model and a micro-cluster structure. Specifically, FROD uses the active interior point model AIP to dynamically select representative data objects to retain for subsequent outlier analysis. An effective micro-cluster-based storage structure and its update method are proposed to maintain the data in AIP and improve detection efficiency. The paper also analyzes the time complexity of FROD and theoretically proves its superiority over similar methods. Experiments show that FROD adapts better to dynamic changes in the data distribution and that, at the same detection accuracy, FROD leads the other methods by one to two orders of magnitude.

To further verify these theoretical results, this paper designs and implements GMSOD, a general multi-type data stream anomaly detection prototype system built on Storm, a distributed stream processing platform. GMSOD consists of a data type identification module, a detection module for dimensionally strongly correlated data, and a detection module for dimensionally weakly correlated data, and it adopts a "dispatch-aggregation" mechanism as the logical architecture of the two distributed data stream anomaly detection algorithms. Experiments show that traditional methods perform well only on specific data sets, whereas GMSOD automatically switches algorithms to maintain detection accuracy and consistently exceeds the other methods in data processing speed. GMSOD therefore adapts better to the various types of data found in data streams and consistently delivers excellent anomaly detection performance.
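
The ADCS steps described in the abstract (build a correlation matrix, take its singular value decomposition, pick a K-order approximation, compare a ratio against a threshold) can be illustrated with a minimal Python sketch. This is not the thesis implementation. The function name `classify_stream_window`, the use of NumPy, the cumulative singular-value "energy" criterion for choosing K, the default values 0.9 and 0.5, and the direction of the final comparison (a small ratio K/d is treated as strong correlation) are all assumptions made here for illustration; the abstract specifies none of them.

```python
import numpy as np


def classify_stream_window(X, energy=0.9, ratio_threshold=0.5):
    """Label a window of stream data as 'strong' or 'weak' correlation.

    X is an (n_samples, n_dims) array with no constant columns.
    `energy` and `ratio_threshold` are hypothetical parameters chosen
    for this sketch; they do not come from the thesis.
    """
    X = np.asarray(X, dtype=float)
    d = X.shape[1]

    # Step 1: correlation matrix C between the d dimensions (d x d).
    C = np.corrcoef(X, rowvar=False)

    # Step 2: singular values of C, returned in descending order.
    s = np.linalg.svd(C, compute_uv=False)

    # Step 3 (assumed criterion): smallest K whose leading singular
    # values hold at least `energy` of the total singular-value mass.
    cumulative = np.cumsum(s) / np.sum(s)
    k = min(int(np.searchsorted(cumulative, energy)) + 1, d)

    # Step 4 (assumed direction): a low ratio K/d means a few directions
    # explain most of C, i.e. the dimensions are strongly correlated.
    ratio = k / d
    label = "strong" if ratio < ratio_threshold else "weak"
    return label, k, ratio


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Ten dimensions driven by two latent factors plus small noise.
    latent = rng.standard_normal((2000, 2))
    mixing = rng.standard_normal((2, 10))
    correlated = latent @ mixing + 0.05 * rng.standard_normal((2000, 10))
    # Ten mutually independent dimensions.
    independent = rng.standard_normal((2000, 10))

    print(classify_stream_window(correlated))
    print(classify_stream_window(independent))
```

In a system such as the one the abstract describes, a label like this would decide which of the two detectors receives the stream; how often the classification is recomputed as the stream drifts is a design choice the abstract does not address.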
Keywords/Search Tags:Data Stream, Outliers Detection, Concept Drift, Data Mining, Matrix Sketch, High Performance Computing