
Research On Outliers Mining Algorithm Based On Data Streams With Different Attributes

Posted on: 2011-05-29  Degree: Master  Type: Thesis
Country: China  Candidate: Q H Wu  Full Text: PDF
GTID: 2178360302994579  Subject: Computer application technology
Abstract/Summary:
By analyzing the state of research on outlier mining over data streams in China and abroad, we found that existing outlier detection algorithms have several problems and can be inefficient in the following respects. First, most existing outlier mining algorithms ignore the categorical attributes of heterogeneous data streams. Second, simple algorithms for detecting outliers over categorical data streams do not use reasonable weights, so the detected outliers deviate from the real ones. Solving these problems matters for financial fraud detection, network intrusion detection, weather forecasting, and other risk control areas.
First, we propose an efficient outlier detection algorithm for heterogeneous data streams (HDSOD), which partitions the stream into chunks. Each chunk is clustered, and the clustering results are stored in cluster references. For each cluster reference, the representation degree and the number of adjacent cluster references are computed to generate the final outlier references, which contain the potential outliers. HDSOD remains effective at detecting outliers in heterogeneous data streams when memory is limited.
Second, we propose CFPOD-Stream, a novel approach for detecting outliers in categorical data streams. The algorithm defines a weighted closed frequent pattern outlier factor to measure complete transactions. By discovering and maintaining closed frequent patterns, it computes the outlier measure of each transaction and uses it to identify outliers. In addition, we adopt a fading query indexed structure to handle concept drift, so that outliers are detected efficiently.
Lastly, the outlier detection method is applied to software vulnerability analysis, with corresponding improvements made to CFPOD-Stream for determining outliers. The outliers are the data that contain relatively few closed frequent itemsets. To explain why the detected outlier transactions are infrequent, the contradictory closed frequent patterns of each outlier are identified.
We implemented the three algorithms above in C++. All experiments were performed on the real-life KDD-CUP-99 dataset and on a synthetic dataset. The experimental results show the feasibility and effectiveness of our algorithms.
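As a rough illustration of scoring transactions against closed frequent patterns, the sketch below computes a weighted outlier factor in Python. The abstract does not give the thesis's actual formula, so everything here is an assumption made for illustration: the function names `outlier_factor` and `rank_outliers`, the weight `len(pattern) / len(transaction)`, and the normalization by the number of patterns. It also omits the fading structure used for concept drift. Transactions that contain few closed frequent patterns receive low scores and are ranked first as outlier candidates.

```python
from typing import Dict, FrozenSet, Iterable, List, Sequence

Pattern = FrozenSet[str]


def outlier_factor(transaction: Iterable[str],
                   closed_patterns: Dict[Pattern, float]) -> float:
    """Score a transaction; lower values suggest an outlier.

    closed_patterns maps each closed frequent pattern to its support in [0, 1].
    The weight len(pattern) / len(transaction) is an illustrative assumption,
    not the weighting defined in the thesis.
    """
    items = frozenset(transaction)
    if not items or not closed_patterns:
        return 0.0
    total = 0.0
    for pattern, support in closed_patterns.items():
        if pattern <= items:  # the pattern is fully contained in the transaction
            total += support * len(pattern) / len(items)
    # Normalize by the number of patterns so scores are comparable across pattern sets.
    return total / len(closed_patterns)


def rank_outliers(transactions: Sequence[Iterable[str]],
                  closed_patterns: Dict[Pattern, float],
                  k: int) -> List[int]:
    """Return the indices of the k transactions with the lowest scores."""
    scored = [(outlier_factor(t, closed_patterns), i)
              for i, t in enumerate(transactions)]
    scored.sort()  # ascending: lowest score (most outlying) first
    return [i for _, i in scored[:k]]


# Hypothetical patterns and supports, as if mined from one window of the stream.
patterns = {
    frozenset({"a"}): 0.8,
    frozenset({"a", "b"}): 0.6,
    frozenset({"c"}): 0.5,
}
window = [["a", "b"], ["a", "b", "c"], ["d", "e"]]
candidates = rank_outliers(window, patterns, k=1)
```

In this toy window the transaction `["d", "e"]` contains none of the patterns, so it scores 0 and is the first candidate. In a streaming setting the pattern supports would be updated as the window slides, which this sketch leaves out.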
Keywords/Search Tags: Data stream, Outlier detection, Heterogeneous attributes, Closed frequent itemsets, Sliding window