Font Size: a A A

Study On Outlier Mining Algorithms Based On Clustering

Posted on:2011-06-03Degree:DoctorType:Dissertation
Country:ChinaCandidate:X K SuFull Text:PDF
GTID:1118330332486344Subject:Pattern Recognition and Intelligent Systems
Abstract/Summary:PDF Full Text Request
Identifying rare, special abnormal data is more valuable than identifying the normal data in many application fields, such as the network security, credit card fraud detection, disaster weather prediction and so on. The abnormal data are obviously different from the routine observations and may contain very important information. They represent an aberration or the beginning of a new pattern which can bring researchers a new perspective and lead to the appearance of new theories or new applications. From the data analysis point of view, recognizing the abnormal data can be considered as the outlier mining. It provides new methods of analyzing all kinds of massive, complex data with noise. With the development of application, the size and the dimension of data for outlier mining is increasing rapidly. The existing algorithms are mainly used for the small size dataset, difficult to adapt to the large-scale dataset.In the field of pattern recognition outlier mining can be taken as a special kind of classification problem. Unsupervised classification method--clustering is an important data analysis and processing technique as a branch of pattern recognition. The characteristic which does not require any priori knowledge is more suitable for large-scale dataset. This leads to the study on clustering-based outlier mining and many corresponding algorithms are put forward. However, there still exist many problems that should be further studied. In this thesis, we study the problem from different aspects. The major work of this thesis can be summarized as follows:Because the clustering technology lays the foundation of high performance outlier mining the issue of outlier mining base on the arbitrary shape clustering is explored firstly. Most of the existing outlier mining algorithms based on the clustering take the small size clusters as the outliers directly. Due to the restriction of spherical clustering some small size clusters may be the boundary of arbitrary shape normal clusters. In order to improve the accuracy of mining results it is very important to study arbitrary shape clustering for outlier mining. Two kinds of arbitrary shape clustering algorithms OBASC and EASSC are introduced. The algorithm OBASC can find arbitrary shape clusters in the small-scale dataset and input one threshold value parameter only. An inter-cluster dissimilarity measure and a similarity degree definition between an object and a cluster are proposed. The algorithm EASSC presents an inter-cluster similarity measure according to the improved Gaussian kernel function, capable of handling large-scale and high-dimensional dataset with different density. Proposed algorithms are the basis for designing efficient outlier mining algorithm. They can be taken as the pre-processed step. The dataset is clustered firstly according to the arbitrary shape clustering algorithm and the global outliers are found in term of the result clusters. However it is difficult to determine the parameters because appropriate value is gained through repeated probing. An outlier mining algorithm CEBOM based on the clustering ensemble is introduced in order to reduce the reliance for users and decrease the high false positive rate due to taking the small size clusters as the outliers directly. Outliers can be found according to the abnormal frequency of every record. The algorithm is able to provide the user a more "friendly" operation. The experimental results on the synthetic and real-life datasets show that the proposed algorithms are feasible and effective comparing to other classical algorithms and can be used for mixed dataset.Though the outlier mining algorithms based on the arbitrary shape clustering can identify the outliers exactly the time complexity is relatively high. When dealing with large-scale dataset the algorithm need frequent internal and external data exchange, resulting in intolerable time complexity. In view of the time efficiency two outlier mining algorithms ICBOM and SNNOM based on the incremental clustering are proposed for large-scale dataset with mixed attributes. ICBOM introduces and discusses the concept of outlier cluster which can effectively identify the boundary outliers and the internal outliers. The algorithm firstly filters out a large number of normal data via partitioning the dataset into several clusters. Outliers are then detected from the cluster set. At the same time the parameter values are discussed. The algorithm SNNOM calculates the shared nearest neighbor similarity measure between result clusters caused by the incremental clustering. It can not only find the arbitrary shape clusters but also identify the global outlier in large-scale and high-dimensional dataset with different density. Theoretical analysis and. experimental results show presented approaches have nearly linear time complexity with the number of attributes and the size of dataset which results in good scalability. They outperform the existing methods on identifying meaningful and interesting outliers.The outliers identified by the algorithms CEBOM,ICBOM and SNNOM are global outliers. However the reality is complicated and changes often. The collected datasets are always incomplete, especially under the circumstance of the dynamic data stream with the temporal characteristics. The user only concerns with the local instability. Outlier mining in data stream poses great challenges due to the limited memory availability and real time detection requirement. Based on the thought of "on-line clustering, off-line outlier mining", two kinds of outlier mining algorithms DMDSOM and SWMSOM based on different models are introduced for data stream. The algorithm DMDSOM clusters the data stream incrementally based on the damped model and generates the cluster features on behalf of the data distribution. The radius threshold value changes dynamically. When detection requirement is received the outlier factor of specified clusters is calculated and the clusters with high outlier factor are taken as the abnormal clusters which is a time-saving strategy. At the same time the method is proposed to distinguish between the abnormal cluster and the initial stage of data evolution. The algorithm SWMSOM is based on the sliding window model and clusters the data stream through the macro-cluster and the micro-cluster with the time stamp. The collection of statistical information representing the data distribution is maintained dynamically. The multi-granularity deviation factors of the candidate abnormal macro clusters are calculated and they are sorted in descending order, finally the macro clusters with the highest deviation degree are reported to be the true abnormal clusters uncovered. A method determining the radius value of multi-granularity deviation factor is presented. Theoretical analysis shows the time complexity and the space complexity of DMDSOM and SWMSOM are nearly linear with the size of data stream. The experimental results on the KDDCUP99 dataset demonstrate that proposed methods can effectively detect the local outliers in mixed data streams and are the useful supplement and improvement to existing algorithms.Network security becomes more and more important with the rapid development of computer networks. Intrusion detection is regarded as one of applications about outlier mining, normal behavior and intrusion behavior separate from each other, so clustering technique can be used. Firstly an intrusion detection algorithm based on the semi-supervised artificial immune clustering is introduced. It combines the guidance performance of the semi-supervised learning which only uses a small amount of data marked with the artificial immune model which can be used for compressing data and extracting the features. The algorithm clusters the training set and marks the result clusters as normal behaviors or intrusion behaviors in accordance with a small amount of data marked. By this means the classification model is established. The intrusion behaviors can be detected based on the classification model. However in the process of clustering frequent cloning and mutation are operated. The time complexity is very high. While the network data grows dynamically and can be seen as the data stream due to the feature of high-speed and infinite arrival. A weighted fuzzy clustering-based intrusion detection algorithm for KDDCUP99-10% network intrusion data stream is presented. The algorithm gains the statistical information in data stream by the incremental clustering. The number of clusters for fuzzy clustering changes dynamically according to proposed weighted fuzzy cluster feature. The normal behaviors and intrusion behaviors can be separated according to the principle of the maximum degree. Theoretical analysis and experimental results show the algorithms can detect the intrusion behaviors effectively and handle all types of network data and have good application prospects.In this thesis, we make the effective exploration and attempt to put forward some innovative solutions of several key issues on outlier mining based on the clustering. Theoretical analysis and experimental results show that our proposed algorithms can solve the corresponding issues effectively and efficiently, and are great complementarities and improvement to the existing methods. The research not only offers a new view and means for outlier mining, but also enriches the research of pattern recognition.
Keywords/Search Tags:Outlier mining, clustering, pattern recognition, inter-cluster similarity measure, large-scale dataset, data stream, intrusion detection
PDF Full Text Request
Related items