Study On Outlier Detection And Outlying Interpreting Algorithms

Posted on: 2013-07-01
Degree: Doctor
Type: Dissertation
Country: China
Candidate: D J Lei
GTID: 1228330392954029
Subject: Computer Science and Technology
Abstract/Summary:
An outlier is a data point whose pattern appears abnormal relative to the large body of normal data. Many data mining methods try to reduce the influence of outliers or remove them entirely, but doing so can discard useful information hidden in them. Outlier detection draws on many data processing techniques, such as data mining, machine learning, statistics, intelligent computing, and visualization, to find the outliers in a data set and the mechanism that produces them, so as to give users a deeper analysis of the data.

Outlier detection is an important research direction in data mining. It has achieved considerable success and has recently been applied in many fields, especially to detect irrational or abnormal behavior in data sets: financial fraud detection, network intrusion and anomaly detection, process monitoring and identification, anomaly detection in hyperspectral images, analysis of abnormal medical responses, and abnormal signal detection. Outlier detection and analysis therefore has both academic significance and broad application prospects. However, detecting abnormal data quickly and precisely, and analyzing the reasons that lead to the anomaly (outlier interpretation), remains a challenging problem.

This paper studies several related theories and methods of outlier detection and outlier definition and verifies them by experiments. The main work and results of this paper are as follows:

First, we analyze how the correct number of clusters affects the results of clustering-based outlier detection algorithms. The algorithm proposed in this paper has two phases: clustering, then outlier detection. In the first phase, subtractive clustering is used to obtain a rough estimate of the number of real clusters.
A cluster validation index serves as the criterion for evaluating the clustering. We then search for the optimal number of clusters and use it for clustering. In the second phase, the clustering results are combined with a clustering-based local outlier factor to detect outliers, and the outlier factor of each data point is taken as its outlier measure. By obtaining the optimal number of clusters, this algorithm substantially improves outlier detection.

Second, this paper proposes a cloud model-based outlier detection algorithm, motivated by the fact that outlier detection methods for continuous numeric data cannot be applied to data sets with categorical attributes. First, the forward cloud generator of the cloud model is used to transform each data record into a “cloud drop”. Finally, data records are judged to be outliers by their outlier measure, according to the certainty values of the cloud model to which their “cloud drop” belongs.

Turning to outlier interpretation, this paper proposes the following definition: if some attribute subset of the full attribute space yields outliers close to those discovered in the full attribute space, that attribute subset is termed an outlier interpretation subspace. The outlier interpretation subspace is one aspect of the study of outlier interpretation. It can partly explain why the outliers arise. In addition, future outlier detection on massive data can be carried out directly in the outlier subspace. Because searching for outlier interpretation subspaces has high time complexity, this paper proposes an outlier subspace search algorithm based on a power graph with a pruning strategy, and an attribute reduction-based outlier detection method built on rough set concepts, and verifies their validity by experiments.

To analyze the outlier interpretation subspace further, this paper proposes the concept of an outlier key subspace.
The outlier key subspace is similar to the kernel concept in attribute reduction: it is a necessary but not sufficient condition for generating an outlier. This paper proposes a search algorithm for the outlier key subspace based on tensor space. The algorithm first takes the outlier as the center, then finds its nearest neighbor set using shared nearest neighbor similarity, and finally expands a data space from the nearest neighbor set and detects local outliers in the attribute subspaces of that data space. By using the tensor space, the algorithm avoids the time cost of searching the entire data space. By using shared nearest neighbor similarity, which helps overcome the “curse of dimensionality”, it also guarantees the accuracy of the algorithm.
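The neighbor-search step described above can be illustrated with a minimal sketch. It assumes a common definition of shared nearest neighbor (SNN) similarity, namely the number of points that two k-nearest-neighbor lists have in common, with Euclidean distance for the underlying k-NN search. The abstract does not specify either choice, so this is not the dissertation's implementation. The function names and the parameters `k` and `m` are hypothetical.

```python
import numpy as np


def knn_indices(X, k):
    """For each row of X, return the indices of its k nearest rows.

    Assumption: Euclidean distance; the abstract does not name a metric.
    """
    # Pairwise squared distances via broadcasting; fine for small data sets.
    diff = X[:, None, :] - X[None, :, :]
    d2 = (diff ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbor
    return np.argsort(d2, axis=1)[:, :k]


def snn_similarity(X, k):
    """SNN similarity matrix: entry (i, j) is the size of kNN(i) & kNN(j)."""
    nn = knn_indices(X, k)
    n = X.shape[0]
    member = np.zeros((n, n), dtype=int)
    member[np.arange(n)[:, None], nn] = 1  # member[i, j] = 1 if j is in kNN(i)
    return member @ member.T  # counts the neighbors shared by each pair


def snn_neighbors(X, center, k, m):
    """Indices of the m points most similar to `center` under SNN similarity."""
    sim = snn_similarity(X, k)[center].copy()
    sim[center] = -1  # exclude the center point itself
    return np.argsort(-sim, kind="stable")[:m]
```

Calling `snn_neighbors(X, center, k, m)` with `center` set to the row index of a detected outlier returns a candidate neighbor set around it. Because the similarity depends on neighbor rankings rather than raw distances, it tends to be less affected by distance concentration in high-dimensional data, which is the usual motivation for SNN.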
Keywords/Search Tags: Outlier Detection, Automatic Clustering, Cloud Model, Outlying Interpret Subspace, High Dimensional Dataset