
Research On Algorithms For Outlier Detection

Posted on: 2013-02-24    Degree: Doctor    Type: Dissertation
Country: China    Candidate: M L Yang    Full Text: PDF
GTID: 1118330371480810    Subject: Computer software and theory
Abstract/Summary:
An outlier is a data point that deviates significantly from the majority of the data points in a dataset, and it is often more valuable than the mainstream data. Outlier detection finds extensive use in a wide variety of applications and has therefore received increasing attention from researchers in diverse disciplines such as data mining, machine learning, statistics and information theory. Over the years, a variety of outlier detection algorithms based on different techniques have been developed for specific types of data sets, but reducing the time complexity of outlier detection in large high-dimensional datasets remains unresolved. A key advantage of nearest-neighbor based techniques is that they are unsupervised and make no assumptions about the distribution of the dataset. Their bottleneck is that a nearest-neighbor search is required for each data point, resulting in O(n^2) time complexity, which restricts their applicability to large high-dimensional datasets. Designing effective and efficient algorithms for mining outliers from large high-dimensional datasets is therefore of great theoretical and practical importance.

Pruning non-outliers to reduce the size of the target data set is an effective means of reducing the time complexity. If the r-neighborhood of a given point contains more than k neighbors, the point is not a DB(k, r) outlier. Furthermore, if the r/2-neighborhood of a given point contains more than k neighbors, then no point in that neighborhood is a DB(k, r) outlier. Applying this pruning rule excludes a large number of non-outliers and thus greatly reduces the size of the target data set. It can be done in a single pass over the data set: take the first point as the first group center, then scan the data set sequentially; for each point, if there exists a group center whose distance to the point is less than r/2, assign the point to that group, and if no such group center exists, the point becomes a new group center. After grouping, all data points in groups with more than k members are pruned (a minimal code sketch of this grouping pass appears after this overview).

Avoiding unnecessary nearest-neighbor searches when detecting the top-n outliers is the second effective means of reducing the time complexity: keep track of the smallest average k-distance among the n outlier candidates found so far and use it as a threshold. Once a data point's average k-distance falls below the threshold, its k-nearest-neighbor search can be terminated immediately, because the point has no chance of being a top-n outlier (see the second sketch below).

Sampling can also reduce the data space and is thus a third effective means of reducing the time complexity. Uniform random sampling, however, lacks flexibility. Density biased sampling overcomes this limitation: it represents the same data set with a smaller sample and is therefore more flexible. If outliers are sampled with higher probability, they can be detected in the sample instead of the entire dataset; if data points are density biased sampled, k nearest neighbors can be found in the sample instead of the entire dataset. Both sampling strategies reduce the search space. Moreover, the difference between the distances from two points to the same sampled point is smaller than the distance between the two points themselves, so that distance is lower bounded by the difference of the two distances (stated formally in the inequality below).
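The grouping-and-pruning pass described above can be summarized in the following minimal Python sketch. The function name prune_non_outliers, the use of Euclidean distance and a NumPy array for the data are illustrative assumptions, not the dissertation's exact implementation.

import numpy as np

def prune_non_outliers(data, k, r):
    """Single-pass grouping: points in groups with more than k members
    cannot be DB(k, r) outliers and are pruned from the candidate set."""
    centers = []   # group centers, created on the fly
    members = []   # members[g] holds the indices assigned to group g
    for i, p in enumerate(data):
        for g, c in enumerate(centers):
            if np.linalg.norm(p - c) < r / 2.0:   # point lies in an existing group
                members[g].append(i)
                break
        else:                                     # no center within r/2: start a new group
            centers.append(p)
            members.append([i])
    # only points from small groups (at most k members) remain outlier candidates
    return [i for group in members if len(group) <= k for i in group]

On data with dense regions most points fall into large groups, so a single pass typically leaves only a small candidate set for the expensive nearest-neighbor test.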
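The early-termination rule for top-n detection can likewise be sketched as follows; keeping the candidates in a heap and scoring by Euclidean distance are assumptions of this sketch, not requirements of the method.

import heapq
import numpy as np

def top_n_outliers(data, k, n):
    """Return indices of the n points with the largest average k-distance,
    abandoning a point's k-NN search as soon as it cannot beat the current
    threshold (the smallest score among the n best candidates so far)."""
    top = []                                   # min-heap of (score, index)
    for i, p in enumerate(data):
        cutoff = top[0][0] if len(top) == n else float("-inf")
        knn = []                               # negated distances: max-heap of the k closest so far
        pruned = False
        for j, q in enumerate(data):
            if i == j:
                continue
            d = np.linalg.norm(p - q)
            if len(knn) < k:
                heapq.heappush(knn, -d)
            elif d < -knn[0]:
                heapq.heapreplace(knn, -d)
            # once k neighbors are known, the average k-distance can only shrink,
            # so dropping below the threshold means point i cannot be a top-n outlier
            if len(knn) == k and -sum(knn) / k < cutoff:
                pruned = True
                break
        if not pruned and len(knn) == k:
            score = -sum(knn) / k
            if len(top) < n:
                heapq.heappush(top, (score, i))
            elif score > top[0][0]:
                heapq.heapreplace(top, (score, i))
    return [idx for _, idx in sorted(top, reverse=True)]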
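Stated formally, the lower bound mentioned above is the triangle inequality applied to a sampled reference point; the symbols p, q and s are ours, not the dissertation's:

\[
  \lvert d(p, s) - d(q, s) \rvert \;\le\; d(p, q)
\]

Since d(p, s) and d(q, s) are already known once the sample has been processed, this bound lets many candidate pairs be discarded without ever computing d(p, q).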
This property can also be used in place of the traditional average k-distance to achieve a lower time complexity. The density biased sampling based outlier detection approach needs only three passes over the dataset to identify outliers: after the first pass, the density of each point is estimated; after the second, data points are sampled; after the third, outliers are identified (a hedged sketch of this three-pass scheme follows this paragraph).

The conditional outlier is a new type of outlier proposed recently, and only a few nearest-neighbor based algorithms have been proposed to detect such outliers. The shortcoming of the existing CODB algorithm is obvious: it has too many parameters, and the detection result is sensitive to them, which limits its applicability. An improvement is made to remove the parameters, and a new algorithm is proposed.

Essentially, the outlier-ness of a data point has nothing to do with the value of the point or with the distance from one point to another; only the attribute probabilities of the point contribute to it. The lower the probability, the more likely the point is an outlier, and information entropy precisely captures this. When a normal data point is removed from the dataset, the entropy should increase; when an outlier is removed, the entropy should decrease, and the more significant the decrease, the more likely the point is an outlier. However, the information entropy based outlier detection algorithms proposed so far involve too much computation: to evaluate the outlier degree of a data point, the entropy has to be calculated twice, before and after the removal of the point, and then the impact is derived. Calculating entropy is time consuming, since it involves computing the attribute probabilities over every dimension. It can be proved, however, that the impact of removing a candidate object is determined by the entropy before removal together with the attribute probabilities of the candidate object, and that the outlier factor of a data point depends only on the latter. This value is therefore defined as the entropy outlier factor. The entropy outlier factor based outlier detection algorithm needs only two scans over the dataset to identify all outliers, so the time complexity is further reduced (a sketch of such a two-pass scorer is given below).
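A hedged sketch of the three-pass scheme is shown below. The grid-cell density estimate, the inverse-square-root sampling bias and the scoring by average k-distance to sampled points are illustrative assumptions; the dissertation's exact estimator and scoring rule may differ.

import numpy as np

def density_biased_outliers(data, cell_width, sample_rate, k, n, seed=0):
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)

    # pass 1: estimate each point's density as the population of its grid cell
    keys = [tuple(np.floor(p / cell_width).astype(int)) for p in data]
    cells = {}
    for key in keys:
        cells[key] = cells.get(key, 0) + 1
    density = np.array([cells[key] for key in keys], dtype=float)

    # pass 2: sample with probability biased against dense regions, so that
    # sparse regions (likely outliers) are over-represented in the sample
    prob = np.minimum(1.0, sample_rate / np.sqrt(density))
    sample = data[rng.random(len(data)) < prob]

    # pass 3: score every point by its average distance to its k nearest
    # sampled points and report the n highest-scoring points as outliers
    scores = []
    for p in data:
        d = np.linalg.norm(sample - p, axis=1)
        scores.append(np.sort(d)[:k].mean())
    return np.argsort(scores)[-n:][::-1]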
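Finally, a minimal sketch of a two-pass, entropy-outlier-factor style scorer for categorical data. The exact factor defined in the dissertation is not reproduced here; scoring each object by the summed negative log-probabilities of its attribute values, collected in two passes, is an illustrative assumption consistent with the idea that the factor depends only on the object's attribute probabilities.

import math
from collections import Counter

def entropy_outlier_factors(records):
    """records: list of equal-length tuples of categorical attribute values."""
    n = len(records)
    m = len(records[0])
    # pass 1: attribute-value frequencies per dimension
    counts = [Counter(rec[j] for rec in records) for j in range(m)]
    # pass 2: score each record; rarer attribute values give a larger factor,
    # i.e. removing the record would shrink the dataset's entropy the most
    factors = []
    for rec in records:
        score = sum(-math.log(counts[j][rec[j]] / n) for j in range(m))
        factors.append(score)
    return factors

# usage sketch: the record whose values are both rare gets the largest factor
data = [("a", "x"), ("a", "x"), ("a", "x"), ("a", "y"), ("b", "z")]
scores = entropy_outlier_factors(data)
print(scores.index(max(scores)))   # prints 4, the index of ("b", "z")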
Keywords/Search Tags: Outlier detection, Distance based, Density based, K nearest neighbor, Conditional outlier, Density biased sampling, Information entropy