Toward accurate and efficient outlier detection in high dimensional and large data sets

Posted on: 2011-04-10
Degree: Ph.D
Type: Thesis
University: Georgia Institute of Technology
Candidate: Nguyen, Minh Quoc
Full Text: PDF
GTID: 2448390002468429
Subject: Computer Science
Abstract/Summary:
Advances in computing have led to the generation and storage of extremely large amounts of data every day. Data mining is the process of discovering relationships within data. The identified relationships can be used for scientific discovery, business decision making, or data profiling. Among data mining techniques, outlier detection plays an important role. Outlier detection is the process of identifying events that deviate greatly from the masses. The detected outliers may signal a new trend in the process that produces the data, or they may signal fraudulent activities in the dataset. This thesis shows that the efficiency and accuracy of unsupervised outlier detection methods for high dimensional tabular data can be greatly improved.

Local outlier factor (LOF) is an unsupervised method to detect local density-based outliers. The advantage of the method is that the outliers can be detected without training datasets or prior knowledge about the underlying process that produces the dataset. The method requires the computation of the k-nearest neighbors for the dataset. The main problem is that the efficiency and accuracy of the indexing methods used to compute k-nearest neighbors deteriorate in high dimensional data. The first contribution of this work is to develop a method that can compute the local density-based outliers very efficiently in high dimensional data. In our work, we have shown that this type of outlier remains present in any subset of the dataset. This property is used to partition the dataset into random subsets and compute the outliers locally within each subset. The outliers from the different subsets are then combined. Therefore, the local density-based outliers can be computed very efficiently. Another challenge in outlier detection in high dimensional data is that the outliers are often suppressed when the majority of dimensions do not exhibit outliers. The second contribution of this work is to introduce a filtering method in which outlier scores are computed in sub-dimensions. The low sub-dimensional scores are filtered out and the high scores are aggregated into the final score. This aggregation with filtering eliminates the effect of accumulating small (delta) deviations across multiple dimensions. Therefore, the outliers are identified correctly.

In some cases, sets of outliers that form micro patterns are more interesting than individual outliers. These micro patterns are considered anomalous with respect to the dominant patterns in the dataset. In the area of anomalous pattern detection, there are two challenges. The first challenge is that existing clustering techniques often overlook the anomalous patterns in favor of the dominant patterns. A common approach is to cluster the dataset using the k-nearest neighbor algorithm. The contribution of this work is to introduce the adaptive nearest neighbor and the concept of dual-neighbor to detect micro patterns more accurately. The next challenge is to compute the anomalous patterns quickly. Our contribution is to compute the patterns based on the correlation between the attributes. The correlation implies that the data can be partitioned into groups based on each attribute to learn the candidate patterns within the groups. Thus, a feature-based method is developed that can compute these patterns very efficiently.
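
The effect of filtering before aggregation, as described in the abstract, can be illustrated with a minimal Python sketch. Several choices here are assumptions made for illustration and are not taken from the thesis: each sub-dimension is a single attribute, the sub-dimensional outlier score is a plain absolute z-score (a stand-in for whatever score the thesis actually uses), the threshold of 2.0 is arbitrary, and the function names `per_dimension_scores`, `naive_score`, and `filtered_score` are hypothetical.

```python
from statistics import mean, pstdev

def per_dimension_scores(rows):
    """Absolute z-score of every value, computed column by column.

    Assumption: a z-score stands in for the sub-dimensional outlier score.
    """
    columns = list(zip(*rows))
    stats = [(mean(col), pstdev(col)) for col in columns]
    return [
        [abs(v - m) / s if s > 0 else 0.0 for v, (m, s) in zip(row, stats)]
        for row in rows
    ]

def naive_score(dim_scores):
    """Aggregate without filtering: small deviations accumulate."""
    return sum(dim_scores)

def filtered_score(dim_scores, threshold):
    """Aggregate only the sub-dimensional scores that reach the threshold."""
    return sum(s for s in dim_scores if s >= threshold)

DIMS = 10
# Ten ordinary points whose attributes alternate between +1 and -1.
rows = [[1.0 if (i + j) % 2 == 0 else -1.0 for j in range(DIMS)]
        for i in range(10)]
# Index 10: deviates slightly in every dimension.
rows.append([1.5] * DIMS)
# Index 11: deviates strongly in one dimension only.
rows.append([6.0] + [0.0] * (DIMS - 1))

scores = per_dimension_scores(rows)
naive = [naive_score(s) for s in scores]
filtered = [filtered_score(s, threshold=2.0) for s in scores]

top_naive = max(range(len(rows)), key=naive.__getitem__)
top_filtered = max(range(len(rows)), key=filtered.__getitem__)
print("top point without filtering:", top_naive)
print("top point with filtering:", top_filtered)
```

In this toy dataset the unfiltered sum ranks the point with many small deviations highest, while the filtered sum ranks the point with one strong deviation highest. The threshold controls how aggressively low sub-dimensional scores are discarded, and a real application would need to choose it from the data.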
Keywords/Search Tags:Data, High dimensional, Outlier detection, Patterns, Method, Compute