Discovering and ranking outliers in very large datasets

Posted on: 2007-02-08
Degree: M.Sc
Type: Thesis
University: University of Alberta (Canada)
Candidate: Pei, Yaling
GTID: 2448390005977626
Subject: Computer Science
Abstract/Summary:
Outlier detection aims to discover exceptional instances in datasets. It has long been studied in the statistics literature, and in recent years it has attracted considerable interest in data mining and found many important applications. Current work on outlier detection focuses mainly on three aspects: the definition of outliers, efficient methods for finding meaningful outliers, and evaluation methodology. In this thesis, we propose a new method that estimates the neighborhood density of a data point from its relative degree of density with respect to a set of reference points. Each data point is assigned an outlier score, and candidate outliers are ranked by that score. The running time of our reference-based algorithm is O(Rn log n), where n is the size of the dataset and R is the number of reference points. Analysis and experiments show that our method is very effective and highly scalable to very large datasets. To facilitate experimental tests for outlier analysis and to automate the generation of diverse datasets, we also developed a generic framework for synthetic data generation. The system can efficiently produce datasets with various characteristics, such as size, shape, and density, as well as cluster and outlier distributions.
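The abstract does not give the scoring formula, so the following Python sketch is only one plausible reading of a reference-based approach, not the thesis's actual algorithm. It assumes that, for each reference point, all data points are sorted by their distance to it, and that a point's density is estimated from the distance gaps to its k nearest neighbours in that one-dimensional ordering. It further assumes that the smallest density over all reference points is kept and that scores are normalized to [0, 1]. The function name `reference_based_scores`, the parameter `k`, and the sample data are hypothetical. Sorting once per reference point gives the O(Rn log n) cost quoted above when k is treated as a constant.

```python
import math

def reference_based_scores(points, refs, k=2):
    """Return one outlier score in [0, 1] per point (higher = more outlying).

    Assumed scheme, not taken from the thesis: density w.r.t. a reference
    point is k divided by the summed distance gaps to the k nearest
    neighbours in the ordering by distance to that reference point.
    """
    n = len(points)
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of points")
    min_density = [math.inf] * n
    for r in refs:
        # Distance of every point to this reference point: O(n).
        dist = [math.dist(p, r) for p in points]
        # Order the point indices by that distance: O(n log n).
        order = sorted(range(n), key=dist.__getitem__)
        for pos, i in enumerate(order):
            # Expand left/right in the sorted order to collect the k
            # neighbours whose distances are closest to dist[i].
            lo, hi = pos - 1, pos + 1
            total = 0.0
            for _ in range(k):
                left = dist[i] - dist[order[lo]] if lo >= 0 else math.inf
                right = dist[order[hi]] - dist[i] if hi < n else math.inf
                if left <= right:
                    total += left
                    lo -= 1
                else:
                    total += right
                    hi += 1
            density = k / total if total > 0 else math.inf
            # Keep the least favourable (smallest) density over all references.
            min_density[i] = min(min_density[i], density)
    finite = [d for d in min_density if d != math.inf]
    max_density = max(finite) if finite else 1.0
    # Normalize so that the densest point scores 0 and sparse points approach 1.
    return [1.0 - min(d, max_density) / max_density for d in min_density]

# Hypothetical data: a tight cluster near the origin plus one distant point.
points = [(0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1), (0.05, 0.05), (5, 5)]
refs = [(0, 0), (0, 5)]
scores = reference_based_scores(points, refs, k=2)
ranking = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
print(ranking[0], scores)
```

Taking the minimum over reference points matters in this sketch: from the second reference point, (5, 5) lies at the same distance as (0, 0) and so looks dense, while the first reference point exposes it as isolated.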
Keywords/Search Tags:Outlier, Datasets