Discovering and ranking outliers in very large datasets

Posted on: 2007-02-08

Degree: M.Sc

Type: Thesis

University: University of Alberta (Canada)

Candidate: Pei, Yaling

GTID:2448390005977626

Subject: Computer Science

Abstract/Summary:

Outlier detection aims to discover exceptional instances in datasets. It has long been studied in the statistics literature and, in recent years, has attracted considerable interest in data mining, where it has found many important applications. Current work on outlier detection focuses mainly on three aspects: the definition of outliers, efficient methods for finding meaningful outliers, and evaluation methodology. In this thesis, we propose a new method that estimates the neighborhood density of a data point from its relative degree of density with respect to a set of reference points. Each data point is assigned an outlier score, and candidate outliers are ranked by that score. The running time of our reference-based algorithm is O(Rn log n), where n is the size of the dataset and R is the number of reference points. Analysis and experiments show that the method is effective and scales to very large datasets. To facilitate experimental testing for outlier analysis and to automate the generation of diverse datasets, we also developed a generic framework for synthetic data generation. The system can efficiently produce datasets with varied characteristics such as size, shape, and density, as well as cluster and outlier distributions.
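
The abstract does not give the algorithm's details, so the following Python sketch only illustrates how a reference-based score can reach the stated O(Rn log n) cost. It rests on several assumptions that are not taken from the thesis: for each reference point the data are sorted by distance to it (one O(n log n) sort), density is the inverse of the mean gap to the k nearest neighbours in that sorted one-dimensional list, a point keeps its lowest density over all references, and the score is one minus that density divided by the largest density in the dataset. The function name, the parameter k, the choice of reference points, and the tiny synthetic dataset are all hypothetical.

```python
import math
import random


def reference_outlier_scores(points, references, k=5):
    """Return one score per point in [0, 1]; higher means more outlying.

    Hypothetical helper: the density and score formulas are assumptions.
    """
    n = len(points)
    if not references:
        raise ValueError("at least one reference point is required")
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < len(points)")
    eps = 1e-12  # avoids division by zero when points coincide
    min_density = [math.inf] * n
    for ref in references:
        # Reduce every point to a single number: its distance to this reference.
        dist = [math.dist(p, ref) for p in points]
        order = sorted(range(n), key=dist.__getitem__)  # O(n log n)
        sorted_d = [dist[i] for i in order]
        for pos, idx in enumerate(order):
            # Sum the gaps to the k nearest neighbours in the sorted list by
            # growing a window to the left or right, whichever gap is smaller.
            lo, hi, total = pos - 1, pos + 1, 0.0
            for _ in range(k):
                left = sorted_d[pos] - sorted_d[lo] if lo >= 0 else math.inf
                right = sorted_d[hi] - sorted_d[pos] if hi < n else math.inf
                if left <= right:
                    total += left
                    lo -= 1
                else:
                    total += right
                    hi += 1
            density = k / (total + eps)  # inverse of the mean gap
            min_density[idx] = min(min_density[idx], density)
    top = max(min_density)
    return [1.0 - d / top for d in min_density]


# Tiny synthetic dataset (assumed for illustration): one Gaussian cluster
# around the origin plus a single planted outlier at index 200.
rng = random.Random(0)
data = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(200)]
data.append((12.0, 12.0))

# Reference points placed at the corners of a box around the data (assumed).
refs = [(-5.0, -5.0), (-5.0, 15.0), (15.0, -5.0), (15.0, 15.0)]

scores = reference_outlier_scores(data, refs, k=5)
ranking = sorted(range(len(data)), key=scores.__getitem__, reverse=True)
print(ranking[:3])  # indices of the three highest-scoring candidates
```

With k treated as a constant, each reference point costs one sort plus a linear pass, which matches the O(Rn log n) bound quoted in the abstract. Taking the minimum density over references is a conservative choice in this sketch: a point that looks isolated from any one reference is flagged, so the planted point at index 200 should head the ranking.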

Keywords/Search Tags:

Outlier, Datasets

Related items

1	Research On Local Outlier Detection Algorithm
2	Research On Outlier Detection Algorithm And Its Application For Large-scale Datasets
3	Research On Density-based Outlier Detection In Multi-dimensional Datasets
4	Study On An Analysis Method For Cluster-based Outlier
5	Outlier Mining Method Based On Gini Indexes And Sub-space Research
6	Dynamic Group Nearest Neighbors And Its Applications In Outlier Detection And Clustering Analysis
7	Research Of Outlier Detection Algorithm Based On Hadoop
8	Research On Outlier Detection Based On Density Difference
9	Research Of Detection Outlier Based On Outlier Degree
10	Anomaly detection in heterogeneous datasets