Research And Analysis On Distance-based Outlier Detection

Posted on:2012-01-18

Degree:Master

Type:Thesis

Country:China

Candidate:Z Wang

Full Text:PDF

GTID:2178330338497521

Subject:Computer software and theory

Abstract/Summary:

As an important research area of data mining, outlier detection focuses on a very small fraction of a data set. By analyzing the rare objects that deviate markedly from the rest, unexpected and practically useful knowledge can be discovered. Outlier detection is therefore widely used in intrusion detection, credit fraud detection, fault diagnosis, and similar applications. Researchers have developed many algorithms to detect outliers more effectively; these can be grouped into distance-based, statistics-based, density-based, depth-based, and deviation-based algorithms, among others. The distance-based approach is significant both in theory and in practice because its distance function can be customized flexibly for effective outlier detection. However, current research on distance-based outlier detection still has shortcomings in practical applications, such as poor efficiency on high-dimensional data sets and the difficulty of selecting initial parameters.

This thesis analyzes the problems and shortcomings of distance-based outlier detection and proposes a coarse-grained approach with several improvements. Pruning the data set with an expanded cell structure markedly improves the efficiency of outlier detection. In addition, the detection process is optimized by using a more reasonable initial parameter calculated with the KNN method. The main contributions of this thesis are as follows.

① The current state and process of data mining are introduced, and the relevant background on outliers is summarized. Existing outlier detection algorithms are analyzed and compared, and the relative merits and applicability of each are discussed.

② An overview of preprocessing techniques is given, focusing on the theory and methods of data scrubbing, data integration and transformation, and data reduction. Dimension reduction techniques are also summarized by introducing the current theory and methods of feature selection and feature transformation.

③ A distance-based outlier detection algorithm, called the coarse-grained approach, is proposed; it improves on the original cell-based algorithm by enlarging the granularity of the cell. Experimental results show that the coarse-grained approach outperforms the cell-based approach in both time and space complexity.

④ A practical method for calculating the initial distance parameter for distance-based outlier detection is suggested by extending the KNN method. Calculating a reasonable initial distance parameter optimizes the knowledge discovery process and reduces the degree of dependence on supervision.

The experiments use two UCI data sets, Abalone and El Nino. They compare the efficiency of the coarse-grained approach and the cell-based approach under different influencing factors: distance, proportion, dimension, and data set size. The results show that the coarse-grained approach discovers outliers effectively and performs better than the cell-based approach.
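
To make the two parameters mentioned in the abstract (distance and proportion) concrete, the following is a minimal sketch of distance-based outlier detection by brute force. It is not the thesis's coarse-grained or cell-based algorithm, whose details the abstract does not give. It assumes the common DB(p, D) definition, under which an object is an outlier if at least a fraction p of the other objects lie farther than D from it. The function names, the sample points, and the rule of setting D to the mean k-th-nearest-neighbour distance are illustrative assumptions, not the thesis's KNN-based calculation.

```python
import math


def pairwise_distances(points):
    """Return the full Euclidean distance matrix for a list of points."""
    return [[math.dist(a, b) for b in points] for a in points]


def db_outliers(points, p, D):
    """Return indices of DB(p, D)-outliers (assumed definition).

    An object is flagged when at least a fraction p of the OTHER
    objects lie at a distance greater than D from it.
    """
    dist = pairwise_distances(points)
    n = len(points)
    outliers = []
    for i in range(n):
        far = sum(1 for j in range(n) if j != i and dist[i][j] > D)
        if far >= p * (n - 1):
            outliers.append(i)
    return outliers


def knn_distance_parameter(points, k):
    """Hypothetical heuristic for an initial D.

    Takes the mean, over all objects, of the distance to the k-th
    nearest neighbour. This is an assumption made for illustration.
    """
    dist = pairwise_distances(points)
    kth = []
    for i, row in enumerate(dist):
        others = sorted(d for j, d in enumerate(row) if j != i)
        kth.append(others[k - 1])
    return sum(kth) / len(kth)


# Made-up sample: four clustered points and one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
D = knn_distance_parameter(points, k=2)
print(db_outliers(points, p=0.9, D=D))
```

The brute-force check compares every pair of objects, so its cost grows quadratically with the data set size. That cost is the motivation for pruning schemes such as the cell-based approach that the abstract takes as its baseline.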

Keywords/Search Tags:

data mining, outlier, cell, distance parameter

Related items

1	An Outlier Mining And Paralleling Method Based On The Grid Cell And P Weights
2	Outlier Data Mining Algorithm Based On Distance And And Application
3	Study On Distance-Based Outlier Mining Algorithm
4	Research And Application Outlier Detection Method Based On Density&Distance
5	Study Of Outlier Data Mining Algorithm Based On Web Service Security
6	The Study, Distance-based Clustering And Outlier Detection
7	Research On Outlier Detection Algorithm In Data Mining
8	Research Of Data Mining Based On Outliers And Appliction In It Audit System
9	Research On Outlier Mining Method Oriented To Multidimensional Data
10	The Implementation Of Crime-Related Telecommunication Activities Finding Based On Outlier Data Mining