
Research On Outlier Detection Algorithm In Data Mining

Posted on: 2015-03-28
Degree: Master
Type: Thesis
Country: China
Candidate: T T Hu
Full Text: PDF
GTID: 2268330428960245
Subject: Computer software and theory
Abstract/Summary:
Outlier detection is a branch of data mining whose task is to identify observations whose characteristics differ significantly from the rest of the data. In nature, in human society, and in data sets, most events and objects are ordinary, but unusual objects also occur, and valuable information may lie behind them. Outlier detection therefore has broad application prospects and is a worthwhile research topic.

Many outlier detection methods already exist, including statistics-based, depth-based, distance-based, and density-based methods. This thesis introduces the background, significance, and research status of outlier detection.

The distance-based and frequency-based methods are analyzed. The thesis identifies the problems of the traditional approaches and improves the algorithms.

Attributes can usually be divided into two categories: numerical and categorical. The thesis analyzes the differences between the two and does the following work:

For numerical data, the thesis improves the distance-based detection method. The traditional distance-based algorithm has many parameters and is sensitive to their choice, so the average distance is used to detect outliers instead. This algorithm requires a large amount of computation and is not suitable for large data sets. To address this, some non-outliers are pruned by the rule that an object with k or more data points in its r-neighborhood is not an outlier. The data are also clustered, and the clusters are sorted by density. The lowest-density clusters are examined first, so the pruning threshold rises quickly, and the pruning rules are then applied again. This greatly reduces computing time.

For categorical data, the thesis analyzes the shortcomings of the distance-based method.
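
The pruning rule for numerical data can be illustrated with a minimal Python sketch. It is not the thesis's algorithm: the function name `average_distance_outliers`, the use of Euclidean distance, the choice of the mean distance to the k nearest neighbors as the score, and the `top_n` parameter are all assumptions made for illustration. The clustering step and the rising pruning threshold are omitted.

```python
import math


def average_distance_outliers(points, k, r, top_n):
    """Return indices of the top_n points with the largest average
    distance to their k nearest neighbors (hypothetical helper).

    A point is pruned as a non-outlier as soon as k other points are
    found inside its r-neighborhood, so its remaining distances are
    never computed.
    """
    scored = []
    for i, p in enumerate(points):
        dists = []
        within_r = 0
        pruned = False
        for j, q in enumerate(points):
            if j == i:
                continue
            d = math.dist(p, q)  # Euclidean distance (assumption)
            if d <= r:
                within_r += 1
                if within_r >= k:
                    pruned = True  # k or more neighbors within r
                    break
            dists.append(d)
        if pruned or not dists:
            continue
        dists.sort()
        nearest = dists[:k]
        scored.append((sum(nearest) / len(nearest), i))
    scored.sort(reverse=True)  # largest average distance first
    return [i for _, i in scored[:top_n]]


points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(average_distance_outliers(points, k=2, r=1.5, top_n=1))  # -> [4]
```

In this toy data set the four corner points each have two neighbors within r = 1.5 and are pruned early, so only the isolated point (10, 10) is scored.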
Methods commonly used for categorical data are introduced, including the entropy-based and frequency-based methods. The thesis points out the shortcomings of the frequency-based algorithm AVF and improves it. The data set is first clustered with the k-modes algorithm, which is designed for categorical data, to remove objects with high similarity. The frequency-based method is then applied to detect outliers, in order to achieve better detection.
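
As a rough illustration of the frequency-based idea, the following Python sketch computes a plain AVF (Attribute Value Frequency) score: the mean, over all attributes, of how often each of a record's values occurs in its column. Records with the lowest scores are treated as outlier candidates. The function names `avf_scores` and `avf_outliers` and the `top_n` parameter are hypothetical, and the k-modes pre-clustering step described above is not included, so this shows only the baseline AVF, not the improved method.

```python
from collections import Counter


def avf_scores(records):
    """AVF score of each record: average frequency of its attribute values."""
    n_attrs = len(records[0])
    # One frequency table per attribute (column).
    freq = [Counter(rec[a] for rec in records) for a in range(n_attrs)]
    return [
        sum(freq[a][rec[a]] for a in range(n_attrs)) / n_attrs
        for rec in records
    ]


def avf_outliers(records, top_n):
    """Indices of the top_n records with the lowest AVF score."""
    scores = avf_scores(records)
    order = sorted(range(len(records)), key=lambda i: scores[i])
    return order[:top_n]


records = [
    ("red", "small"),
    ("red", "small"),
    ("red", "large"),
    ("blue", "tiny"),
]
print(avf_scores(records))       # -> [2.5, 2.5, 2.0, 1.0]
print(avf_outliers(records, 1))  # -> [3]
```

The record ("blue", "tiny") has the lowest score because both of its values occur only once in the data set.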
Keywords/Search Tags: Outlier Detection, Average Distance, Frequency