Font Size: a A A

Study On Algorithms For Fast Outlier Detection

Posted on:2009-02-26Degree:MasterType:Thesis
Country:ChinaCandidate:L YaoFull Text:PDF
GTID:2178360275451026Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
The problem of outlier mining has been variously called anomaly detection, detecting rare events,exception mining,outlier detection,deviation detection,etc. Outlier may be "dirty data",but it also can means meaningful event corresponding to the reality.From the point of knowledge discovery,rare events are often more interesting and valuable than others in many domains.As a result,outlier mining is an important and meaningful research in data mining.At present,outlier detection is gradually becoming a hotspot for the researchers of database,machine learning and statistics.Because of the complexity and mass characteristics of the high-dimensional data, traditional outlier mining methods can not be good for high-dimensional large datasets. Often data is incomplete and user only concerned on the instability of local data, therefore mining of local outlier is the focus of the article.There are key problems analyzed about algorithms for fast outlier detection,which are how to determine the local neighborhood and how to judge the local outliers.The method of attribute partition is used to divide attributes in each data object into environmental attributes and inherent attributes.The inherent attributes characterize the data object while the environmental attributes embody the relationship between this data object and the neighbor data objects.Research on the method of dimension reduction about high-dimension data and pruning strategy about mass data are two key points in the article.The author's main work are listed as follows:(1) Analysis of the characteristics of high-dimensional data,researching a LLE similar method of dimensional reduction based on attribute division(Ad-LLE algorithm).Ad-LLE algorithm is not only inherited the LLE algorithm translation, rotation and scaling invariance,but also uses of environmental attributes of the object to determine the local neighborhood,according to the object of their different neighbor's distribution to set a different weighted neighbor and the number of neighbors.In the search process to local neighbors,we adopt R* -tree to index environmental attributes in order to speed up the rate of retrieval.Analysis showed that Ad-LLE algorithm is more efficient and rational. (2) An outlier detection algorithm based on Ad-LLE.Ad-LLE algorithm is use in outlier mining,adopted by the Ad-LLE algorithm to reduce dimensionality in high-dimensional data and then used traditional outlier mining algorithms to detect outliers.We compare the new outlier mining algorithm to the distance-based outlier detection and other dimensional reduction methods used in outlier mining.From the results of detection,algorithm in this paper is suitable for high-dimensional outlier mining,meanwhile showed better performance in dimensional reduction and effective outlier detection.(3) Combination of local outlier and outlier own characteristics,we propose a fast outlier detection(Fast-OD).People are often only partial attention to local outlier and outliers are only a very small part in a whole dataset.When the dataset is large, mining outliers in entire dataset is difficult and inefficient.As a result,we consider the use of two heuristic pruning strategies to focus on the large number of non-outlier data in the dataset in order to improve the efficiency.The experiment proved that,in this paper,Fast-OD algorithm has the advantages on reducing user's dependence, reducing complexity,increasing the accuracy and scalability.(4) Through sampling study to obtain the global and local similar outlying degree (GT and LTN(yi)).In Fast-OD algorithm,when there are a large number of N,with the entire dataset for calculating GT and LTN(yi) are very difficult and time-consuming. Uniform sampling method is used to obtain the value of GT and LTN(yi) greatly reduces the complexity of the algorithm.Theory and analysis shows that using sampling techniques so that the Fast-OD algorithm is better for high-dimensional large database.
Keywords/Search Tags:outlier detection, attribute partition, high-dimensional index, dimension reduction, pruning strategy, weighted neighbor
PDF Full Text Request
Related items