Font Size: a A A

Study On Outlier Detection Based On K-nearest Neighborhood MST

Posted on:2015-03-23Degree:MasterType:Thesis
Country:ChinaCandidate:X G FanFull Text:PDF
GTID:2268330422472251Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
Outlier detection is a very important branch of data mining, aiming at finding thesmall data sets which is different from the main data sets. Outliers contain veryimportant information and knowledge. Therefore, to explore and utilize the hiddenwealth drive further development of outlier detection. In terms of theory, researchershave achieved great success in finding outlier detection methods. In practicalapplications, outlier detection is also widely used, such as credit card fraud detection,market analysis, weather analysis and image processing, etc.According to the scale, outliers are divided into point outliers and cluster outlier(or outlying cluster). Because of the complexity and diversity of data sets in practicalapplication, outliers exist not only in the form of points, but may also appear in theform of clusters. The small cluster reflects the non-mainstream exceptional mechanism,which is the reflection of outlier. If the small cluster can’t be effectively detected, itmight mean that some extremely useful information is missing. Therefore, outlyingcluster detection is also crucial for KDD. Current outlier detection algorithms aremainly concentrated on detecting outlier points, but often neglect outlying clusters.Besides, current methods is sensitive to the parameters and the shape of data sets.Based on the above discussion, a cutting method based on k-neighborhood MST. isproposed in this paper which aims at outlier detection as well as outlying clusterdetection. Taking density and directional factor into account, the dissimilarity measurebased on k-neighborhood is proposed, which has solved the problem of the density ofdata sets. Subsequently, the minimum spanning tree (MST) was established based onthis dissimilarity measure. Finally, we take turns cutting off the largest edges andregard the sub-trees of which the size is less than or equals k as outliers. The specificresearch and results are listed as following:①The background and significance of outlier detection are introduced, and theinvestigation and study of the current situation of outlier detection at home and abroadare present.②The outliers’ cause, outliers’ classification and application of outliers detectionare introduced. Meanwhile, different kinds of outlier detection algorithms areintroduced systematically and the advantages and limitations of all kinds of outliersdetection technology are analyzed emphatically. Finally, the current hotspot and trend of the technology of detecting outliers are expounded.③A new dissimilarity measure method(dissimilarity measure method based on knearest neighborhood) is proposed based on study of current dissimilarity measuremethod. The new method proposes that dissimilarity between data objects isdirectional for first time, and taking the density factors into account making the finalalgorithm more adapted to different density of data sets.④A new outlier detection method(Outlier Detection based on K-nearestNeighborhood MST,KNMOD) is present for the poor outlying clusters detection ofcurrent methods. The new algorithm greatly improves the detection efficiency of thealgorithm with use of Fibonacci heap, and can overcome the problem of data setsshape with the help of tree features.Compared with other methods such as LOF,KNN,INFLO on the artificial data setswith obvious local outliers and outlying cluster proves the properties of KNMOD. Atthe same time, the comparison experiments with algorithm LOF, KNN, INFLO andCOF on5real data sets from UCI is made and the validity and rationality of KNMODare verified.
Keywords/Search Tags:outlier detection, k nearest neighborhood, dissimilarity measure, minimumspanning tree, outlying cluster
PDF Full Text Request
Related items