Study Of Outlier Detecting Algorithm Based On Natural Nearest Neighbor And Weighted Attribute Entropy

Posted on: 2016-09-30
Degree: Master
Type: Thesis
Country: China
Candidate: J Y Wen
Full Text: PDF
GTID: 2308330479984810
Subject: Computer software and theory
Abstract/Summary:
Outlier detection is a new branch of data mining that helps people extract information with markedly abnormal characteristics from large and complex collections of data. It has been widely applied in the internet, communications, economics, medicine, geology, astronomy, and other fields, for tasks such as intrusion detection, credit fraud detection, ECG monitoring, earthquake prediction, and the discovery of new planets. As mankind moves into the age of digital information, all kinds of objects and phenomena can be stored and transformed in digital form, which makes it increasingly likely that people will deal with complex datasets in their daily work. The steady growth in data volume and dimensionality challenges the accuracy and efficiency of existing outlier detection algorithms.
Against this background, this paper summarizes and analyzes the state of research on outlier detection technology, and its achievements, in China and abroad. It then briefly introduces related applications and the preprocessing work for outlier detection, as well as the principles, advantages, and weaknesses of both traditional outlier detection algorithms and newer techniques.
Building on this introduction, and considering that an outlier is essentially a low-probability event and that outlier detection techniques have increasingly turned toward reflecting on and exploring the nature of outliers, this paper takes as its starting point outlier detection based on information entropy, which measures the degree of irregularity of a dataset by computing the distribution of attribute values.
After a comprehensive analysis of the development history and achievements of entropy-based outlier detection techniques and their variants, the EOF (Entropy Outlier Factor) algorithm, which has overall advantages in algorithmic complexity, detection rate, and applicability to general datasets, was chosen as the basis of the research. First, by improving the procedure by which the algorithm outputs outliers, this paper presents the NCEOF algorithm, a local optimization of EOF. Furthermore, to improve the detection rate and the applicability to datasets of different sizes, dimensionalities, and complexity, the concept of the natural nearest neighbor is introduced to measure how far a data point deviates in local attribute entropy on continuous attributes. Combining this with measures of global and local attribute weights, this paper presents the HLEAWOF algorithm, based on natural nearest neighbor and weighted attribute entropy.
Experiments with the new algorithms NCEOF and HLEAWOF were conducted on the Wisconsin Breast Cancer dataset from UCI and on part of the KDD-Cup99 dataset. Compared with the EOF algorithm in the same environment, NCEOF showed its local optimization over EOF, and HLEAWOF showed advantages over EOF in detection rate and in applicability across datasets.
Finally, this paper summarizes the related work and gives an outlook on future trends in outlier detection techniques.
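To make the entropy idea concrete, the following is a minimal Python sketch of entropy-based outlier scoring on categorical attributes. It is not the EOF, NCEOF, or HLEAWOF algorithm from the thesis, whose details are not given in this abstract. The function names `attribute_entropy` and `entropy_drop_scores`, the leave-one-out scoring rule, and the toy records are all assumptions made for illustration only.

```python
from collections import Counter
from math import log2


def attribute_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of one attribute."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def entropy_drop_scores(records):
    """Hypothetical score: for each record, the total decrease in attribute
    entropy when that record is removed. A record with rare attribute values
    makes the data more irregular, so removing it lowers the entropy and
    yields a larger score. Assumes at least two records.
    """
    columns = list(zip(*records))  # one tuple of values per attribute
    base = [attribute_entropy(col) for col in columns]
    scores = []
    for i in range(len(records)):
        score = 0.0
        for j, col in enumerate(columns):
            rest = col[:i] + col[i + 1:]  # the attribute without record i
            score += base[j] - attribute_entropy(rest)
        scores.append(score)
    return scores


# Toy data (invented): five identical records and one deviating record.
records = [("a", "x")] * 5 + [("b", "y")]
scores = entropy_drop_scores(records)
most_suspicious = max(range(len(records)), key=scores.__getitem__)
```

In this toy example the deviating record receives the highest score. The thesis additionally weights attributes and uses natural nearest neighbors to compute local entropy on continuous attributes, which this sketch does not attempt.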
Keywords/Search Tags: Outlier Detection, Natural Nearest Neighbor, Weighted Entropy, Entropy Increment, Local Information Entropy