Research And Application Of Outlier Detection Algorithm

Posted on: 2018-05-25
Degree: Master
Type: Thesis
Country: China
Candidate: X J Ma
GTID: 2428330545454471
Subject: Applied Mathematics
Abstract/Summary:
Data mining can be regarded as a result of the natural evolution of information technology and is an active topic in computer science research. Its significance lies in extracting new, practical, and potentially valuable information from massive data and, ultimately, in discovering knowledge that people can understand. Traditional data mining generally focuses on the patterns followed by most of the data in a data set, as in decision-based classification, association rules, frequent pattern mining, and cluster analysis. Outlier mining, by contrast, is the discovery of relatively sparse and isolated objects in large data sets. It is becoming a useful tool in many applications, especially medical detection, fraud detection in finance, network intrusion monitoring, and disaster prediction in meteorological forecasting.

Existing outlier detection methods fall mainly into the following categories: statistics-based, frequency-based, distance-based, depth-based, and density-based methods. These methods are designed for specific kinds of data objects, whereas real-world data objects usually have both numerical and categorical attributes. Most outlier detection algorithms can analyze only numerical attributes or only categorical attributes, so they handle data sets with mixed attributes poorly. Traditional algorithms cannot deal with mixed data, and the accuracy of most existing algorithms for mixed data is lower than desired.

To solve this problem, a two-stage outlier detection algorithm for mixed data is proposed, which combines DBSCAN clustering with a new local outlier factor, LAOF, based on regional density. In the first stage, the mixed data is preliminarily filtered by the DBSCAN clustering algorithm. The DBSCAN parameters ε and MinPts normally have to be set manually, which may lead to poor accuracy. In this thesis, the number of nearest neighbors K is supplied in place of MinPts, and the cluster radius is determined from the K nearest neighbors, which reduces the number of input parameters and improves clustering quality. The clustering procedure is also designed to trade space for time. In the second stage, the degree of local anomaly of each object in the preliminarily filtered, anomalous mixed data set is calculated using the newly constructed factor LAOF, which is based on regional density. When distances are measured for mixed data, the attribute weights are determined from differences in information entropy, and a second round of weight determination is carried out in the further detection process so that outlying attributes are highlighted.

The algorithm is validated on real data sets, and the results show that it improves the accuracy of outlier detection. The improved algorithm is also applied to the detection of heart disease as a medical problem, where it is reported to be both more accurate and faster.
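The first-stage idea described above (one input K, a radius derived from the K nearest neighbors, and an entropy-weighted distance for mixed attributes) can be illustrated with a minimal sketch. The abstract does not give the thesis's formulas, so everything specific here is an assumption: the function names are hypothetical, attribute weights are taken as simply proportional to each attribute's Shannon entropy, numeric attributes are discretised into equal-width bins before computing entropy, and the radius is set to the mean K-th-neighbor distance. The sketch only flags objects whose K-th neighbor lies beyond that radius; it is not a full DBSCAN implementation and does not compute LAOF.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def attribute_weights(data, numeric_idx, bins=4):
    """ASSUMPTION: weight each attribute by its share of the total entropy.
    Numeric attributes are discretised into equal-width bins first."""
    ents = []
    for j in range(len(data[0])):
        col = [row[j] for row in data]
        if j in numeric_idx:
            lo, hi = min(col), max(col)
            span = (hi - lo) or 1.0
            col = [min(bins - 1, int((v - lo) / span * bins)) for v in col]
        ents.append(entropy(col))
    total = sum(ents)
    if total == 0:  # every attribute is constant: fall back to uniform weights
        return [1.0 / len(ents)] * len(ents)
    return [e / total for e in ents]

def mixed_distance(a, b, numeric_idx, ranges, weights):
    """Weighted sum of per-attribute distances: range-normalised absolute
    difference for numeric attributes, 0/1 mismatch for categorical ones."""
    total = 0.0
    for j, (x, y) in enumerate(zip(a, b)):
        if j in numeric_idx:
            d = abs(x - y) / ranges[j] if ranges[j] else 0.0
        else:
            d = 0.0 if x == y else 1.0
        total += weights[j] * d
    return total

def knn_candidates(data, numeric_idx, k):
    """Derive a radius from the K-th-neighbor distances and return the
    indices of objects whose K-th neighbor lies beyond that radius."""
    weights = attribute_weights(data, numeric_idx)
    ranges = {j: max(r[j] for r in data) - min(r[j] for r in data)
              for j in numeric_idx}
    k_dist = []
    for i, p in enumerate(data):
        ds = sorted(mixed_distance(p, q, numeric_idx, ranges, weights)
                    for m, q in enumerate(data) if m != i)
        k_dist.append(ds[k - 1])
    eps = sum(k_dist) / len(k_dist)  # ASSUMPTION: radius = mean K-distance
    candidates = [i for i, d in enumerate(k_dist) if d > eps]
    return eps, candidates

# Toy mixed data: (numeric attribute, categorical attribute)
data = [(1.0, "a"), (1.1, "a"), (0.9, "a"), (1.2, "a"), (5.0, "b")]
eps, candidates = knn_candidates(data, numeric_idx={0}, k=2)
print(eps, candidates)
```

On this toy data only the last object, which differs from the others in both attributes, is flagged. In a two-stage scheme such as the one the abstract describes, only the flagged objects would then be passed to the more expensive local outlier scoring.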
Keywords/Search Tags: Data Mining, Outlier Detection, Information Entropy, Clustering, Outlier Factor