With the rapid development of computer technology, data mining has also advanced rapidly. Outlier mining, also called outlier detection, is an important branch of data mining. An outlier is a data object that is inconsistent with most of the data or deviates from normal behavior. Outlier detection is widely used in network intrusion detection, medical diagnosis, credit card fraud detection, fault diagnosis, and other fields, and it is an important means of extracting useful information in the big data era. Researchers at home and abroad have proposed a variety of outlier detection methods, including methods based on statistics, distance, density, and clustering. According to the attribute type of the data, outliers can be divided into categorical outliers and numerical outliers. This thesis reviews the background, significance, and research status of outlier detection methods and focuses on these two data types.

For categorical outliers, a pruning algorithm for categorical attribute data is proposed. This preprocessing step removes objects that cannot be outliers, and its rationality is proved. An improved outlier detection method based on information entropy is then introduced. Entropy-based detection is applied only to the candidate set obtained by pruning, which avoids scanning the data set many times and improves time efficiency. For sparsely distributed data, attribute value frequency (AVF) is proposed as an auxiliary criterion to improve detection accuracy. Experimental results show that the proposed method detects outliers in categorical attribute data more efficiently and with higher accuracy.

For numerical outliers, K-means clustering and the density-based LOF (local outlier factor) algorithm are used to detect outliers. To reduce the number of iterations of the K-means algorithm and improve clustering efficiency, a high-density set is selected as the candidate set of cluster centers, and the initial cluster centers are then selected from it using the maximum distance product method. The whole clustering process is combined with the MapReduce programming model. A pruning algorithm is used to select a candidate set of outliers for each cluster. Finally, the candidate set is examined a second time with the density-based LOF algorithm to obtain more accurate outliers. Experimental results show that the initial cluster center selection based on the distance product yields higher clustering efficiency, and that the proposed method is more accurate and has better scalability and speedup for outliers in numerical attribute data.
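
The attribute value frequency (AVF) criterion mentioned above for categorical data can be illustrated with a minimal Python sketch. It assumes the commonly used definition, in which a record's AVF score is the mean frequency of its attribute values in the data set and low scores suggest outliers. The thesis's exact formulation and its combination with the entropy-based method are not stated here, and the function names `avf_scores` and `lowest_avf` are hypothetical.

```python
from collections import Counter


def avf_scores(records):
    """Return one AVF score per record (assumed definition: the mean
    frequency of the record's attribute values across the data set)."""
    if not records:
        return []
    n_attrs = len(records[0])
    # One frequency table per attribute (column).
    counts = [Counter(r[j] for r in records) for j in range(n_attrs)]
    return [
        sum(counts[j][r[j]] for j in range(n_attrs)) / n_attrs
        for r in records
    ]


def lowest_avf(records, k):
    """Return the indices of the k records with the lowest AVF scores,
    i.e. the records whose attribute values are rarest on average."""
    scores = avf_scores(records)
    order = sorted(range(len(records)), key=lambda i: scores[i])
    return order[:k]


if __name__ == "__main__":
    # Hypothetical toy data: three identical records and one rare record.
    data = [("a", "x"), ("a", "x"), ("a", "x"), ("b", "y")]
    print(avf_scores(data))
    print(lowest_avf(data, 1))
```

In this sketch the scoring needs only one pass to build the frequency tables and one pass to score the records, which is why AVF is inexpensive enough to serve as an auxiliary check alongside a costlier criterion.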