Font Size: a A A

Outlier Detection Algorithm And Its Parallelization Based On Weighted K-Nearest Neighbor

Posted on:2019-05-20Degree:MasterType:Thesis
Country:ChinaCandidate:J J GuoFull Text:PDF
GTID:2428330566476376Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
K-nearest neighbor is a common query method in outlier data mining,but the traditional k-nearest neighbor query method is difficult to adapt to the massive and high-dimensional data set.The k-nearest neighbor query method,which performs well in low-dimensional space,has a different degree of deterioration in high-dimensional space.Especially,when the amount of data increases dramatically,the traditional k-nearest neighbor query method has low processing efficiency,and it is difficult to meet the actual needs of massive data analysis.In addition,the importance of all the attributes are considered equally in the traditional k-nearest neighbor query method.Ignoring differences among attributes will not be able to query meaningful neighbor data objects.In this thesis,weighted k nearest neighbors methods and outliers data mining algorithms are studied for massive and high-dimensional data set.The main research results are as follows.(1)An outlier data mining algorithm is presented based on weighted k-nearest neighbors.In the algorithm,the importance of all attributes in the data set is measured by using information entropy.Z-order space filling curve is used to encode high-dimensional data,and map it into Z-value,which is used as a basis for querying weighted k-nearest neighbors.On the basis of considering individual difference of each data object,a outlier mining algorithm based on weighted k-nearest neighbors is presented.The efficiency and accuracy of the algorithm is validated by using the artificial and UCI standard datasets.(2)The parallel query method of weighted k-nearest neighbor and the parallel mining algorithm of outlier are presented on Hadoop parallel computing platform.Firstly,the weights of each attribute are calculated by using information entropy in sample dataset which is generated through random sampling method.Secondly,Z-order space filling curve are generated by the input data and the weights,and Z-value of all data object is computed in each computing node.Third,the k-nearest neighbor candidate sets are divided byusing the LSH strategy.The similar Z-value objects are placed on the same computing node to alleviate the data skew problem among nodes.Fourth,outlier scores based on the distance between each object and its weighted k-nearest neighbor are calculated.Therefore,a novel parallel outlier mining algorithm is presented.In the end,the scalability and extensibility of the algorithm is validated by using the artificial,UCI standard datasets and stellar spectral records on Hadoop cluster.
Keywords/Search Tags:Outlier Data, k-Nearest Neighbor Query, Weight, Space Filling Curve, MapReduce
PDF Full Text Request
Related items