Outlier Detection Algorithm And Its Parallelization Based On Weighted K-Nearest Neighbor

Posted on:2019-05-20

Degree:Master

Type:Thesis

Country:China

Candidate:J J Guo

Full Text:PDF

GTID:2428330566476376

Subject:Computer Science and Technology

Abstract/Summary:

PDF Full Text Request

K-nearest neighbor is a common query method in outlier data mining,but the traditional k-nearest neighbor query method is difficult to adapt to the massive and high-dimensional data set.The k-nearest neighbor query method,which performs well in low-dimensional space,has a different degree of deterioration in high-dimensional space.Especially,when the amount of data increases dramatically,the traditional k-nearest neighbor query method has low processing efficiency,and it is difficult to meet the actual needs of massive data analysis.In addition,the importance of all the attributes are considered equally in the traditional k-nearest neighbor query method.Ignoring differences among attributes will not be able to query meaningful neighbor data objects.In this thesis,weighted k nearest neighbors methods and outliers data mining algorithms are studied for massive and high-dimensional data set.The main research results are as follows.(1)An outlier data mining algorithm is presented based on weighted k-nearest neighbors.In the algorithm,the importance of all attributes in the data set is measured by using information entropy.Z-order space filling curve is used to encode high-dimensional data,and map it into Z-value,which is used as a basis for querying weighted k-nearest neighbors.On the basis of considering individual difference of each data object,a outlier mining algorithm based on weighted k-nearest neighbors is presented.The efficiency and accuracy of the algorithm is validated by using the artificial and UCI standard datasets.(2)The parallel query method of weighted k-nearest neighbor and the parallel mining algorithm of outlier are presented on Hadoop parallel computing platform.Firstly,the weights of each attribute are calculated by using information entropy in sample dataset which is generated through random sampling method.Secondly,Z-order space filling curve are generated by the input data and the weights,and Z-value of all data object is computed in each computing node.Third,the k-nearest neighbor candidate sets are divided byusing the LSH strategy.The similar Z-value objects are placed on the same computing node to alleviate the data skew problem among nodes.Fourth,outlier scores based on the distance between each object and its weighted k-nearest neighbor are calculated.Therefore,a novel parallel outlier mining algorithm is presented.In the end,the scalability and extensibility of the algorithm is validated by using the artificial,UCI standard datasets and stellar spectral records on Hadoop cluster.

Keywords/Search Tags:

Outlier Data, k-Nearest Neighbor Query, Weight, Space Filling Curve, MapReduce

PDF Full Text Request

Related items

1	Research On K Nearest Neighbor Queries And KNN-joins In High-dimensional Space
2	Research Of High-Dimensional Space Query Algorithm Based On Space-Filling Curves
3	Key Technologies Of Privacypreserving Query Based On Space-filling Curves For Location-based Services
4	Study On Moving K Neareast Neighbor Query Processing Techniques In Constrained Space
5	Research On The Visual Group K-Nearest Neighbor And Group Inverse K-Nearest Neighbor Query Of Multi-Source Objects In Three-Dimensional Space
6	Research On Multidimensional Data Query Based On Hilbert Spatial Filling Curve
7	Outlier Mining And Parallelization Based On Reverse K-Nearest Neighbor Count And Weight Pruning
8	LSH-based Similarity Search Of Web Data
9	The Research Of K-nearest Neighbor Query Of Spatial Data Based On Voronoi Diagram
10	Studies On Probabilistic Nearest Neighbor Queries Over Uncertain Data