Outlier Mining And Parallelization Based On Reverse K-Nearest Neighbor Count And Weight Pruning

Posted on: 2020-02-07 | Degree: Master | Type: Thesis
Country: China | Candidate: Y L Zhu | Full Text: PDF
GTID: 2428330590456614 | Subject: Computer Science and Technology
Abstract/Summary:
The rapid rise of the Internet has led to explosive technological development in all fields of life, producing massive high-dimensional datasets in many forms. This "data-rich, information-poor" situation places higher demands on data analysis methods. Data mining, an important method of data analysis, refers to the discovery of hidden, previously unknown, but potentially useful and interesting patterns or information in massive data. Traditional outlier mining methods suffer from the "curse of dimensionality": in high-dimensional spaces the similarities or distances among data objects become nearly indistinguishable, so both mining quality and efficiency are poor. This thesis studies an outlier mining method and its parallelization using the reverse k-nearest neighbor count and a weight pruning strategy. The main research results are as follows:

(1) An outlier mining algorithm (RKNNCWP) based on reverse k-nearest neighbor count and weight pruning is presented. First, for each object, the number of times it appears among the k nearest neighbors of all other objects in the dataset is counted, which gives the antihub score of each object. Second, the ratio of the mean distance between an object and its k nearest neighbors to the mean k-nearest-neighbor distance over the whole dataset is used as the object's weight, and the objects whose weight is greater than or equal to 1 form the outlier candidate set. The weighted antihub scores of the objects in the candidate set are then calculated. The outlier score formula is redefined using the reverse k-nearest neighbor count, the value of k, and the mean distance between an object and its k nearest neighbors (KNN). The outlier scores of all objects in the candidate set are calculated, and the top-n objects with the largest scores are selected as outliers. Finally, experiments on synthetic and UCI datasets validate the effectiveness and feasibility of the algorithm.

(2) Based on the Spark distributed computing platform, a parallel outlier mining algorithm (SRKNNCWP) based on reverse k-nearest neighbor count and weight pruning is presented. In the algorithm, the KNN information is converted into a resilient distributed dataset (RDD). The antihub scores of each object and its k nearest neighbors, together with the object weights, are kept in memory, and an outlier candidate set is generated. This effectively improves the parallel efficiency of outlier mining and reduces I/O costs. Finally, experiments on a synthetic dataset validate the scalability and extensibility of the parallel algorithm.
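
The candidate-generation steps of RKNNCWP described in (1) can be illustrated with a minimal, single-machine Python sketch. It is not the thesis implementation: the function names (`knn_table`, `rknn_candidates`, `top_n_outliers`) are hypothetical, the brute-force Euclidean neighbor search is an assumption, and because the abstract does not state the redefined outlier score formula, the ranking step uses a placeholder score that is labeled as such in the code.

```python
import math


def knn_table(points, k):
    """For each point, return its k nearest neighbours as (distance, index) pairs.

    Brute-force Euclidean search; an assumption made for brevity.
    """
    table = []
    for i, p in enumerate(points):
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        table.append(dists[:k])
    return table


def rknn_candidates(points, k):
    """Return (reverse k-NN counts, weights, candidate indices)."""
    table = knn_table(points, k)
    n = len(points)

    # Reverse k-NN count: how often each object appears in the
    # k-nearest-neighbour lists of the other objects.
    rknn_count = [0] * n
    for neighbours in table:
        for _, j in neighbours:
            rknn_count[j] += 1

    # Weight: mean distance from an object to its k nearest neighbours,
    # divided by the mean of that quantity over the whole dataset.
    mean_knn_dist = [sum(d for d, _ in nb) / k for nb in table]
    dataset_mean = sum(mean_knn_dist) / n
    weight = [m / dataset_mean for m in mean_knn_dist]

    # Weight pruning: only objects with weight >= 1 stay in the candidate set.
    candidates = [i for i in range(n) if weight[i] >= 1.0]
    return rknn_count, weight, candidates


def top_n_outliers(points, k, n_out):
    """Rank the candidates and return the indices of the top n_out."""
    rknn_count, weight, candidates = rknn_candidates(points, k)
    # ASSUMPTION: placeholder score, NOT the formula defined in the thesis.
    # It only encodes the intuition that a large weight and a small
    # reverse k-NN count both suggest an outlier.
    score = {i: weight[i] / (1 + rknn_count[i]) for i in candidates}
    return sorted(candidates, key=lambda i: score[i], reverse=True)[:n_out]
```

For example, with the points `[(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]` and `k = 2`, the point `(1, 0)` and the point `(10, 10)` both have a reverse k-NN count of 0, but only `(10, 10)` has a weight of at least 1. The weight pruning step therefore removes `(1, 0)` before any scoring takes place, which is the role the abstract assigns to it.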
Keywords/Search Tags: Outlier mining, Reverse k-nearest neighbor, Weight pruning, Spark, Distributed computation