Outlier Mining And Parallelization Based On Reverse K-Nearest Neighbor Count And Weight Pruning

Posted on:2020-02-07

Degree:Master

Type:Thesis

Country:China

Candidate:Y L Zhu

Full Text:PDF

GTID:2428330590456614

Subject:Computer Science and Technology

Abstract/Summary:

The rapid rise of the Internet has driven explosive technological development in all areas of life, producing massive high-dimensional data sets in various forms. This "data-rich, information-poor" situation places higher demands on data analysis methods. Data mining, an important method of data analysis, refers to the discovery of hidden, previously unknown, but potentially useful and interesting patterns or information from massive data. Traditional outlier mining methods are affected by the "curse of dimensionality": the similarity or distance among data objects becomes indistinguishable, so both mining effectiveness and efficiency are poor. In this thesis, an outlier mining method and its parallelization are studied using a reverse k-nearest neighbor count and a weight pruning strategy. The main research results are as follows:

(1) An outlier mining algorithm (RKNNCWP) based on reverse k-nearest neighbor count and weight pruning is presented. First, for each object in the dataset, the number of times it appears among the k nearest neighbors of all other objects is counted, giving the antihub score of each object. Second, the ratio of the mean distance between an object and its k nearest neighbors to the mean k-nearest-neighbor distance of the whole dataset is used as the weight, and objects whose weight is greater than or equal to 1 are selected into the outlier candidate list. The weighted antihub scores of the objects in the list are then calculated. The outlier score formula is redefined using the reverse k-nearest neighbor count, the value of k, and the mean distance between an object and its k nearest neighbors (KNN). The outlier score of every object in the candidate set is calculated, and the top-n objects with the largest scores are selected as outliers. Finally, experiments on synthetic and UCI datasets validate the effectiveness and feasibility of the algorithm.

(2) Based on the Spark distributed computing platform, a parallel outlier mining algorithm (SRKNNCWP) based on reverse k-nearest neighbor count and weight pruning is presented. In the algorithm, the KNN information is converted into a resilient distributed dataset (RDD). The antihub scores of each object and its KNN, together with the weights of the objects, are stored in memory, and an outlier candidate set is generated. The parallel efficiency of outlier mining is thereby effectively improved, and I/O costs are reduced. Finally, experiments on a synthetic dataset validate the scalability and extensibility of the parallel algorithm.
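
The serial steps described in (1) can be illustrated with a minimal single-machine sketch in Python. It is not the thesis implementation. The function names (knn_table, rknn_outliers) are hypothetical, the neighbor search is brute force, and the abstract does not state the redefined outlier score formula, so the score used here (mean KNN distance times k, divided by one plus the reverse k-nearest neighbor count) is only an assumed stand-in that combines the same three quantities.

```python
import math


def knn_table(points, k):
    """For each point, return its k nearest neighbors as (distance, index) pairs."""
    table = []
    for i, p in enumerate(points):
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        table.append(dists[:k])
    return table


def rknn_outliers(points, k, top_n):
    """Return indices of the top_n candidate outliers (hypothetical helper)."""
    if not 0 < k < len(points):
        raise ValueError("k must satisfy 0 < k < number of points")

    knn = knn_table(points, k)

    # Reverse k-NN count: how often each object appears in the
    # k-nearest-neighbor lists of the other objects.
    rknn_count = [0] * len(points)
    for neighbors in knn:
        for _, j in neighbors:
            rknn_count[j] += 1

    # Mean distance from each object to its k nearest neighbors,
    # and the mean of that quantity over the whole dataset.
    mean_knn_dist = [sum(d for d, _ in nb) / k for nb in knn]
    global_mean = sum(mean_knn_dist) / len(points)
    if global_mean == 0:
        return []  # all points coincide, nothing to rank

    # Weight pruning: keep only objects whose weight is >= 1.
    weights = [m / global_mean for m in mean_knn_dist]
    candidates = [i for i, w in enumerate(weights) if w >= 1.0]

    # ASSUMPTION: the thesis formula is not given in the abstract.
    # This stand-in grows with the mean KNN distance and shrinks as
    # the reverse k-NN count grows.
    def score(i):
        return mean_knn_dist[i] * k / (1 + rknn_count[i])

    return sorted(candidates, key=score, reverse=True)[:top_n]
```

As a usage example, calling rknn_outliers on five points clustered near the origin plus one point at (10, 10), with k=2 and top_n=1, prunes the clustered points because their weights fall below 1 and leaves only the distant point as a candidate.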

Keywords/Search Tags:

Outlier mining, Reverse k-nearest neighbor, Weight pruning, Spark, Distributed computation

Related items

1	The Study On Multiple Types Reverse Nearest Neighbor Queries
2	Research On Spatial Queries For Moving Objects In Indoor Space
3	Outlier Detection Algorithm And Its Parallelization Based On Weighted K-Nearest Neighbor
4	Outlier Detection Algorithm And Application For Hubness Phenomenon
5	Research And Application Of K Nearest Neighbor Classification Algorithm Based On Spark
6	Research On Spatial Clustering Algorithms
7	Research On Improved K Nearest Neighbor Algorithm Based On Spark Cloud Computing Platform
8	Research On Optimal-Nearest-Neighbor And Reverse Visible Nearest Neighbor Queries
9	Research Of Reverse K-Nearest Neighbor For Moving Objects
10	Research On Data Stream Reverse K Nearest Neighbors Outlier Mining Algorithm Based On X* Tree