
Outlier Mining And Parallelization Based On Hubness Phenomenon

Posted on: 2019-10-06
Degree: Master
Type: Thesis
Country: China
Candidate: F Guo
GTID: 2428330566976376
Subject: Computer Science and Technology
Abstract/Summary:
Outlier mining is one of the most important research topics in data mining, and reverse K-nearest neighbor (RKNN) is a common technique in outlier mining. Outliers are data objects that do not conform to the general pattern or behavior of the data, or that deviate markedly from the other data. However, because of the curse of dimensionality, most traditional outlier mining methods are difficult to apply to massive high-dimensional datasets. This thesis studies an outlier mining algorithm and its parallelization, targeting the hubness phenomenon in reverse K-nearest neighbors. The main contributions are as follows:

1) We propose an outlier mining algorithm based on the hubness phenomenon and a weighted outlier score. First, based on the hubness phenomenon in reverse K-nearest neighbors and its relationship with outliers, the distance information from the K-nearest neighbors (KNN) is used to weight the reverse K-nearest neighbor outlier score. Second, a discrimination threshold is randomly generated, and a satisfaction value is determined according to the threshold. Using the satisfaction value, the outlier score of each data object is calculated, and the data objects with the largest outlier scores are selected as outliers. Finally, experiments on synthetic and UCI datasets validate the efficiency and accuracy of the algorithm.

2) We propose a parallel outlier mining algorithm based on the hubness phenomenon, using the distributed computing framework Spark. Using Resilient Distributed Datasets (RDDs), data and computing tasks are distributed across the computing nodes, and the outlier scores of all data objects are calculated separately. The KNN, the reverse K-nearest neighbors, and the outlier scores are cached in memory, which reduces I/O cost and improves the efficiency of the algorithm. Experiments on stellar spectral datasets validate the scalability and extensibility of the algorithm.
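The idea behind the first contribution can be illustrated with a minimal sketch. The abstract does not give the thesis's weighting formula, and the random discrimination threshold and satisfaction value are not defined here, so they are omitted. The weight `1 / (1 + d)`, the score `1 / (1 + weighted count)`, and all function names below are assumptions made for illustration only. The sketch shows the general principle: a point that appears in few other points' K-nearest-neighbor lists (an "antihub") receives a high outlier score.

```python
import math

def knn_with_distances(points, k):
    """For each point, return its k nearest other points as (index, distance)."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        result.append([(j, d) for d, j in dists[:k]])
    return result

def weighted_rknn_outlier_scores(points, k):
    """Distance-weighted reverse-KNN outlier score (illustrative weighting).

    Each point i "votes" for its k nearest neighbors j; closer neighbors
    receive a larger vote. Points that collect little total vote are
    rarely anyone's neighbor, so they score close to 1.
    """
    weighted_count = [0.0] * len(points)
    for nbrs in knn_with_distances(points, k):
        for j, d in nbrs:
            # Assumed weight, not the thesis's formula.
            weighted_count[j] += 1.0 / (1.0 + d)
    return [1.0 / (1.0 + w) for w in weighted_count]

def top_outliers(points, k, n):
    """Return the indices of the n points with the largest outlier scores."""
    scores = weighted_rknn_outlier_scores(points, k)
    return sorted(range(len(points)), key=lambda i: scores[i], reverse=True)[:n]

if __name__ == "__main__":
    # Five clustered points plus one distant point at index 5.
    data = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
    print(top_outliers(data, k=3, n=1))
```

This brute-force version computes all pairwise distances, so it is quadratic in the number of points; that cost is the kind of bottleneck the Spark-based parallelization in the second contribution is aimed at.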
Keywords/Search Tags: Outlier Mining, RKNN, Hubness Phenomenon, Distance Weighting, Spark