
Contextual Outlier Mining And Parallelization Based On Weighted Probability Density

Posted on: 2022-04-09  Degree: Master  Type: Thesis
Country: China  Candidate: H Bai  Full Text: PDF
GTID: 2518306521496784  Subject: Computer Science and Technology
Abstract/Summary:
Outlier detection is an important branch of data mining. Its task is to identify data objects that differ significantly from the vast majority of objects in a data set and to reveal the meaningful information and knowledge hidden behind them. With the development of information technology, the volume of data is growing rapidly and its dimensionality keeps increasing. The "curse of dimensionality" has become one of the main factors degrading the effectiveness of traditional outlier detection algorithms, making them difficult to apply to big data analysis tasks. This thesis uses weighted probability density and relevant subspaces to study contextual outlier detection and its parallelization. The main results are as follows.

(1) A contextual outlier detection algorithm based on weighted probability density is presented. First, a Gaussian mixture model and a sparsity matrix are used to determine the relevant subspace. Second, within the relevant subspace, the weighted probability density is used to compute a local outlier factor, which effectively reflects and describes the degree of inconsistency between a data object and its surrounding objects. Then, the N data objects with the largest outlier factors are reported as outliers, and the outlier factors, the attribute values of the relevant subspace, and the local data sets are provided as contextual information to improve the interpretability and comprehensibility of the detected outliers. Finally, the effectiveness of the algorithm is verified by experiments on synthetic and UCI data sets.

(2) Based on the Spark parallel computing platform, a parallel outlier detection algorithm based on weighted probability density is presented. First, resilient distributed datasets (RDDs) are used to keep the intermediate results produced for the local data sets, the sparsity matrix, the attribute weights, and the relevant-subspace matrix in memory. Second, the outlier score of each data object is computed on each compute node. Finally, the scalability of the proposed algorithm is verified by experiments on synthetic data sets.
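The abstract does not give the exact formulas behind contribution (1), so the sketch below is only illustrative: it assumes k-nearest-neighbor neighborhoods as the local data sets, replaces the Gaussian-mixture/sparsity-matrix step with a simpler variance-ratio heuristic for choosing the relevant subspace, and uses an attribute-weighted Gaussian kernel density as the "weighted probability density". The function and parameter names (contextual_outliers, sparsity_threshold, and so on) are hypothetical, not taken from the thesis.

```python
# Minimal sketch (assumed formulas, not the thesis's method): contextual
# outlier scoring with a per-object relevant subspace and a weighted density.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def contextual_outliers(X, k=20, top_n=10, sparsity_threshold=0.5):
    n, d = X.shape
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nbrs.kneighbors(X)            # idx[i] = object i plus its k nearest neighbors
    global_var = X.var(axis=0) + 1e-12

    results = []
    for i in range(n):
        local = X[idx[i]]                  # local data set of object i
        local_var = local.var(axis=0)
        # Assumed sparsity criterion: an attribute is "relevant" when the
        # neighborhood is much more concentrated there than the full data set.
        relevance = 1.0 - np.clip(local_var / global_var, 0.0, 1.0)
        subspace = np.where(relevance >= sparsity_threshold)[0]
        if subspace.size == 0:             # fall back to the full attribute space
            subspace = np.arange(d)
            w = np.full(d, 1.0 / d)
        else:
            w = relevance[subspace] / relevance[subspace].sum()

        # Assumed weighted probability density: attribute-weighted Gaussian
        # kernel density of object i over its local data set.
        h = local[:, subspace].std(axis=0) + 1e-12        # per-attribute bandwidth
        diffs = (local[1:, subspace] - X[i, subspace]) / h
        density = np.mean(np.exp(-0.5 * np.sum(w * diffs ** 2, axis=1)))
        outlier_factor = 1.0 / (density + 1e-12)          # low density -> high score

        # Context information kept alongside the score for interpretability.
        results.append({
            "index": i,
            "outlier_factor": outlier_factor,
            "subspace": subspace.tolist(),
            "subspace_values": X[i, subspace].tolist(),
            "local_set": idx[i, 1:].tolist(),
        })
    results.sort(key=lambda r: r["outlier_factor"], reverse=True)
    return results[:top_n]                 # the N objects with the largest factors

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 8))
    X[:3, :2] += 6.0                       # shift a few points so some stand out
    for r in contextual_outliers(X, k=30, top_n=3):
        print(r["index"], round(r["outlier_factor"], 2), r["subspace"])
```

Returning the subspace, its attribute values, and the local data set together with the score is what makes the result "contextual": a reader can see not only that an object is unusual, but in which attributes and relative to which neighbors.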
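For contribution (2), the following PySpark sketch shows only the general shape of such a parallelization, not the thesis's implementation: the reference data is broadcast to the executors, each partition scores its own objects with an assumed kernel-density-based factor, intermediate results are cached in memory, and the N largest scores are collected at the driver. The names score_partition and "wpd-outliers", the use of a k-d tree, and the scoring formula are all assumptions.

```python
# Minimal PySpark sketch (assumed structure): partition-wise outlier scoring
# with a broadcast reference data set and cached intermediate results.
import numpy as np
from pyspark.sql import SparkSession

def score_partition(rows, X_ref, k=20):
    """Score every (index, vector) pair in one partition against X_ref."""
    from scipy.spatial import cKDTree     # imported on the executor
    tree = cKDTree(X_ref)
    for i, x in rows:
        dists, _ = tree.query(x, k=k + 1)  # local data set by distance
        # Assumed score: inverse of a Gaussian-kernel density over the k neighbors.
        h = np.median(dists[1:]) + 1e-12
        density = np.mean(np.exp(-0.5 * (dists[1:] / h) ** 2))
        yield i, 1.0 / (density + 1e-12)

if __name__ == "__main__":
    spark = SparkSession.builder.appName("wpd-outliers").getOrCreate()
    sc = spark.sparkContext

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10000, 8))
    X_bc = sc.broadcast(X)                 # reference data shared with every executor

    scores = (
        sc.parallelize(list(enumerate(X)), numSlices=8)
          .mapPartitions(lambda rows: score_partition(rows, X_bc.value))
          .cache()                         # keep intermediate scores in memory
    )
    top = scores.takeOrdered(10, key=lambda t: -t[1])   # N largest outlier factors
    print(top)
    spark.stop()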
Keywords/Search Tags: Outlier mining, Subspace, Weighted probability density, Contextual information, Spark