
A Two-Phase Outlier Mining and Parallelization Method Based on Subspaces

Posted on: 2017-04-12
Degree: Master
Type: Thesis
Country: China
Candidate: Y J Yin
Full Text: PDF
GTID: 2348330509952863
Subject: Computer Science and Technology
Abstract/Summary:
Outlier mining is one of the main research topics in data mining. For high-dimensional data sets, a major problem in outlier mining is how to mitigate the curse of dimensionality while improving both the accuracy and the efficiency of mining. This thesis studies a two-stage outlier detection algorithm and its parallelization, with the aim of improving outlier mining performance. The main results are as follows:

(1) A two-stage outlier mining algorithm is presented, based on the idea of selecting candidate outliers to reduce the amount of computation. In the first stage, the density ratio of each data object is calculated in each dimension; the logarithm of the product of the average density ratios over all dimensions is taken as the density coefficient, and the candidate outliers are selected from the data objects. In the second stage, the product of a candidate object's deviation degrees from its neighbors in each subspace is taken as its deviation ratio, the product of the density coefficient and the deviation ratio is taken as its outlier coefficient, and the outliers are determined from these coefficients. Because the outlier coefficient is calculated only for the candidate objects, the algorithm improves mining efficiency. Experiments on UCI data sets verify that the algorithm both preserves the accuracy of the mining results and improves mining efficiency.

(2) A parallel version of the subspace-based two-stage outlier mining algorithm is presented using the MapReduce programming model. First, the objects in the data set are assigned to the child nodes, and on every node the density coefficient is calculated using a Map function. The results are then summarized on the main node using a Reduce function to obtain the candidate outlier set. To balance the load across nodes, the computing tasks are then divided equally among the child nodes according to the number of objects in the candidate outlier set. Each node computes the deviation ratio of its candidate objects, a Reduce function summarizes the results on the main node, and the outlier coefficient of each candidate object is calculated. Finally, the candidates are sorted and the outliers are screened out.
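The two-stage idea in point (1) can be illustrated with a small Python sketch. The abstract does not give the exact formulas, so everything concrete here is an assumption made for illustration: the per-dimension density is a neighbor count within a hypothetical `radius`, the density ratio is the average count divided by the object's own count, and the deviation ratio is built from the mean distance to the `k` nearest values in each dimension. The function names and parameters are likewise hypothetical and are not taken from the thesis.

```python
import math

def density_coefficients(data, radius):
    """Stage 1 (assumed form): log of the product of per-dimension density ratios."""
    n, dims = len(data), len(data[0])
    coeffs = [0.0] * n
    for j in range(dims):
        column = [row[j] for row in data]
        # Per-dimension density: number of objects within `radius`, self included.
        counts = [sum(1 for v in column if abs(v - u) <= radius) for u in column]
        mean_count = sum(counts) / n
        for i in range(n):
            # A ratio above 1 means object i lies in a sparser region than average.
            # Summing logs is the same as taking the log of the product.
            coeffs[i] += math.log(mean_count / counts[i])
    return coeffs

def select_candidates(coeffs, fraction):
    """Keep the indices with the largest density coefficients."""
    keep = max(1, math.ceil(fraction * len(coeffs)))
    order = sorted(range(len(coeffs)), key=lambda i: coeffs[i], reverse=True)
    return order[:keep]

def deviation_ratio(data, i, k):
    """Stage 2 (assumed form): product over dimensions of 1 + mean k-nearest distance."""
    ratio = 1.0
    for j in range(len(data[0])):
        gaps = sorted(abs(row[j] - data[i][j])
                      for idx, row in enumerate(data) if idx != i)
        ratio *= 1.0 + sum(gaps[:k]) / k
    return ratio

def two_stage_outliers(data, radius=1.0, fraction=0.3, k=2, top=1):
    """Score only the stage-1 candidates and return the indices of the top outliers."""
    coeffs = density_coefficients(data, radius)
    candidates = select_candidates(coeffs, fraction)
    # Outlier coefficient = density coefficient * deviation ratio, candidates only.
    scores = {i: coeffs[i] * deviation_ratio(data, i, k) for i in candidates}
    return sorted(scores, key=scores.get, reverse=True)[:top]

# Five clustered points and one distant point at index 5.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.2, 0.4), (10.0, 10.0)]
print(two_stage_outliers(points))
```

The saving described in the abstract shows up in `two_stage_outliers`: the more expensive `deviation_ratio` is evaluated only for the objects that survive the first stage, not for the whole data set.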
Keywords/Search Tags: Related subspace, Candidate outliers, MapReduce, Load balancing
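The parallel scheme in point (2) of the abstract can be sketched in the same spirit. This is a single-process Python simulation of the map and reduce steps, not code for a real MapReduce framework, and it rests on assumptions the abstract does not state: the full data set is taken to be readable by every worker, load balancing is done by a round-robin split, and the two scoring functions `toy_stage1` and `toy_stage2` are hypothetical placeholders standing in for the density coefficient and the deviation ratio.

```python
from functools import reduce

def split_evenly(items, workers):
    """Round-robin split so that chunk sizes differ by at most one (load balancing)."""
    return [items[w::workers] for w in range(workers)]

def map_score(chunk, data, score):
    """Map step: one worker scores only the object indices in its chunk."""
    return {i: score(data, i) for i in chunk}

def reduce_merge(left, right):
    """Reduce step: merge two partial score tables on the main node."""
    merged = dict(left)
    merged.update(right)
    return merged

def run_phase(indices, data, score, workers):
    """One map/reduce round; the map calls would run in parallel on a cluster."""
    chunks = split_evenly(indices, workers)
    partials = [map_score(chunk, data, score) for chunk in chunks]
    return reduce(reduce_merge, partials, {})

def parallel_two_stage(data, stage1, stage2, workers=3, keep=2, top=1):
    """Phase 1 scores every object; phase 2 redistributes only the candidates."""
    all_scores = run_phase(list(range(len(data))), data, stage1, workers)
    candidates = sorted(all_scores, key=all_scores.get, reverse=True)[:keep]
    final = run_phase(candidates, data, stage2, workers)
    return sorted(final, key=final.get, reverse=True)[:top]

def toy_stage1(data, i):
    """Placeholder score: distance of object i from the per-dimension mean."""
    dims = len(data[0])
    means = [sum(row[j] for row in data) / len(data) for j in range(dims)]
    return sum(abs(data[i][j] - means[j]) for j in range(dims))

def toy_stage2(data, i):
    """Placeholder score: Manhattan distance to the nearest other object."""
    return min(sum(abs(a - b) for a, b in zip(data[i], row))
               for idx, row in enumerate(data) if idx != i)

# Five clustered points and one distant point at index 5.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.2, 0.4), (10.0, 10.0)]
print(parallel_two_stage(points, toy_stage1, toy_stage2))
```

The second call to `run_phase` splits the candidate list afresh, so each worker receives an equal share of the remaining work regardless of which worker originally held those objects.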