A Two Phases Outlier Mining And Paralleling Method Based On Subspace

Posted on:2017-04-12

Degree:Master

Type:Thesis

Country:China

Candidate:Y J Yin

GTID:2348330509952863

Subject:Computer Science and Technology

Abstract/Summary:

Outlier mining is one of the main research topics in the field of data mining. For high-dimensional data sets, mitigating the curse of dimensionality while improving the accuracy and efficiency of mining is a major problem in outlier mining. This thesis studies a two-stage outlier detection algorithm and its parallelization, with the aim of improving the performance of outlier mining. The main research results are as follows:

(1) A two-stage outlier mining algorithm is presented that reduces the amount of computation by first selecting potential outliers. In the first stage, the density ratio of each data object is calculated in each dimension; the logarithm of the product of the average density ratios over all dimensions is then taken as the density coefficient, and the candidate outlier set is selected on that basis. In the second stage, the product of the deviation degrees of a candidate object from its neighbors in each subspace is taken as the deviation ratio, the product of the density coefficient and the deviation ratio is taken as the outlier coefficient, and the outliers are determined from it. Because the outlier coefficient is calculated only for candidate objects, the algorithm improves mining efficiency. Experiments on UCI data sets verify that the algorithm both preserves the accuracy of the mining results and improves mining efficiency.

(2) A parallel version of the two-stage, subspace-based outlier mining algorithm is presented using the MapReduce programming model. First, the data set is distributed across the child nodes, and on each node the density coefficient is calculated with a Map function. The results are then aggregated on the main node with a Reduce function to obtain the candidate outlier set. To balance the load across nodes, the computing tasks are then reassigned in equal shares to the child nodes according to the number of objects in the candidate outlier set. Each node computes the deviation ratio of its candidate objects, a Reduce function aggregates the results on the main node, and the outlier coefficient of each candidate object is calculated. Finally, the objects are sorted and the outliers are screened out.
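
The two stages described in (1) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's actual formulas, which the abstract does not give: the per-dimension density ratio is approximated by an object's mean distance to its k nearest one-dimensional neighbors divided by the data set average, the deviation degree in a subspace is approximated by one plus the distance to the centroid of the k nearest neighbors, and the subspaces are assumed to be all pairs of dimensions. All function names and parameters are hypothetical.

```python
import math
from itertools import combinations

def knn_mean_dist_1d(values, i, k):
    """Mean absolute distance from values[i] to its k nearest other values."""
    dists = sorted(abs(values[i] - v) for j, v in enumerate(values) if j != i)
    return sum(dists[:k]) / k

def density_coefficients(data, k=2, eps=1e-9):
    """Stage 1 (assumed form): log of the product of per-dimension density ratios."""
    n, dims = len(data), len(data[0])
    coeffs = [0.0] * n
    for d in range(dims):
        column = [row[d] for row in data]
        # Larger spread means lower density around the object in this dimension.
        spread = [knn_mean_dist_1d(column, i, k) + eps for i in range(n)]
        mean_spread = sum(spread) / n
        for i in range(n):
            # The log of a product equals the sum of the logs.
            coeffs[i] += math.log(spread[i] / mean_spread)
    return coeffs

def deviation_ratio(data, i, subspaces, k=2):
    """Stage 2 (assumed form): product over subspaces of a deviation degree."""
    ratio = 1.0
    for dims in subspaces:
        point = [data[i][d] for d in dims]
        others = [[row[d] for d in dims] for j, row in enumerate(data) if j != i]
        others.sort(key=lambda q: math.dist(point, q))
        nearest = others[:k]
        centroid = [sum(col) / len(nearest) for col in zip(*nearest)]
        ratio *= 1.0 + math.dist(point, centroid)
    return ratio

def two_stage_outliers(data, n_candidates, n_outliers, k=2):
    """Return indices of the top-scoring objects, scoring candidates only."""
    coeffs = density_coefficients(data, k)
    order = sorted(range(len(data)), key=lambda i: coeffs[i], reverse=True)
    candidates = order[:n_candidates]          # stage 1: candidate selection
    subspaces = list(combinations(range(len(data[0])), 2))  # assumption: all 2-D subspaces
    scores = {i: coeffs[i] * deviation_ratio(data, i, subspaces, k)
              for i in candidates}             # stage 2: candidates only
    return sorted(scores, key=scores.get, reverse=True)[:n_outliers]
```

The efficiency argument is visible in `two_stage_outliers`: the more expensive subspace computation runs only for `n_candidates` objects rather than for the whole data set. Note that with this assumed density coefficient a candidate from a dense region gets a negative value, so the sketch is meaningful mainly when the candidate set is small relative to the data.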

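The parallel flow described in (2) can be illustrated with a small single-process simulation of the two MapReduce rounds. This is a hedged sketch, not the thesis's implementation: it uses plain Python functions instead of a real MapReduce framework, and `density_fn` and `deviation_fn` are hypothetical placeholders that score one object at a time, whereas the actual scores would depend on each object's neighbors. The part it tries to show is the rebalancing step, in which the candidate set is split again in equal shares before the second round.

```python
from functools import reduce

def partition(items, n_nodes):
    """Split items into n_nodes chunks whose sizes differ by at most one."""
    return [items[i::n_nodes] for i in range(n_nodes)]

def map_score(chunk, score_fn):
    """Map step: a node scores its own objects and emits (id, score) pairs."""
    return [(obj_id, score_fn(obj)) for obj_id, obj in chunk]

def reduce_merge(left, right):
    """Reduce step: the main node concatenates the per-node results."""
    return left + right

def run_round(records, n_nodes, score_fn):
    """One simulated MapReduce round over (id, object) records."""
    chunks = partition(records, n_nodes)
    # On a cluster these map calls would run in parallel on the child nodes.
    mapped = [map_score(chunk, score_fn) for chunk in chunks]
    return reduce(reduce_merge, mapped, [])

def parallel_two_stage(records, n_nodes, density_fn, deviation_fn,
                       n_candidates, n_outliers):
    # Round 1: density coefficient for every object, then candidate selection.
    density = dict(run_round(records, n_nodes, density_fn))
    candidate_ids = sorted(density, key=density.get, reverse=True)[:n_candidates]
    by_id = dict(records)
    candidates = [(i, by_id[i]) for i in candidate_ids]
    # Round 2: candidates are repartitioned equally, so each node gets
    # about the same number of deviation-ratio computations.
    deviation = dict(run_round(candidates, n_nodes, deviation_fn))
    outlier = {i: density[i] * deviation[i] for i in candidate_ids}
    # Final step on the main node: sort and keep the top-scoring objects.
    return sorted(outlier, key=outlier.get, reverse=True)[:n_outliers]
```

Repartitioning before the second round matters because the candidates found in the first round need not be spread evenly over the original partitions; without it, a node that happened to hold most of the candidates would do most of the second-stage work.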

Related items

1	Based On The Agent - Aid Map - Reduce Architecture Research And Design Load Balancing Optimization
2	Research On Load Balancing And Related Technology In Large - Scale Multiplayer Online Game
3	Shared Ring And Traffic Load Balancing-Based Failure Recovery Approach In SDN Data Plane
4	Research On Extended Knowledge Discovery In High-Dimension And Sparse Outliers Set
5	Based On Feedback Scheduling Algorithms For Dynamic Load Balancing In The Heterogeneous Environment Of Hadoop Design And Implementation
6	Research On Dynamic Load Balancing Method Of Distributed Crawler System
7	Design And Implementation Of Web Load Balancing System
8	Design And Implementation Of Load Balancing System In The Cloud Platform
9	The Research And Design Of Load-Balancing Under CORBA Environment
10	Distributed Computer System, Dynamic Load Balancing Studies