Outlier Data Mining Based on the Gini Index and Attribute Correlation, and Its Parallelization

Posted on: 2014-02-26 | Degree: Master | Type: Thesis
Country: China | Candidate: Y Shi | Full Text: PDF
GTID: 2248330395491733 | Subject: Computer application technology
Abstract/Summary:
An outlier is a data object that is inconsistent with, or unusual relative to, the other objects in a given data set. Outlier mining is widely applied in fields such as fraud detection and network security analysis. At present, most outlier mining algorithms target small and medium-sized low-dimensional data sets rather than massive, high-dimensional ones. This thesis studies an outlier mining algorithm for high-dimensional data sets, and its parallelization, by combining attribute relevance analysis with the Gini index. The main research work is as follows:

1) An outlier mining algorithm based on attribute relevance analysis and the Gini index is presented. First, redundant attributes are removed from the high-dimensional data set by attribute relevance analysis, which reduces the size of the data set. Second, outliers are mined from the reduced data set using the Gini index as the measure factor. Finally, experimental results on a star spectrum data set validate the feasibility and effectiveness of the algorithm.

2) A parallel outlier mining algorithm based on attribute relevance analysis and the Gini index is presented. First, the data set is divided vertically into multiple subsets, and each subset is assigned to a data node. Each data node deletes redundant attributes by analyzing attribute relevance, which reduces the size of its data subset. The reduced subsets are returned to the name node, where they form the reduced data set. Second, the reduced data set is assigned to the data nodes by attribute, and every data node calculates the outlier measure factor using attribute relevance analysis and the Gini index. Then, the data objects with the smallest measure factor values are selected as outliers. Finally, experimental results on the star spectrum data set in a Hadoop environment validate that the algorithm has good scalability.
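The two steps described in the abstract (prune redundant attributes, then score objects with a Gini-based measure) can be illustrated with a minimal sketch. The abstract does not give the thesis's actual formulas, so everything below is an assumption made for illustration: redundancy is approximated by a Pearson correlation threshold, the Gini index is taken in its common impurity form 1 - Σ p_k², and `outlier_scores` is a hypothetical measure factor, not the one defined in the thesis. The function names and the `threshold` parameter are likewise invented here.

```python
from collections import Counter
from math import sqrt


def gini_index(values):
    """Gini impurity 1 - sum(p_k^2) of a sequence of discrete values."""
    n = len(values)
    return 1.0 - sum((c / n) ** 2 for c in Counter(values).values())


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # constant column: treat as uncorrelated
    return cov / (sx * sy)


def drop_redundant(columns, threshold=0.95):
    """Assumed stand-in for attribute relevance analysis: keep a numeric
    column only if its |correlation| with every kept column is below
    the threshold."""
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, other)) < threshold for other in kept.values()):
            kept[name] = values
    return kept


def outlier_scores(columns):
    """Hypothetical measure factor (not the thesis's definition).

    Each object's score is the sum, over attributes, of the relative
    frequency of its value weighted by (1 - Gini) of that attribute.
    Objects holding rare values in concentrated attributes score lowest.
    """
    n = len(next(iter(columns.values())))
    scores = [0.0] * n
    for values in columns.values():
        freq = Counter(values)
        weight = 1.0 - gini_index(values)
        for i, v in enumerate(values):
            scores[i] += weight * freq[v] / n
    return scores


if __name__ == "__main__":
    numeric = {"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "z": [1, 0, 1, 0]}
    print(sorted(drop_redundant(numeric)))  # y duplicates x and is dropped

    discrete = {"a": ["u", "u", "u", "v"], "b": ["p", "p", "p", "q"]}
    scores = outlier_scores(discrete)
    print(scores.index(min(scores)))  # index of the lowest-scoring object
```

Because each attribute contributes to the score independently, the per-attribute terms could be computed on separate nodes and summed afterwards, which is consistent with the attribute-wise assignment the abstract describes for the parallel version.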
Keywords/Search Tags: Outlier, High-dimensional data set, Attribute relevance analysis, Gini index, Hadoop environment, MapReduce