Outlier Data Mining Based On The Gini Index And Attribute Correlation, And Its Parallelization

Posted on:2014-02-26

Degree:Master

Type:Thesis

Country:China

Candidate:Y Shi

GTID:2248330395491733

Subject:Computer application technology

Abstract/Summary:

An outlier is a data object that is inconsistent with, or unusual relative to, the other objects in a given data set. Outlier mining has been widely applied in fields such as fraud detection and network security analysis. At present, most outlier mining algorithms target small and medium-sized low-dimensional data sets rather than massive, high-dimensional ones. This thesis studies an outlier mining algorithm for high-dimensional data sets, and its parallelization, by combining attribute relevance analysis with the Gini index. The main research work is as follows:

1) An outlier mining algorithm based on attribute relevance analysis and the Gini index is presented. First, redundant attributes are removed from the high-dimensional data set by attribute relevance analysis, which reduces the size of the data set. Second, outliers are mined from the reduced data set using the Gini index as the measure factor. Finally, experimental results on a star spectrum data set validate the feasibility and effectiveness of the algorithm.

2) A parallel outlier mining algorithm based on attribute relevance analysis and the Gini index is presented. First, the data set is divided vertically into multiple subsets, and each subset is assigned to a data node. Each data node deletes redundant attributes by analyzing attribute relevance, which reduces the size of its data subset. The reduced subsets are returned to the name node, where they are combined into the reduced data set. Second, the reduced data set is assigned to the data nodes by attribute, and every data node calculates the outlier measure factor using attribute relevance analysis and the Gini index. The data objects with the smallest measure factor values are then selected as outliers. Finally, experimental results on the star spectrum data set in a Hadoop environment show that the algorithm scales well.
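
The two-stage idea described in the abstract (remove redundant attributes, then rank objects by a Gini-based measure factor) can be illustrated with a minimal Python sketch. The abstract does not give the thesis's formulas, so everything below is an assumption made for illustration: redundancy is detected with Pearson correlation against a hypothetical `threshold`, attributes are assumed to be already discretized, and the measure factor is a hypothetical stand-in (value frequency weighted by `1 - gini`). Only the Gini index itself, `1 - sum(p_k^2)`, is the standard definition. The function names are invented here and are not taken from the thesis.

```python
from collections import Counter
from math import sqrt


def gini_index(values):
    """Standard Gini index of a discrete attribute: 1 - sum(p_k^2)."""
    n = len(values)
    return 1.0 - sum((c / n) ** 2 for c in Counter(values).values())


def pearson(xs, ys):
    """Pearson correlation of two equally long numeric columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column has no linear relationship to report
    return cov / sqrt(vx * vy)


def drop_redundant(columns, threshold=0.95):
    """Assumed relevance test: keep an attribute only if it is not
    strongly correlated with an attribute that was already kept."""
    kept = []
    for j, col in enumerate(columns):
        if all(abs(pearson(col, columns[k])) < threshold for k in kept):
            kept.append(j)
    return kept


def measure_factors(columns):
    """Hypothetical measure factor (not the thesis's formula): for each
    object, sum over attributes of (1 - gini) * frequency of its value.
    Objects holding rare values in concentrated attributes score lowest."""
    n = len(columns[0])
    scores = [0.0] * n
    for col in columns:
        counts = Counter(col)
        weight = 1.0 - gini_index(col)
        for i, v in enumerate(col):
            scores[i] += weight * counts[v] / n
    return scores


def mine_outliers(columns, k=1, threshold=0.95):
    """Return the indices of the k lowest-scoring objects and the kept attributes."""
    kept = drop_redundant(columns, threshold)
    scores = measure_factors([columns[j] for j in kept])
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return ranked[:k], kept


# Toy data stored column-wise (one list per attribute, six objects).
a = [1, 1, 1, 1, 1, 9]
b = [2, 2, 2, 2, 2, 18]   # exactly 2 * a, so it is redundant
c = [0, 1, 0, 1, 0, 1]
outliers, kept = mine_outliers([a, b, c], k=1)
print(outliers, kept)  # [5] [0, 2]
```

In this sketch each attribute contributes to the score independently of the others, which is the property a vertical (by-attribute) partitioning across data nodes would rely on: per-attribute terms could be computed separately and summed afterwards.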

Keywords/Search Tags:

Outlier, High-dimensional data set, Attribute relevance analysis, Gini index, Hadoop environment, MapReduce

Related items

1	Outlier Mining Method Based On Gini Indexes And Sub-space Research
2	Study On Algorithms For Fast Outlier Detection
3	Research On Data Index Application In The MapReduce Framework
4	Research On Key Technology Of Hadoop-based Network Security Log Audit System
5	Analysis And Research Of Outlier Detection Algorithm For High Dimensional Data
6	Key Technology Research On Mixed Store And Two Level Index Of High-dimensional Big Data In Hadoop
7	Property Analysis-based Local Outlier Mining Algorithm And Its Application
8	Research And Application On Outlier Detection Algorithm For High-dimensional Data Stream
9	Towards outlier detection for high-dimensional data streams using projected outlier analysis strategy
10	Research Of Outlier Detection Algorithm Based On Hadoop