The Classification Of Imbalanced Large Data Sets Based On Map Reduce

Posted on: 2016-02-29
Degree: Master
Type: Thesis
Country: China
Candidate: C X Wang
Full Text: PDF
GTID: 2308330479477645
Subject: Computer technology
Abstract/Summary:
The classification of imbalanced large data sets has been an active research topic in machine learning in recent years, because many such data sets arise in practical applications, for example medical diagnosis data, credit card fraud detection data, and network intrusion detection data. Investigating the classification of imbalanced large data sets is therefore meaningful in theory and valuable in practice.

Aiming at the classification of imbalanced large data sets with two classes, the paper presents an algorithm that combines cross-oversampling of positive instances with the integration of classifiers. In the oversampling phase, cross-oversampling of the positive instances alternates between the following two steps. Step 1: compute the center of the positive instances with MapReduce, then sample instance points along the line between the center and each positive instance. Step 2: for each instance point in the new positive class, find its k nearest neighbors among the negative instances with MapReduce, then sample instance points along the lines between that instance and its k nearest negative neighbors. In the integration phase, the negative class is first sampled several times, each sample having the same size as the set of generated positive instances; each negative sample is then combined with the generated positive instances to obtain several balanced data sets. Finally, component classifiers are trained on the balanced data sets with the extreme learning machine and integrated by simple majority voting. The experimental results show that the proposed algorithm achieves promising speed-up and scalability.
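The two oversampling steps and the voting rule described in the abstract can be illustrated with a small single-machine sketch. This is not the thesis implementation: it uses plain Python rather than a MapReduce cluster, all function names are hypothetical, and the interpolation factor is assumed to be drawn uniformly from [0, 1) because the abstract does not say how points are placed along each line. Training of the extreme learning machine classifiers is omitted; only the data preparation and the vote are shown.

```python
import math
import random
from collections import Counter
from functools import reduce


def map_partial_sum(partition):
    """Map step: one non-empty partition -> (component-wise sum, count)."""
    return [sum(col) for col in zip(*partition)], len(partition)


def reduce_partial_sums(a, b):
    """Reduce step: merge two (sum, count) pairs."""
    return [x + y for x, y in zip(a[0], b[0])], a[1] + b[1]


def center(partitions):
    """Center (mean vector) of all points, computed in map/reduce style."""
    total, n = reduce(reduce_partial_sums, map(map_partial_sum, partitions))
    return [s / n for s in total]


def interpolate(a, b, t):
    """Point a + t * (b - a) on the segment from a to b."""
    return [x + t * (y - x) for x, y in zip(a, b)]


def oversample_toward_center(positives, c, rng):
    """Step 1: one synthetic point between the center and each positive."""
    # Assumption: t is uniform in [0, 1); the abstract does not specify it.
    return [interpolate(c, p, rng.random()) for p in positives]


def k_nearest(point, candidates, k):
    """Brute-force k nearest neighbors by Euclidean distance."""
    return sorted(candidates, key=lambda q: math.dist(point, q))[:k]


def oversample_toward_negatives(positives, negatives, k, rng):
    """Step 2: one synthetic point toward each of the k nearest negatives."""
    synthetic = []
    for p in positives:
        for q in k_nearest(p, negatives, k):
            synthetic.append(interpolate(p, q, rng.random()))
    return synthetic


def balanced_subsets(positives, negatives, n_subsets, rng):
    """Pair the positives with n_subsets equally sized negative samples."""
    size = min(len(positives), len(negatives))
    return [(positives, rng.sample(negatives, size)) for _ in range(n_subsets)]


def majority_vote(labels):
    """Simple majority vote over the component classifiers' predictions."""
    return Counter(labels).most_common(1)[0][0]
```

A single round would call `center` on the partitioned positive instances, pass the result to `oversample_toward_center`, and then feed the enlarged positive set to `oversample_toward_negatives`; the abstract describes alternating these two steps. Using an odd number of component classifiers avoids ties in `majority_vote` for the two-class case.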
Keywords/Search Tags: Imbalanced large data sets, MapReduce, Extreme learning machine, Ensemble learning, Majority voting method