Research On Binary Imbalanced Large Data Classification And Its Application

Posted on: 2019-07-17
Degree: Master
Type: Thesis
Country: China
Candidate: X M Liu
Full Text: PDF
GTID: 2428330566965491
Subject: Master of Engineering - Software Engineering
Abstract/Summary:
Big data are data sets too large to be handled and analyzed by traditional software tools, and are commonly characterized by five V's: volume, velocity, variety, value, and veracity. In the real world, however, some big data have a further property: class imbalance. E-health data, credit card fraud detection data, and extreme weather forecast data are all class imbalanced. Binary imbalanced data are data in which one class (the positive class) is highly under-represented compared with the other class (the negative class). Neither traditional classification approaches nor traditional assessment metrics can be applied directly in this setting. Because binary imbalanced data arise frequently in practice, investigating binary imbalanced classification is important in both theory and application.

This thesis investigates the problem of binary imbalanced classification and proposes two methods for classifying binary imbalanced large data sets, both based on oversampling and ensemble learning. The two methods share the same three-step outline: first, oversample the positive instances; second, construct balanced sub data sets and train a base classifier on each of them; third, integrate the trained base classifiers with an ensemble method. The methods differ in how they oversample. In the first method, for each positive instance, new positive instances are generated on the line between that instance and each of its enemy negative instances. In the second method, new positive instances are generated within the positive instance's enemy nearest neighbor hypersphere. Because the negative class is a large data set, the distances between every positive instance and every negative instance are computed with MapReduce.

Both methods construct balanced sub data sets in the same way: according to the cardinality of the set of positive instances, the set of negative instances is partitioned into subsets, and each negative subset is combined with the set of positive instances to form a balanced subset. A base classifier is then trained on each balanced subset with an extreme learning machine. The ensemble strategies of the two methods differ: the first uses majority voting and the second uses the fuzzy integral. Experiments on multiple data sets, including comparisons with related approaches, were conducted to verify the two proposed methods. The experimental results and the statistical analysis demonstrate that the proposed methods are effective and efficient, and that they outperform related methods in classification accuracy and running time.
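The three steps of the first method can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the thesis's implementation: the function and class names (`oversample_toward_enemies`, `balanced_subsets`, `ELM`, `majority_vote`) are hypothetical, "enemy negative instances" are approximated by the `k` nearest negatives, synthetic points are kept on the positive half of each segment (`max_frac=0.5`), and distances are computed in memory rather than with MapReduce. The hypersphere oversampling and the fuzzy integral of the second method are not shown.

```python
import numpy as np

def oversample_toward_enemies(X_pos, X_neg, k=3, max_frac=0.5, rng=None):
    """Step 1: place synthetic positives on the segment between each positive
    and its k nearest negatives (k and max_frac are assumptions)."""
    rng = np.random.default_rng(rng)
    k = min(k, len(X_neg))
    # Pairwise positive-to-negative distances, in memory (assumption: small data).
    d = np.linalg.norm(X_pos[:, None, :] - X_neg[None, :, :], axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]               # shape (n_pos, k)
    t = rng.uniform(0.0, max_frac, size=(len(X_pos), k, 1))
    synth = X_pos[:, None, :] + t * (X_neg[nearest] - X_pos[:, None, :])
    return synth.reshape(-1, X_pos.shape[1])

def balanced_subsets(X_pos, X_neg, rng=None):
    """Step 2a: split the negatives into parts of roughly |positives| each and
    pair every part with the full positive set."""
    rng = np.random.default_rng(rng)
    idx = rng.permutation(len(X_neg))
    n_parts = max(1, len(X_neg) // len(X_pos))
    for part in np.array_split(idx, n_parts):
        X = np.vstack([X_pos, X_neg[part]])
        y = np.concatenate([np.ones(len(X_pos), dtype=int),
                            np.zeros(len(part), dtype=int)])
        yield X, y

class ELM:
    """Step 2b: a minimal extreme learning machine with random hidden weights
    and output weights solved by pseudoinverse."""
    def __init__(self, n_hidden=20, rng=None):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(rng)

    def fit(self, X, y):
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = np.tanh(X @ self.W + self.b)
        self.beta = np.linalg.pinv(H) @ (2.0 * y - 1.0)   # targets in {-1, +1}
        return self

    def predict(self, X):
        return (np.tanh(X @ self.W + self.b) @ self.beta > 0).astype(int)

def majority_vote(models, X):
    """Step 3: label 1 when more than half of the base classifiers vote 1
    (ties go to the negative class)."""
    votes = np.stack([m.predict(X) for m in models])
    return (votes.mean(axis=0) > 0.5).astype(int)

# Toy usage on synthetic 2-D data (made up for illustration only).
rng = np.random.default_rng(0)
X_neg = rng.normal(0.0, 1.0, size=(400, 2))
X_pos = rng.normal(3.0, 1.0, size=(20, 2))
X_aug = np.vstack([X_pos, oversample_toward_enemies(X_pos, X_neg, k=3, rng=0)])
subsets = list(balanced_subsets(X_aug, X_neg, rng=0))
models = [ELM(rng=i).fit(X, y) for i, (X, y) in enumerate(subsets)]
pred = majority_vote(models, np.vstack([X_pos, X_neg]))
```

Every base classifier sees all positive instances but only one slice of the negatives, so the large negative class is divided across the ensemble rather than discarded, as it would be under plain undersampling.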
Keywords/Search Tags: Imbalanced large data sets, Over sampling, Extreme learning machine, Ensemble learning, Majority voting method, Fuzzy integral