
Research Of Ensemble Classification Methods For Class-imbalance And Cost-sensitive Datasets

Posted on: 2018-10-23 | Degree: Master | Type: Thesis
Country: China | Candidate: X Wei
GTID: 2348330512486747 | Subject: Computer application technology
Abstract/Summary:
With the arrival of the big data era, machine learning, as the foundation of modern data analysis technology, plays a crucial role and at the same time faces many challenges. Classification is one of the most basic and central problems in machine learning and has long attracted attention from researchers. Traditional classification algorithms rest on two assumptions: 1) the numbers of samples in the different classes are roughly equal; 2) the misclassification costs of samples in the different classes are roughly equal. Real-world datasets, however, are frequently class-imbalanced and cost-sensitive, which makes traditional classification algorithms poorly suited to them. Class imbalance means that some classes have far more samples than others; cost sensitivity means that misclassifying a sample of one class costs more than misclassifying a sample of another. On a class-imbalanced dataset, a classifier that seeks high accuracy tends to misclassify the minority-class samples, which are usually the more important ones; on a cost-sensitive dataset, a classifier that ignores the cost differences cannot minimize the total misclassification cost.

Because these problems are both widespread and important, researchers have studied class-imbalanced and cost-sensitive learning extensively and proposed a variety of methods. These methods fall into two groups: 1) data-level methods, which reconstruct the training set to change the sample distribution, with resampling as the typical technique; 2) algorithm-level methods, which redesign existing algorithms to make them adaptive, with cost-sensitive learning and Boosting-based algorithms as the typical techniques. Among these, ensemble learning plays an important role. After decades of research the field has made considerable progress, but shortcomings remain, such as over-fitting and information loss, which may reduce the stability and reliability of the classifier. Addressing the class-imbalance and cost-sensitivity problems, the main contributions of this thesis are as follows.

· We propose two ensemble models based on resampling: xEnsemble and RSEnsemble. After introducing the theoretical basis, we optimize the existing algorithms and propose the two new methods. We then argue for their effectiveness from the perspectives of bias-variance and error-ambiguity, respectively.
· We apply xEnsemble and RSEnsemble to a real-life diabetes diagnosis dataset. This dataset is large-scale, highly imbalanced and cost-sensitive. After fixing the evaluation criteria, we preprocess the dataset and conduct a series of experiments. The results show that our methods perform better than other relevant methods.
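The gap between accuracy and total misclassification cost, and the data-level idea of resampling, can be illustrated with a small sketch. This is not the thesis's xEnsemble or RSEnsemble, whose details the abstract does not give; the toy labels, the cost values and the helper names (`total_cost`, `accuracy`, `undersample`) are all assumptions made for illustration.

```python
import random
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def total_cost(y_true, y_pred, cost):
    """Sum of misclassification costs; cost[(true, pred)] is a hypothetical table."""
    return sum(cost.get((t, p), 0.0) for t, p in zip(y_true, y_pred) if t != p)


def undersample(labels, rng):
    """Return sorted indices of a class-balanced subset (random undersampling)."""
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    n_min = min(len(idx) for idx in by_class.values())
    picked = []
    for idx in by_class.values():
        # Keep n_min randomly chosen samples from every class.
        picked.extend(rng.sample(idx, n_min))
    return sorted(picked)


# Toy data (assumed): 95 majority samples (label 0), 5 minority samples (label 1).
y_true = [0] * 95 + [1] * 5
always_majority = [0] * 100           # a classifier that ignores the minority class
cost = {(1, 0): 20.0, (0, 1): 1.0}    # assumed: missing a minority sample costs 20x

print(accuracy(y_true, always_majority))          # 0.95
print(total_cost(y_true, always_majority, cost))  # 100.0

# Several balanced subsets; a resampling-based ensemble would train one
# base classifier per subset and combine their votes.
rng = random.Random(0)
subsets = [undersample(y_true, rng) for _ in range(3)]
print([Counter(y_true[i] for i in s) for s in subsets])
```

The always-majority predictor reaches 95% accuracy while paying the full cost of every missed minority sample, which is the failure mode the abstract describes. Drawing several balanced subsets rather than one is a common way to limit the information loss of a single undersampling pass, since different majority samples appear in different subsets.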
Keywords/Search Tags:Machine Learning, Class-imbalance Learning, Cost-sensitive Learning, Resampling, Ensemble Learning