Font Size: a A A

Research And Application On Imbalanced Data Set Classification Problems

Posted on:2015-02-27Degree:MasterType:Thesis
Country:ChinaCandidate:B Y SunFull Text:PDF
GTID:2268330425996666Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
The classification problem of imbalanced data set is one of the most challengingresearch problems in the field of data mining and machine learning. In recent years,with the development of computer technology and the progress of informationtechnology, more and more decisions need the support from data. In the age of bigdata, classification that is based on data mining techniques has become the powerfulmeans of immediate decision-making, precision marketing and even improving thecomprehensive competitiveness. Imbalanced data set is a form that exists in realityarea, which describes truly and objectively the essential characters of something. In aword, people just care about a little part from the huge data, but this little part of datausually hides within the huge data, which makes us cannot separate them accurately.Imbalance data classification problem is a difficult problem about data mining, andthere are many common classification strategies for the traditional classificationproblem which cannot deal with the imbalanced data set well, so it attracts more andmore attention of experts and scholars all over the world.This thesis firstly introduce the conception of imbalanced data set and theprogress of imbalance data classification problem that is being studied by experts andscholars in the world, and it explains the reasons why imbalance dataclassification problem is so difficult to work out, the treatments we often adopt aboutthis problem, and the evaluating metric of classification performance. With fairlyconsideration about the shortage of the imbalanced data information, datesubmerging and the information loss after taking samples, this paper proposes astrategy for re-sampling of the imbalanced data, which bases on the resampling ofthe boundary of clustering. Combined with the ensemble learning method based onsupport vector machine, two aspects from data and algorithm are putting forward tosolving the imbalanced data classification problem. In the section of verification andanalysis of experiments, with four typical forms of imbalanced data sets werevalidated the effectiveness of this strategy. Finally, combined with the ensemblelearning method will imbalance data classification problem applied to a telecom customer relationship prediction, using the real data of telecom customer relationship,the specific sampling and classification strategy is integrated into the system, alsohas better classification effect in practical application.
Keywords/Search Tags:imbalanced data, classification, re-sampling, ensemble learning
PDF Full Text Request
Related items