Classification Research For Unbalanced Data Based On Hybrid-sampling

Posted on: 2015-04-04
Degree: Master
Type: Thesis
Country: China
Candidate: Y Y Y Ou
Full Text: PDF
GTID: 2298330422472053
Subject: Computer software and theory
Abstract/Summary:
Classification of unbalanced data sets is an important issue in machine learning and data mining. In a two-class problem, a data set is called unbalanced when one class has far fewer samples than the other. The class with fewer samples is called the positive class, and the class with more samples is called the negative class. Because of the severe imbalance between the class sizes, traditional classification methods achieve high predictive accuracy on negative-class samples but poor accuracy on positive-class samples. In practice, however, the recognition rate of the positive class is much more important.

To solve this problem, the classification of unbalanced data sets needs new methods. Currently there are two strategies. One improves the classification algorithm itself, for example through ensemble methods, cost-sensitive learning, feature selection and one-class learning; the other reconstructs the data set. SMOTE is the classic over-sampling algorithm belonging to the latter strategy. However, SMOTE has some problems: sampling from noisy samples may introduce new noise, while unreasonable sampling can shrink the decision region. This thesis proposes SVM-IMSA, a learning algorithm for unbalanced data sets based on a hybrid-sampling strategy, and studies the following key issues:

1. To remove the interference of noise samples in SMOTE sampling, a hybrid-sampling algorithm driven by misclassified samples is proposed, which directly deletes samples identified as noise based on the spatial relationship between neighbors.

2. Blind and unreasonable sampling in the existing SMOTE algorithm leads to overlap in the sample space. Driven by misclassified samples, the thesis divides them into safe points, noise points and danger points according to their spatial neighborhood relationships, and adaptively applies over-sampling and under-sampling to the safe and danger points, in order to correct the offset of the support vector machine's decision surface on unbalanced data sets.

3. The random linear interpolation of SMOTE may leave sparse regions sparse and dense regions dense, so it cannot sample effectively in the more meaningful regions of the sample space. Through iteration, the thesis gradually concentrates the sampling area on samples that are difficult to distinguish and increases the sampling rate for samples identified as safe, so that the classifier pays more attention to these hard-to-classify samples.

4. The traditional random under-sampling strategy is improved, and a boundary-region cutting algorithm is proposed. By analyzing the density and density-reachability of the danger points among the negative samples, under-sampling is applied to the negative-class samples while avoiding the mistaken removal of important negative samples that occurs in the traditional algorithm.
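
The neighborhood-based labelling and the SMOTE interpolation mentioned in the points above can be illustrated with a small sketch. This is not the thesis's SVM-IMSA algorithm. It assumes a common Borderline-SMOTE-style rule (all k neighbors in the other class means noise, at least half means danger, otherwise safe), labels every minority sample rather than only those misclassified by an SVM, and over-samples only the danger points. The function names `nearest`, `tag_minority` and `interpolate`, the thresholds and the toy data are all hypothetical.

```python
import math
import random

def nearest(points, i, k):
    """Indices of the k points closest to points[i] (Euclidean), excluding i."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]

def tag_minority(points, labels, k=3, minority=1):
    """Tag each minority sample as 'safe', 'danger' or 'noise'.

    Assumed rule (not taken from the thesis): all k neighbours in the
    majority class -> noise; at least half -> danger; otherwise safe.
    """
    tags = {}
    for i, lab in enumerate(labels):
        if lab != minority:
            continue
        n_major = sum(labels[j] != minority for j in nearest(points, i, k))
        if n_major == k:
            tags[i] = "noise"
        elif 2 * n_major >= k:
            tags[i] = "danger"
        else:
            tags[i] = "safe"
    return tags

def interpolate(a, b, rng):
    """SMOTE-style synthetic sample: a + u * (b - a) with u drawn from [0, 1)."""
    u = rng.random()
    return tuple(x + u * (y - x) for x, y in zip(a, b))

# Toy data: label 1 = minority (positive) class, label 0 = majority (negative) class.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),  # minority cluster
          (1.0, 1.0),                                       # minority, near the boundary
          (5.0, 5.0),                                       # minority, isolated among majority
          (1.2, 1.0), (1.0, 1.2),                           # majority, near the boundary
          (5.1, 5.0), (5.0, 5.1), (4.9, 5.0)]               # majority cluster
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

tags = tag_minority(points, labels, k=3)
keep = [i for i, t in tags.items() if t != "noise"]  # noise points are dropped

rng = random.Random(0)
synthetic = []
for i in keep:
    if tags[i] != "danger":
        continue
    # Interpolate towards the nearest retained minority neighbour.
    j = min((m for m in keep if m != i),
            key=lambda m: math.dist(points[i], points[m]))
    synthetic += [interpolate(points[i], points[j], rng) for _ in range(3)]

print(tags)
print(synthetic)
```

On this toy data the isolated minority point is tagged as noise and removed, while the minority point near the boundary is tagged as danger and receives three synthetic neighbors on the segment towards the minority cluster. A fuller version would take the misclassified samples from a trained SVM as input and would also under-sample the majority class.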
Keywords/Search Tags: hybrid-sampling, misclassified samples, unbalanced data, support vector machine