Research On Ensemble Approach For Classification Of Imbalanced Data Sets

Posted on: 2018-11-26
Degree: Master
Type: Thesis
Country: China
Candidate: J W Guo
Full Text: PDF
GTID: 2428330566498945
Subject: Computer technology
Abstract/Summary:
An imbalanced dataset is one in which the number of instances in one class is much larger than in the other classes. Classifying imbalanced datasets is a common task in machine learning and pattern recognition. Most traditional classification algorithms assume that the sample size of each class is roughly balanced, so these classifiers often fail to achieve good results on imbalanced data. Recently, many researchers have proposed approaches to the imbalance problem and achieved good results. Hybrid sampling methods at the data level combine the advantages of undersampling and oversampling, and can mitigate the problem of samples invading other clusters that oversampling causes. Current research on hybrid sampling is insufficient, however. Many hybrid sampling procedures carry out the two sampling steps separately, ignoring problems such as the differing importance of minority samples caused by intra-class imbalance and the difficulty of determining the ratio of positive to negative samples. The ensemble framework is widely used for classifying imbalanced datasets, and the diversity of the sampling combination also affects the performance of ensemble learning. To address these problems, this thesis studies hybrid sampling at the data level and the combination of data preprocessing with an ensemble framework at the algorithm level.

Taking into account the characteristics of imbalanced datasets and of hybrid sampling, we propose an ensemble learning scheme based on hybrid sampling. Existing hybrid sampling methods do not take the interaction between the two sampling methods into consideration. Because minority class instances differ in importance during sampling, we use evolutionary algorithms to guide the hybrid sampling process. Each chromosome represents one combination of hybrid sampling, and binary code is used to encode the sampling ratio of the minority class samples. The combination of undersampling and oversampling is evaluated by the fitness function, so the sampling rate of every minority class sample can be optimized. Based on these strategies, we propose a new hybrid sampling method. Through the definition of the evolutionary algorithm's search space, our method can handle noise samples, boundary samples, and intra-class imbalance among minority class samples.

Ensemble learning has an advantage in solving the imbalanced classification problem, and it can be used to combine sampling techniques to handle the imbalanced distribution of the samples. Because the AdaBoost framework is sensitive to the diversity of its weak learners, we propose an ensemble algorithm based on evolutionary hybrid sampling, whose fitness function contains a diversity factor for each hybrid sampling solution. Experiments on 16 datasets verify the effectiveness of evolutionary hybrid sampling. A comparison of the AUC values of the proposed algorithm with those of other ensemble algorithms on imbalanced classification problems verifies the effectiveness of the proposed ensemble method based on evolutionary hybrid sampling.
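As an illustration of the binary encoding described above, the following is a minimal sketch of how a chromosome might be decoded into a hybrid sampling plan. The abstract does not give the actual layout, so everything here is an assumption: the names `BITS_PER_MINORITY`, `decode_chromosome`, and `apply_sampling` are hypothetical, each minority sample is assumed to get a fixed-width bit field read as a replication count (0 drops the sample, for example as noise), each majority sample is assumed to get a single keep/drop bit, and oversampling is modeled as plain duplication rather than whatever synthesis the thesis uses. The fitness function, including its diversity factor, is not sketched.

```python
from typing import List, Sequence, Tuple

# Hypothetical choice: 2 bits per minority sample -> replication count 0..3.
BITS_PER_MINORITY = 2


def decode_chromosome(
    chromosome: Sequence[int], n_minority: int, n_majority: int
) -> Tuple[List[int], List[bool]]:
    """Split a bit string into per-sample sampling decisions.

    Assumed layout: first n_minority * BITS_PER_MINORITY bits hold the
    minority replication counts, the remaining n_majority bits are
    keep/drop flags for the majority samples.
    """
    expected = n_minority * BITS_PER_MINORITY + n_majority
    if len(chromosome) != expected:
        raise ValueError(f"expected {expected} bits, got {len(chromosome)}")

    counts = []
    for i in range(n_minority):
        bits = chromosome[i * BITS_PER_MINORITY:(i + 1) * BITS_PER_MINORITY]
        # Read the field as an unsigned integer, most significant bit first.
        counts.append(int("".join(str(b) for b in bits), 2))

    keep = [bool(b) for b in chromosome[n_minority * BITS_PER_MINORITY:]]
    return counts, keep


def apply_sampling(minority: list, majority: list,
                   counts: List[int], keep: List[bool]) -> Tuple[list, list]:
    """Replicate each minority sample `count` times; keep flagged majority samples."""
    new_minority = [x for x, c in zip(minority, counts) for _ in range(c)]
    new_majority = [x for x, k in zip(majority, keep) if k]
    return new_minority, new_majority
```

For instance, with three minority and five majority samples, the bit string `[1,1, 0,0, 0,1, 1,0,1,1,0]` decodes to replication counts `[3, 0, 1]` and keep flags for the first, third, and fourth majority samples. An evolutionary algorithm would then score the resulting resampled set with its fitness function and evolve the bit strings accordingly.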
Keywords/Search Tags: imbalanced dataset, hybrid sampling, ensemble classification, evolutionary algorithms