
Research On Oversampling Method For Multi-class Imbalanced Learning

Posted on: 2020-09-07
Degree: Master
Type: Thesis
Country: China
Candidate: Z L Jiang
Full Text: PDF
GTID: 2428330611494708
Subject: Engineering
Abstract/Summary:
Imbalanced learning is an important research topic in machine learning. In imbalanced data the class distribution is skewed, and the minority classes usually carry the higher value. However, traditional machine learning algorithms are dominated by the majority classes, so they have a low recognition rate for minority classes and cannot deal with imbalanced data effectively. Oversampling is an effective way to address imbalanced learning problems. Its main idea is to generate samples for the minority classes so that their size balances that of the majority class. Although oversampling methods have been studied extensively and are widely applied, most existing methods cause over-generalization when dealing with multi-class imbalanced data. This thesis proposes two improved methods to cope with the over-generalization of oversampling in multi-class imbalanced learning. The two methods are developed from three aspects: sampling direction selection, synthesized sample evaluation, and sampling number calculation. A simple demonstration system for the oversampling methods is also presented. The main research results and innovations of this thesis are as follows.

First, an oversampling approach based on Hellinger distance and SMOTE (HDSMOTE) is proposed. When dealing with multi-class imbalanced data, HDSMOTE uses the Hellinger distance both to guide the direction in which samples are synthesized and to evaluate the quality of the synthesized samples, which can reduce the risk of over-generalization. A sampling direction selection strategy based on the Hellinger distances of the local neighborhood guides the direction of each synthesized sample. A sampling quality evaluation strategy based on Hellinger distance prevents synthesized samples from falling into other classes, which can reduce the risk of over-generalization. 15 multi-class imbalanced data sets were preprocessed by 7 representative oversampling algorithms and by HDSMOTE, and then classified with the C4.5 decision tree. Experiments show that the HDSMOTE algorithm performs better on the RIPPER classifier than the 7 representative oversampling algorithms.

Second, a high quality oversampling framework for multi-class imbalanced learning (HQOF) is proposed. HQOF analyzes the distribution of each minority class and its surrounding samples and then adaptively calculates the number of samples to generate, which can reduce the risk of overfitting. HQOF also uses the Hellinger distance decision tree to train a supervision model for the minority class and evaluates the quality of synthetic samples with it, which can reduce the risk of over-generalization. HQOF consists of three parts. First, an adaptive sampling strategy based on Mahalanobis distance determines the sampling number by analyzing the distribution of the minority class and its surroundings, which can reduce both the sampling number and the risk of overfitting. Second, a traditional oversampling method is used for sampling. Finally, a supervision mechanism based on the Hellinger distance decision tree evaluates the newly synthesized samples, which can reduce the risk of over-generalization. Seven representative oversampling algorithms were embedded into HQOF. 19 multi-class imbalanced data sets were preprocessed by 6 new oversampling methods and 7 original oversampling methods, and the preprocessed data were then classified with the C4.5 decision tree and naive Bayes classifiers. The results show that HQOF can reduce the sampling number while ensuring the validity of the oversampling.

Finally, a simple demonstration system has been developed. It consists of two modules, one for oversampling and one for classification. The oversampling module encapsulates 8 oversampling methods, and the classification module encapsulates 6 classifiers. As a whole, the system implements oversampling and data classification and presents the results to the user graphically.
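The abstract does not give the formulas behind HDSMOTE, so the following is only a minimal sketch of the two ingredients it names: a SMOTE-style interpolation step and a Hellinger distance check on a synthesized sample. The function names, the k-nearest-neighbor class distribution, and the `threshold` value are all assumptions made for illustration; they are not the thesis's sampling direction or quality evaluation strategies.

```python
import math
import random


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 = same, 1 = disjoint)."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(total / 2.0)


def smote_sample(x, neighbor, rng):
    """Classic SMOTE step: a random point on the segment from x to neighbor."""
    gap = rng.random()
    return [a + gap * (b - a) for a, b in zip(x, neighbor)]


def local_class_distribution(point, X, y, classes, k):
    """Class frequencies among the k nearest training points (illustrative helper)."""
    order = sorted(range(len(X)), key=lambda i: math.dist(point, X[i]))
    labels = [y[i] for i in order[:k]]
    return [labels.count(c) / len(labels) for c in classes]


def accept_sample(point, X, y, minority, classes, k=3, threshold=0.5):
    """Hypothetical quality check: keep a synthetic point only if its neighborhood
    is close, in Hellinger distance, to a pure minority-class distribution."""
    local = local_class_distribution(point, X, y, classes, k)
    pure = [1.0 if c == minority else 0.0 for c in classes]
    return hellinger(local, pure) <= threshold


# Toy data: class "a" is the minority cluster near the origin.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0], [5.0, 6.0]]
y = ["a", "a", "a", "b", "b", "b"]
classes = ["a", "b"]

rng = random.Random(0)
candidate = smote_sample(X[0], X[1], rng)  # lies between two minority points
keep = accept_sample(candidate, X, y, "a", classes)
```

In this toy setup a candidate interpolated between two minority points is kept, while a point placed inside the other cluster, such as `[5.5, 5.0]`, would be rejected because its neighborhood consists entirely of class "b".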
Keywords/Search Tags: Oversampling, Imbalanced learning, Adaptive sampling, Supervision mechanism, Classification