
Imbalanced Classification Methods For Complex Distribution Characteristics

Posted on: 2021-01-07
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H J Guan
Full Text: PDF
GTID: 1368330614950702
Subject: Computer Science and Technology
Abstract/Summary:
Imbalanced classification is a persistent challenge in machine learning and its practical applications. It arises widely in medical diagnosis, network intrusion detection, biometric identification, fault detection, text classification, and other domains. Traditional classification methods perform poorly on imbalanced data; in particular, the recognition rate of the minority class is much lower than that of the majority class. In practice, however, the minority class usually receives more attention than the majority class, and misclassifying a minority sample usually costs more than misclassifying a majority sample. Exploring solutions to imbalanced classification problems is therefore of both theoretical significance and practical value.

The fundamental reason for the degraded performance lies in the complex distribution characteristics inherent in imbalanced data, including small disjuncts, overlapping between classes, and rare cases and outliers in the minority class space. Conventional classification algorithms aim to minimize the overall error rate. Scarce minority samples and complex distribution characteristics bias such methods toward the majority class, which reduces generalization performance on the minority class. To improve the recognition rate of the minority class and reduce the misclassification cost without sacrificing overall performance, this thesis preprocesses imbalanced datasets at the data level, optimizes base classifiers at the algorithm level, and proposes abstaining classification models at the decision-making level. The main work comprises the following four parts.

First, a hybrid resampling method based on a weighted edited nearest neighbor rule is proposed at the data level. It addresses the problem that the edited nearest neighbor rule shrinks the minority class space because of the low local density of minority class samples. Considering two factors related to local distribution, namely local imbalance and spatial sparsity, the method applies different scaling distances to the candidate neighbors of majority and minority class examples, which increases the local density of minority class samples and reduces the local density of majority class samples. The proposed method avoids blindly deleting minority class samples in sparse areas, keeps minority class samples in the overlapping area between classes, and cleans up majority class samples to alleviate the bias of the classification boundary. The experimental results show that the proposed hybrid sampling method improves classification performance significantly.

Second, an embedded optimization method based on undersampling bagging is proposed at the algorithm level. It addresses two problems of previous undersampling bagging methods: they ignore minority class examples in local areas when sampling, and their base classifiers lack sensitivity to the minority class. The proposed method optimizes the geometric mean and sensitivity, which are insensitive to class distribution, and uses misclassified out-of-bag samples to strengthen learning in local areas. This makes the base classifiers pay attention to local areas of the minority class and alleviates the algorithm's bias toward the majority class. The experimental results show that the proposed optimization method improves classification performance.

Third, an ROC-based bounded abstaining classification model with two constraints is proposed at the decision-making level. It overcomes the shortcomings of previous methods, which require an unknown cost matrix to be set and optimize performance indicators that are sensitive to class distribution. The proposed abstaining classification model constrains the rejection rates of the positive and negative classes separately and optimizes the area under the ROC curve. The model does not depend on the cost matrix and distinguishes between the rejection rates and recognition rates of different classes, which makes it suitable for imbalanced data. To solve the proposed model, an algorithm with linear time complexity based on the ROC curve is proposed. The experimental results show that the proposed model obtains a better performance-rejection curve and a lower cost.

Finally, a bounded abstaining classification model with two constraints, based on dual-objective optimization, is proposed at the decision-making level. It addresses the problem that previous rejection classification methods rely on the cost matrix and that a single optimization target leads to poor robustness across application scenarios. The proposed model constrains the rejection rates of the positive and negative classes separately and minimizes the error rates of the positive and negative classes. The model can select the best abstaining classifier from the Pareto optimal set according to a given cost matrix, rejection constraints, or required evaluation indicators, and thus has strong applicability. The experimental results show that the proposed model obtains a better performance-rejection curve and a lower cost.

This thesis analyzes the reasons for the performance degradation in imbalanced classification and proposes imbalanced classification methods at the data, algorithm, and decision-making levels. The proposed methods improve classification performance and reduce the misclassification cost.
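To make the idea of class-wise rejection constraints concrete, the following is a minimal sketch of a two-threshold abstaining rule. It is not the thesis's ROC-based or Pareto-based algorithm. It assumes a scoring classifier whose higher scores indicate the positive class, and all names (`abstain_predict`, `classwise_rates`, `satisfies_bounds`, `t_low`, `t_high`) are hypothetical and chosen for illustration only.

```python
def abstain_predict(scores, t_low, t_high):
    """Return +1 (positive), -1 (negative), or 0 (abstain) for each score.

    Assumption: scores at or above t_high are accepted as positive,
    scores at or below t_low as negative, and anything in between is rejected.
    """
    if t_low > t_high:
        raise ValueError("t_low must not exceed t_high")
    preds = []
    for s in scores:
        if s >= t_high:
            preds.append(1)
        elif s <= t_low:
            preds.append(-1)
        else:
            preds.append(0)
    return preds


def classwise_rates(labels, preds):
    """Compute rejection and error rates separately for each true class."""
    stats = {}
    for cls in (1, -1):
        cls_preds = [p for y, p in zip(labels, preds) if y == cls]
        n = len(cls_preds)
        rejected = sum(1 for p in cls_preds if p == 0)
        wrong = sum(1 for p in cls_preds if p == -cls)
        stats[cls] = {"rejection_rate": rejected / n, "error_rate": wrong / n}
    return stats


def satisfies_bounds(stats, max_rej_pos, max_rej_neg):
    """Check the two rejection constraints, one per class."""
    return (stats[1]["rejection_rate"] <= max_rej_pos
            and stats[-1]["rejection_rate"] <= max_rej_neg)


# Toy imbalanced data (illustrative values only): 4 positives, 8 negatives.
labels = [1, 1, 1, 1, -1, -1, -1, -1, -1, -1, -1, -1]
scores = [0.9, 0.7, 0.5, 0.2, 0.8, 0.6, 0.4, 0.3, 0.2, 0.1, 0.1, 0.05]

preds = abstain_predict(scores, t_low=0.3, t_high=0.7)
stats = classwise_rates(labels, preds)
print(stats)
print(satisfies_bounds(stats, max_rej_pos=0.3, max_rej_neg=0.3))
```

Under these assumptions, a search over threshold pairs could keep only those that satisfy both bounds and then compare the survivors by class-wise error rates. Reporting the rates per class keeps the heavily populated majority class from masking what happens to the minority class.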
Keywords/Search Tags: Imbalanced data classification, Weighted edited nearest neighbor rule, Ensemble learning, Bounded abstaining model, Multi-objective optimization