Research On Imbalanced Classification Algorithm Based On Generative Model

Posted on: 2020-02-09 | Degree: Master | Type: Thesis
Country: China | Candidate: Y Zhou | Full Text: PDF
GTID: 2428330590973942 | Subject: Computer Science and Technology
Abstract/Summary:
The imbalanced problem refers to the uneven distribution of data across the classes of a dataset. Traditional classification algorithms assume that the data are roughly balanced. As a result, they pay too little attention to minority-class data, ignore the valuable information those samples contain, and suffer degraded classification performance. Researchers have proposed solutions at both the data level and the algorithm level, with some success. Data-level solutions are part of data pre-processing: they adjust the data distribution with a sampling algorithm to make it more balanced. These solutions have several drawbacks, however. Under-sampling the majority data may lose information; random oversampling of the minority data does not guarantee that the data distribution is consistent before and after oversampling; and oversampling based on a probability distribution function must assume the form of the data distribution, which limits the algorithm. In addition, when the oversampling algorithm is separate from the classification algorithm, the generated data only guarantee a balanced dataset and cannot ensure that classifier performance improves.

This paper studies the following three aspects.

First, to address the problem that the imbalance rate cannot reflect the data distribution, this paper proposes an improved generalized imbalance measurement. Within the generalized imbalance measurement, the mean number of nearest neighbors with the same label is computed using weighted distances, and the product of the positive and negative subset measurements is used rather than their difference. The improved measurement increases the correlation between the imbalance measurement and the classification result.

Second, to address the problem that oversampling algorithms based on the data distribution need a hypothesis about that distribution, an oversampling method based on the variational auto-encoder is proposed. The variational auto-encoder serves as the probability estimation function of the data distribution. Because the variational auto-encoder is differentiable, it can only generate numerical data. The proposed method therefore generates numerical and non-numerical features separately, and it improves the classification performance of the classifier.

Third, because samples generated by a stand-alone oversampling algorithm cannot ensure improved classification performance, this paper proposes an oversampling classification framework based on the variational auto-encoder. Using an incremental logistic regression classifier, the quality of the generated samples is converted into the classification performance of the desired classifier. Generation and classification are trained jointly: the parameters of the generator are adjusted according to the classification performance of the desired classifier, so that the generated samples improve that performance. Because sample generation is built into classifier training, no oversampling rate has to be set, which avoids an improper rate and its negative effects on classification. The experimental results verify that the oversampling classification framework based on the variational auto-encoder can improve the F1 value of the minority data.
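
The abstract does not give the formula for the improved generalized imbalance measurement, so the following Python sketch illustrates only the two ideas it names: same-label nearest-neighbor scores computed with distance weighting, and the positive and negative subset scores combined by a product. The inverse-distance weights, the normalization to [0, 1], the value of k, and the function names `weighted_same_label_score` and `product_measure` are all assumptions made for illustration, not the thesis's definitions.

```python
import math

def weighted_same_label_score(points, labels, target, k=3, eps=1e-9):
    """Mean distance-weighted share of same-label samples among the k nearest
    neighbors, averaged over all samples of class `target`.
    Assumption: inverse-distance weights; the thesis may weight differently.
    Requires at least two samples overall and one sample of class `target`."""
    scores = []
    for i, (p, y) in enumerate(zip(points, labels)):
        if y != target:
            continue
        # k nearest neighbors of sample i, excluding the sample itself
        neighbours = sorted(
            (math.dist(p, q), labels[j])
            for j, q in enumerate(points) if j != i
        )[:k]
        weights = [1.0 / (d + eps) for d, _ in neighbours]
        same = sum(w for w, (_, lab) in zip(weights, neighbours) if lab == target)
        scores.append(same / sum(weights))
    return sum(scores) / len(scores)

def product_measure(points, labels, k=3):
    """Hypothetical combination: product of the positive (label 1) and
    negative (label 0) subset scores, instead of their difference."""
    pos = weighted_same_label_score(points, labels, 1, k)
    neg = weighted_same_label_score(points, labels, 0, k)
    return pos * neg

# Six majority samples (label 0) and three minority samples (label 1).
majority = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (0, 0.5)]
labels = [0] * 6 + [1] * 3

# Same 2:1 imbalance rate, different geometry.
separated = majority + [(10, 10), (10, 11), (11, 10)]
overlapping = majority + [(0.5, 0.4), (0.4, 0.6), (10, 10)]

sep_value = product_measure(separated, labels, k=2)
ovl_value = product_measure(overlapping, labels, k=2)
print(sep_value, ovl_value)
```

Both toy datasets have the same imbalance rate, yet the sketch scores the overlapping one lower, which is the kind of distribution-sensitive behavior the abstract says the plain imbalance rate cannot capture.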
Keywords/Search Tags: imbalanced problems, over-sampling, imbalance measurement, variational auto-encoder