
Research on Virtual Sample Generation Technology Based on KDE and Copula Function and Its Application to Imbalanced Dataset Classification

Posted on: 2020-10-20
Degree: Master
Type: Thesis
Country: China
Candidate: S X Wang
GTID: 2428330602461510
Subject: Control Science and Engineering
Abstract/Summary:
Skewed class distributions and information gaps between samples largely restrict the accuracy and soundness of classification algorithms. How to handle imbalanced samples reasonably and effectively, and thereby improve classifier performance, is an active research topic. The main cause of the data imbalance problem is the difference in the number of samples across categories, which shifts the classifier's decision surface; in addition, the lack of information in the original data space leads to insufficient feature learning by the classifier. Virtual sample technology can correct the deviation of the decision surface caused by the unequal category sizes in imbalanced classification, and can fill the information gaps in the original data. In traditional strategies for imbalanced samples, virtual samples are constructed only as linear combinations of the original samples, so the resulting data features cannot describe the samples correctly. Therefore, this thesis proposes a virtual sample generation method, Copula-KDE VSG, based on kernel density estimation (KDE) and the Copula function, to address data skew and information loss in imbalanced classification problems. Using kernel density estimation, the marginal probability density of each data dimension is estimated, and a joint probability density model is built by constructing the copula function. New virtual samples are generated from the constructed joint probability density, and are further refined by a pseudo-labeling technique. An improved Copula-KDE VSG method is also proposed and is shown to further improve the plausibility of the generated virtual samples. Copula-KDE VSG can generate virtual samples that match the characteristics of the original samples and fill the information gaps between samples, thereby improving the classifier's ability to learn positive samples. Two real-world cases (nuclear protein localization data and banknote wavelet transform data) are used to compare the SMOTE method and its improved variant cluster-SMOTE with the proposed method under four classifiers. The comparisons show that the proposed method is effective, practical and advanced. The experimental results show that the virtual samples generated by this method retain the feature information of the original samples and fill the gaps between samples, which indicates that the generated virtual samples are reasonable.
Keywords/Search Tags:unbalanced data, data classification, virtual sample generation, kernel density estimation, Copula function
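
The generation procedure described in the abstract (KDE marginals, a copula for the dependence structure, then sampling from the joint model) can be sketched roughly as follows. This is only an illustrative sketch, not the thesis's implementation: the abstract does not say which copula family or bandwidth rule is used, so a Gaussian copula and Silverman's rule of thumb are assumed here, and the names kde_cdf and copula_kde_sample are hypothetical. The pseudo-labeling refinement step is not covered. The sketch assumes at least two features, none of them constant.

```python
import numpy as np
from scipy.stats import norm


def kde_cdf(points, data, bandwidth):
    """CDF of a Gaussian KDE: the average of per-sample normal CDFs."""
    return norm.cdf((points[:, None] - data[None, :]) / bandwidth).mean(axis=1)


def copula_kde_sample(X, n_new, seed=0, grid_size=512):
    """Draw n_new virtual samples from minority-class data X (n x d, d >= 2).

    Assumptions (not taken from the thesis): Gaussian copula,
    Gaussian kernel, Silverman's rule-of-thumb bandwidth.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    grids, cdfs = [], []
    Z = np.empty((n, d))
    for j in range(d):
        col = X[:, j]
        h = 1.06 * col.std(ddof=1) * n ** (-0.2)  # Silverman's rule
        # Tabulate the marginal KDE CDF on a grid so it can be inverted later.
        grid = np.linspace(col.min() - 4 * h, col.max() + 4 * h, grid_size)
        grids.append(grid)
        cdfs.append(kde_cdf(grid, col, h))
        # Probability integral transform, then map to normal scores.
        u = np.clip(kde_cdf(col, col, h), 1e-6, 1 - 1e-6)
        Z[:, j] = norm.ppf(u)
    # The Gaussian copula is parameterised by the correlation of the scores.
    R = np.corrcoef(Z, rowvar=False)
    z_new = rng.multivariate_normal(np.zeros(d), R, size=n_new)
    u_new = norm.cdf(z_new)
    # Invert each marginal CDF by linear interpolation on its grid.
    return np.column_stack(
        [np.interp(u_new[:, j], cdfs[j], grids[j]) for j in range(d)]
    )


# Usage on synthetic "minority class" data (illustrative values only).
demo_rng = np.random.default_rng(42)
minority = demo_rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], size=40)
virtual = copula_kde_sample(minority, n_new=100)
print(virtual.shape)
```

Unlike SMOTE-style interpolation, which places each new point on a line segment between two existing samples, this kind of sampler draws from an estimated joint density, so new points can fall anywhere the fitted marginals and copula assign probability mass.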