High-dimensional Unbalanced Data Set Classification Algorithm Based On Support Vector Machine And Its Application

Posted on: 2021-09-15
Degree: Master
Type: Thesis
Country: China
Candidate: K Y Luo
Full Text: PDF
GTID: 2518306461973819
Subject: Business Statistics
Abstract/Summary:
With the rapid development of computer storage and data collection technology, the massive data generated in application fields such as genomics, financial early warning, text classification, customer churn prediction, and spam identification exhibit the dual characteristics of high dimensionality and class imbalance. Traditional machine learning and data mining techniques face severe challenges in transforming these highly complex data sets into information with application value. To this end, this thesis addresses the difficulties of classifying high-dimensional imbalanced data and studies the following.

First, to address the class imbalance problem, this thesis proposes a cyclic sampling algorithm, FTL-SMOTE. Existing oversampling techniques balance the data based only on its statistical properties, independently of the subsequent classification algorithm, so the balanced data set may not suit the classifier. FTL-SMOTE therefore takes the classification results of the SVM classifier into account during sampling: under the supervision of the SVM classifier, it applies different SMOTE-based strategies to cyclically oversample the correctly classified and the misclassified minority class samples. In addition, to keep noise samples from interfering with the sampling process, this thesis proposes three principles for identifying noise samples, which are then removed during sampling. Extensive numerical results show that FTL-SMOTE achieves better classification performance than classic SMOTE, other important sampling algorithms, and the standard SVM.

Second, to address data sets that are both high-dimensional and imbalanced, this thesis proposes a combined model, FTL-SMOTE+ISVM-RFE(FPD). The FTL-SMOTE algorithm is first used to balance the data set. Then, on the balanced data set, this thesis proposes a new wrapper feature selection algorithm with embedded filter criteria, ISVM-RFE(FPD). This algorithm improves on the traditional wrapper feature selection algorithm SVM-RFE in two respects: the feature ranking criterion and the feature selection process. Extensive experiments on four published cancer microarray data sets show that ISVM-RFE(FPD) outperforms SVM-RFE and existing wrapper feature selection algorithms with embedded filter criteria in terms of rr_p and G values.

Third, this thesis studies the application of the FTL-SMOTE+ISVM-RFE(FPD) combined model to the financial early warning of listed companies. With the rapid development of global economic integration and the market economy, the financial early-warning data of China's listed companies show the dual characteristics of high dimensionality and class imbalance. To verify the effectiveness of the proposed algorithms on this kind of data, this thesis constructs two new financial early-warning combined models, ISVM-RFE(FPD)+MKSVM and ISVM-RFE(FPD)+CSMKSVM. The oversampling algorithm FTL-SMOTE is introduced into the feature selection process of each model and into the classification process of the first model. Extensive empirical studies show that the proposed combined models outperform other combined models in terms of dimensionality reduction and classification. Among them, the ISVM-RFE(FPD)+CSMKSVM model performs best.
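For readers unfamiliar with the building blocks named in the abstract, the following is a minimal sketch of the classic SMOTE interpolation step that FTL-SMOTE builds on, together with a G-mean metric. It is not the thesis's FTL-SMOTE: the SVM supervision and the noise-removal principles are not reproduced here. The function names `smote_sample` and `g_mean`, their parameters, and the reading of "G values" as the geometric mean of sensitivity and specificity are all assumptions made for illustration.

```python
import math
import random


def smote_sample(minority, k=3, n_new=4, seed=0):
    """Classic SMOTE-style interpolation (hypothetical helper, NOT FTL-SMOTE).

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(p, base),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity (assumed meaning of 'G')."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)


# Toy minority class: the four corners of the unit square.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_sample(minority, k=2, n_new=5)
```

Under this definition, a classifier that assigns every sample to the majority class has a sensitivity of zero and therefore a G-mean of zero, which is why such a metric is commonly preferred over plain accuracy on imbalanced data.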
Keywords/Search Tags: high-dimensional imbalanced data set, feature selection, classification algorithm, SMOTE, SVM-RFE, financial early warning