High-dimensional Unbalanced Data Set Classification Algorithm Based On Support Vector Machine And Its Application

Posted on:2021-09-15

Degree:Master

Type:Thesis

Country:China

Candidate:K Y Luo

Full Text:PDF

GTID:2518306461973819

Subject:Business Statistics

Abstract/Summary:

With the rapid development of computer storage and data collection technology, the massive data generated in application fields such as genomics, financial early warning, text classification, customer churn prediction, and spam identification exhibit the dual characteristics of high dimensionality and class imbalance. Traditional machine learning and data mining techniques face severe challenges in transforming these highly complex data sets into information with application value. To this end, this paper addresses the difficulties of classifying high-dimensional imbalanced data and studies the following.

Firstly, to address the class imbalance problem, this paper proposes a cyclic sampling algorithm, FTL-SMOTE. Traditional oversampling techniques balance the data based only on its statistical properties and are independent of the subsequent classification algorithm, so the balanced data set may not suit the classifier. The proposed algorithm therefore takes the classification results of an SVM classifier into account during sampling: under the supervision of the SVM classifier, correctly and incorrectly classified minority class samples are oversampled cyclically with different SMOTE-based strategies. In addition, to keep noise samples from interfering with the sampling process, this paper proposes three principles for identifying noise samples so that they can be recognized and removed during sampling. Extensive numerical results show that FTL-SMOTE achieves better classification performance than classic SMOTE, other important sampling algorithms, and the standard SVM.

Secondly, to address data sets that are both high-dimensional and imbalanced, this paper proposes a combined model, FTL-SMOTE+ISVM-RFE(FPD). First, the FTL-SMOTE algorithm is used to balance the data set. Then, on the balanced data set, this paper proposes a new wrapper feature selection algorithm with embedded filter criteria, ISVM-RFE(FPD). This algorithm improves on the traditional wrapper feature selection algorithm SVM-RFE in two respects: the feature ranking criterion and the feature selection process. Extensive experiments on four published cancer microarray data sets show that ISVM-RFE(FPD) outperforms SVM-RFE and existing wrapper feature selection algorithms with embedded filter criteria in terms of rr_p and G values.

Thirdly, this paper studies the application of the FTL-SMOTE+ISVM-RFE(FPD) combined model to the financial early warning of listed companies. With the rapid development of global economic integration and the market economy, the financial early-warning data of China’s listed companies exhibit the dual characteristics of high dimensionality and class imbalance. To verify the effectiveness of the proposed algorithms on this kind of data, this paper constructs two new financial early-warning combined models, ISVM-RFE(FPD)+MKSVM and ISVM-RFE(FPD)+CSMKSVM; the oversampling algorithm FTL-SMOTE is introduced into the feature selection process of each model and into the classification process of the first model. Extensive empirical studies show that the combined models proposed in this paper are superior to other combined models in terms of dimensionality reduction and classification, and that ISVM-RFE(FPD)+CSMKSVM performs best.
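
For readers unfamiliar with the baseline that the abstract builds on, the interpolation step of classic SMOTE can be sketched roughly as follows. This is not the FTL-SMOTE algorithm: the SVM supervision, the cyclic sampling strategies, and the three noise-recognition principles are not specified in the abstract and are therefore left out. The function name `smote`, its parameters, and the toy data are hypothetical choices made for illustration, and the base sample is drawn at random rather than by iterating over every minority sample.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by SMOTE-style interpolation.

    minority: list of feature vectors (lists of floats) from the minority class.
    Simplified sketch: the base sample is picked at random for each new point.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other samples
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        # Rank the other minority samples by Euclidean distance to x.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(x, minority[j]))
        # Pick one of the k nearest minority neighbours at random.
        neighbour = minority[rng.choice(others[:k])]
        # The new point lies on the segment between x and that neighbour.
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic


# Toy minority class with two features (made-up values).
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote(minority, n_new=6, k=2)
```

Because every synthetic point is a convex combination of two existing minority samples, plain SMOTE never looks at the classifier; that independence is the limitation the abstract says FTL-SMOTE is designed to address.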

Keywords/Search Tags:

high-dimensional imbalanced data set, feature selection, classification algorithm, SMOTE, SVM-RFE, financial early warning

Related items

1	Research On Feature Selection Algorithm For High-dimensional Imbalanced Class Data
2	Research On The Expansion And Classification Of Several Imbalanced Data Sets Based On C-SMOTE Algorithm
3	Classification Learning Of Imbalanced Data Sets Based On Sampling Processing
4	Research On Classification Method Of High-dimensional Class-imbalanced Data Sets Based On SVM
5	Research Of Ensemble Learning For High-dimensional And Imbalanced Data Classification
6	Research On Imbalanced Data Classification Methods For Industrial Big Data
7	BPSO-SVM Feature Selection And Its Application In Classification
8	Research On Network Intrusion Detection Method For Class Imbalanced Data
9	Feature Selection Based On Particle Swarm Optimization For High-dimensional Imbalanced Data
10	Research On Feature Selection Algorithm Of High-dimensional Data Based On Intelligent Optimization