
Research And Application Of Imbalanced Data Based On Support Vector Machine

Posted on: 2015-11-28  Degree: Master  Type: Thesis
Country: China  Candidate: T Zhang  Full Text: PDF
GTID: 2298330452467973  Subject: Signal and Information Processing
Abstract/Summary:
In many practical applications, such as intrusion detection, medical diagnosis, and fault detection, the data to be classified are imbalanced: some classes contain significantly fewer samples than others, and the minority classes usually carry the more important information. When traditional classification methods are applied to such data, however, their decisions tend to favor the majority classes, which results in a low recognition rate for minority-class samples. How to effectively improve both the classification accuracy on minority-class samples and the overall classification performance has therefore become an active and difficult topic in machine learning and its practical applications.

The Support Vector Machine (SVM) is a classification algorithm based on Statistical Learning Theory and designed for learning from finite samples. After analyzing the shortcomings of the SVM algorithm on imbalanced datasets, this thesis presents three improved methods and builds a prototype system for classifying imbalanced data.

The first method, WSVM, combines cluster-based weighting with a grading-based (staged) SVM classifier. In preprocessing, a K-means algorithm based on a weight assignment model is used to obtain the weights of the majority-class samples. Classification then proceeds in three phases. First, the majority-class samples located at the boundary of each cluster are selected, equal in number to the minority-class samples. Second, the minority-class samples and the selected majority-class samples are classified, which gives an initial classifier. Third, the initial classifier is adjusted using the unselected majority-class samples until an explicit stopping criterion is satisfied, which yields the final classifier.
Extensive experiments on UCI datasets show that WSVM significantly improves the recognition rate for minority-class samples as well as the overall classification performance.

The second method, GA-SMOTE, addresses the shortcomings of SMOTE (Synthetic Minority Over-sampling Technique) by introducing the three basic operators of the Genetic Algorithm into SMOTE. The selection operator is used to select minority-class samples differentially, while the crossover and mutation operators give fine control over the quality of the synthesized minority-class samples. GA-SMOTE is then combined with SVM to handle classification on imbalanced datasets. Extensive experiments on UCI datasets show that GA-SMOTE synthesizes minority-class samples effectively and, together with SVM, gives better classification performance on imbalanced datasets.

The third method, an improved KNN-SVM, combines SVM with K Nearest Neighbor (KNN) to improve classification accuracy on imbalanced data near the SVM hyperplane. In the classification phase, the algorithm computes the distance in feature space from the test sample to the optimal SVM hyperplane. If the distance is greater than a given threshold, the test sample is classified by the SVM; otherwise it is classified by KNN, with the support vectors (SVs) of the different classes serving as its candidate nearest neighbors. Extensive experiments on UCI datasets show that the algorithm significantly improves the recognition rate for minority-class samples as well as the overall classification performance.

Finally, a prototype system for classifying imbalanced datasets is developed using the three improved methods above.
The system consists of three parts: a loading and preprocessing module, a classification module, and a visualization and control module. Tests and operation on real imbalanced datasets show that the system offers good performance and user experience.
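
The KNN-SVM decision rule summarized in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation: it assumes a linear kernel (so the feature space is the input space), labels of +1 and -1, and an already trained SVM given by a hypothetical weight vector `w`, bias `b`, and support-vector arrays. The function name `knn_svm_predict`, the parameters `threshold` and `k`, and all numeric values are illustrative assumptions.

```python
import numpy as np

def knn_svm_predict(x, w, b, support_vectors, sv_labels, threshold=1.0, k=3):
    """Hybrid rule: trust the SVM far from the hyperplane, use KNN near it.

    Assumes a linear SVM (w, b) and labels in {+1, -1}; k should be odd
    so that the neighbor vote cannot tie.
    """
    # Signed geometric distance from x to the hyperplane w.x + b = 0.
    margin = (np.dot(w, x) + b) / np.linalg.norm(w)
    if abs(margin) > threshold:
        # Far from the boundary: classify by the side of the hyperplane.
        return 1 if margin > 0 else -1
    # Near the boundary: majority vote among the k nearest support vectors.
    dists = np.linalg.norm(support_vectors - x, axis=1)
    nearest = np.argsort(dists)[:k]
    vote = int(np.sum(sv_labels[nearest]))
    return 1 if vote > 0 else -1

# Toy values, chosen only to exercise both branches (not a trained model).
w = np.array([1.0, 0.0])
b = 0.0
support_vectors = np.array([[1.0, 0.0], [1.0, 1.0], [-1.0, 0.0]])
sv_labels = np.array([1, 1, -1])

far_point = np.array([3.0, 0.0])     # distance 3.0 > threshold, SVM decides
near_point = np.array([-0.1, 0.9])   # distance 0.1 <= threshold, KNN decides

print(knn_svm_predict(far_point, w, b, support_vectors, sv_labels))
print(knn_svm_predict(near_point, w, b, support_vectors, sv_labels, k=1))
```

In this toy setup the near-boundary point lies on the negative side of the hyperplane, yet its nearest support vector is positive, so the KNN branch can overrule the sign of the SVM output. With a nonlinear kernel, both distances would have to be computed through the kernel function rather than in input space.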
Keywords/Search Tags: SVM, KNN, imbalanced datasets, weight assignment model, genetic operators, SMOTE algorithm