
Research On Support Vector Machine Classification Method For Imbalanced Datasets

Posted on: 2010-02-02
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z M Yang
Full Text: PDF
GTID: 1118360302965512
Subject: Instrument Science and Technology
Abstract/Summary:
Support Vector Machine (SVM) is a machine learning method based on statistical learning theory. Compared with traditional methods such as neural networks, SVM can cope with many practical difficulties such as high dimensionality, nonlinearity and local minima, so it has become a hot topic in the field of machine learning. SVM has a strong theoretical foundation and can achieve excellent generalization ability even when the number of training samples is small. It is therefore well suited to fault diagnosis, which is a typical limited-sample learning problem, and research on SVM-based fault diagnosis methods has both theoretical significance and practical engineering value.

In general, when the diagnosis dataset is balanced, SVM gives satisfactory results. In practical applications, however, fault samples are hard to acquire, which makes the diagnosis dataset highly imbalanced. The classification accuracy of SVM for fault samples is then found to be much worse than for normal samples, which limits the practical application of SVM to circuit fault diagnosis. This dissertation aims to solve the problem that SVM cannot achieve satisfactory classification results on imbalanced datasets. The research work covers two main aspects: data pre-processing methods for imbalanced datasets and SVM modification methods for imbalanced datasets. These methods are then applied to analog circuit fault diagnosis to address the deterioration of SVM classification accuracy caused by imbalanced diagnosis datasets in practice.

The main innovative contributions of this dissertation are as follows.

1. Synthetic Minority Oversampling TEchnique (SMOTE) is an effective over-sampling technique, but when generating synthetic samples it considers neither the true distribution of the minority samples nor the distribution of majority samples in the neighborhood of each minority sample, so it is somewhat blind.
Therefore, a new over-sampling technique, ASMOTE, is proposed. Based on the distribution of the dataset, ASMOTE adjusts the neighbor selection strategy of SMOTE in order to control the quality of the new samples. Simulation results show that the classification accuracy of the SVM classifier is greatly improved after the dataset is preprocessed with ASMOTE.

2. When processing boundary data, traditional sample cutting techniques such as one-sided selection simply remove the boundary samples from the dataset, which loses classification information. To address this, the dissertation proposes a Fuzzy Sampling Cutting Technique based on the K-nearest neighbor method. To address the loss of classification information in traditional random undersampling, the dissertation proposes a Guided Undersampling Technique based on unsupervised learning. Experimental results show that the classification accuracy of SVM on imbalanced datasets is greatly improved after the datasets are preprocessed with these two methods.

3. SVM can be ineffective at classifying minority samples when it learns from imbalanced datasets. To design a proper SVM modification that remedies this, the dissertation first analyzes the true cause of the problem. On this basis, an SVM modification method, μSVM, is proposed. In the new method, the decision region of the minority class is enlarged by adjusting the distance measurement rule in the classification decision function. Empirical study shows that μSVM effectively improves the classification accuracy.

4. The theoretical foundation of SVM rests on a nonlinear mapping from the input space to a high-dimensional feature space that makes the dataset linearly separable, and it is very hard, sometimes impossible, to obtain the explicit form of this nonlinear mapping.
It is therefore difficult to modify SVM effectively in the feature space so that it suits imbalanced classification tasks. To address this, the dissertation proposes a new SVM modification method, BEF-SVM. BEF-SVM uses the Biased Discriminant Analysis criterion to measure class separability for imbalanced datasets during kernel optimization, so that class separability is enlarged, which in turn improves the prediction accuracy for minority samples.

5. For the applied research on fault diagnosis, the dissertation selects two typical circuits as diagnosis targets and simulates their output waveforms in the PSPICE environment. A three-stage data-preprocessing method comprising the Haar wavelet transform, PCA and data normalization is then applied to extract features from the circuits, and these features are used to develop an SVM-based fault diagnosis system. To reproduce the imbalanced classification problem that occurs in practical circuit fault diagnosis, different parameter settings and sampling rates are used in the simulation to generate normal samples and fault samples; the imbalanced dataset classification methods proposed in the dissertation are then applied to solve this imbalance problem. Finally, an SVM classification method suited to practical analog circuit fault diagnosis is developed.
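
As a small illustration of the imbalance problem described in the abstract, the following Python sketch computes accuracy separately for each class. The labels, the 95/5 split, the predictions and the helper name per_class_accuracy are all assumptions made for this example; none of them come from the dissertation.

```python
def per_class_accuracy(y_true, y_pred):
    """Return {label: fraction of that label's samples predicted correctly}."""
    totals, correct = {}, {}
    for t, p in zip(y_true, y_pred):
        totals[t] = totals.get(t, 0) + 1
        correct[t] = correct.get(t, 0) + (t == p)
    return {label: correct[label] / totals[label] for label in totals}


# Hypothetical labels: 95 normal samples and 5 fault samples.
y_true = ["normal"] * 95 + ["fault"] * 5
# Hypothetical classifier output that finds only one of the five faults.
y_pred = ["normal"] * 95 + ["fault"] + ["normal"] * 4

overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
by_class = per_class_accuracy(y_true, y_pred)
```

With these made-up predictions the overall accuracy is 0.96, yet only one of the five fault samples is recognized, which is why accuracy on the minority class has to be reported on its own.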
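
The interpolation step of plain SMOTE, which contribution 1 takes as its starting point, might be sketched as follows. This is a minimal standard-library version of the baseline technique only. It is not ASMOTE, whose neighbor selection rule the abstract does not specify, and the function name smote, its parameters and the toy points are assumptions.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points by SMOTE-style interpolation."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        # k nearest minority neighbours of x; majority samples are never
        # consulted, which is the "blindness" the abstract points out.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(x, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from x to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic


# Hypothetical minority samples on the corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_new=5)
```

Every synthetic point lies on a segment between two minority samples, so a majority sample sitting on that segment would not stop a new minority point from being placed next to it.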
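
Two of the three preprocessing stages named in contribution 5 can be sketched with the standard library alone: one level of the Haar wavelet transform followed by min-max normalization. The PCA stage is left out here, and the waveform values, the function names and the choice of min-max scaling are assumptions, since the abstract does not say which normalization or decomposition depth was used.

```python
import math


def haar_step(signal):
    """One level of the orthonormal Haar transform (even-length input)."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    pairs = range(0, len(signal), 2)
    approx = [(signal[i] + signal[i + 1]) / s for i in pairs]
    detail = [(signal[i] - signal[i + 1]) / s for i in pairs]
    return approx, detail


def min_max_normalize(values):
    """Scale values linearly into [0, 1]; constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical sampled output waveform (not data from the dissertation).
waveform = [4.0, 2.0, 5.0, 5.0, 1.0, 3.0, 0.0, 2.0]
approx, detail = haar_step(waveform)
features = min_max_normalize(approx)
```

The approximation coefficients halve the length of the signal while keeping its coarse shape, which is one common reason to apply a wavelet step before dimensionality reduction.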
Keywords/Search Tags: support vector machines, imbalanced dataset, kernel optimization, data-preprocessing, intelligent fault diagnosis