
Rule Extraction For Imbalanced Data Classification Based On SVM And Its Application In Commercial Bank Failures Prediction

Posted on: 2015-05-28
Degree: Master
Type: Thesis
Country: China
Candidate: J F Zhang
Full Text: PDF
GTID: 2308330461952715
Subject: Control theory and control engineering
Abstract/Summary:
Classification and prediction on imbalanced data is an important research topic in data mining and pattern recognition. Data can be called imbalanced when its distribution, class sizes, features or other characteristics are imbalanced. Imbalanced data classification is generally handled with data-level methods, algorithm-level methods, or modified evaluation criteria. Data-level methods include under-sampling, over-sampling and hybrid sampling. Algorithm-level methods mainly include one-class learning, cost-sensitive learning, boosting, two-phase rule induction, kernel methods, active learning and feature selection. Modified evaluation criteria mainly refer to the weighting and integration of F-Measure, G-Mean and AUC-ROC.

Support vector machine (SVM) is another useful method for this problem. However, an SVM model is regarded as an incomprehensible black box, and it is difficult to understand the model through its kernel function and the corresponding parameters. Rule extraction can explain SVM models with understandable rule sets. This thesis introduces a modified active learning algorithm based on positive support vectors (mPPALBA) to address the imbalanced data classification problem. Comparative simulation experiments on Ripley’s data set, 9 benchmark data sets and the US commercial financial data set covering March 1996 to June 2013 demonstrate the effectiveness and superiority of the mPPALBA algorithm. Building on these results, the thesis puts forward a data mining methodology for commercial bank failure prediction as a reference for future research.

The main contributions of this thesis are as follows:

1. To address the low positive-class accuracy of current imbalanced data classification methods and the black-box problem of SVM models, this thesis presents a new active learning over-sampling rule extraction algorithm, mPPALBA. The algorithm combines active learning with over-sampling based on positive support vectors: it randomly generates new positive samples within a certain distance of a randomly chosen positive support vector, and then extracts a set of comprehensible rules from the relabeled samples and the newly generated positive samples using logical model trees. The mPPALBA algorithm was validated on Ripley’s data set and 9 benchmark data sets using three evaluation metrics: F-Measure, G-Mean and AUC-ROC. The experimental results showed that mPPALBA achieved higher positive-class prediction accuracy on imbalanced data than the learning-based rule extraction algorithm, the ALBA algorithm, the SMOTE algorithm and the BSMOTE algorithm, while maintaining negative-class accuracy.

2. Based on the CAMELS rating system and the experience of banking experts, and taking into account the characteristics of the commercial bank failure prediction problem, this thesis ran experiments on the US commercial financial data set from the Federal Reserve Bank of Chicago covering March 1996 to June 2013. The 1-year and 2-year bank failure rule sets did not improve positive-class accuracy noticeably over the SMOTE, BSMOTE and AdaSyn algorithms. In response to this result, the thesis proposes SVM-RFE-mPPALBA, an algorithm that combines SVM-RFE with mPPALBA for commercial bank failure prediction. The experimental results showed that this algorithm achieved higher prediction accuracy on US commercial bank failure prediction than the SMOTE, BSMOTE and AdaSyn algorithms, and that the 1-quarter and 1-year bank failure rule sets clearly improved positive-class accuracy.

3. Combining the general data mining methodology with the characteristics of commercial bank failure prediction, this thesis puts forward a data mining methodology for commercial bank failure prediction, to provide guidance and reference for future commercial bank failure data mining projects.
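
The over-sampling step described in contribution 1 (generating new positive samples within a certain distance of a randomly chosen positive support vector) can be illustrated with a minimal sketch. The abstract does not specify the sampling distribution or the distance measure, so the version below assumes Euclidean distance and uniform sampling inside a ball of a given radius. The function name, its parameters and the example support vectors are hypothetical; in practice the positive support vectors would come from a trained SVM.

```python
import math
import random


def oversample_near_support_vectors(positive_svs, n_new, radius, seed=None):
    """Generate n_new synthetic positive samples around positive support vectors.

    Assumption (not from the thesis): each new point is drawn uniformly from
    the Euclidean ball of the given radius centred on a randomly chosen
    positive support vector.
    """
    rng = random.Random(seed)
    new_samples = []
    for _ in range(n_new):
        # Pick one positive support vector at random as the centre.
        center = rng.choice(positive_svs)
        dim = len(center)
        # A normalised Gaussian vector gives a uniformly random direction.
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(d * d for d in direction)) or 1.0
        # Scaling by u**(1/dim) makes the points uniform inside the ball.
        r = radius * rng.random() ** (1.0 / dim)
        new_samples.append([c + r * d / norm for c, d in zip(center, direction)])
    return new_samples


# Hypothetical positive support vectors in a 2-D feature space.
example_svs = [[0.2, 0.7], [0.4, 0.9], [0.3, 0.5]]
synthetic = oversample_near_support_vectors(example_svs, n_new=5, radius=0.1, seed=0)
```

Each synthetic point lies within `radius` of at least one support vector, so the new positive samples stay near the decision boundary instead of being spread over the whole minority class.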
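
The abstract evaluates the algorithms with F-Measure and G-Mean because plain accuracy can look good on imbalanced data even when the positive class is never predicted. The sketch below uses the standard textbook definitions of these two metrics computed from confusion-matrix counts; the function names and the example counts are illustrative and are not taken from the thesis.

```python
import math


def f_measure(tp, fp, fn, beta=1.0):
    """F-Measure of the positive class from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity (positive recall) and specificity."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


# Illustrative case: 10 positives and 90 negatives, and a classifier that
# labels everything negative. Accuracy is 0.9, yet both metrics are 0.
always_negative_f = f_measure(tp=0, fp=0, fn=10)
always_negative_g = g_mean(tp=0, fn=10, tn=90, fp=0)
```

Both metrics drop to zero for the always-negative classifier, which is why they are better suited than accuracy for comparing methods on the positive class.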
Keywords/Search Tags:Imbalanced data, Support Vector Machine, Over Sampling, Commercial Bank Failures Prediction, Active Learning, Feature Selection