Research On Imbalanced Data Classification Based On Monte Carlo Neural Network Algorithm

Posted on: 2020-12-14
Degree: Doctor
Type: Dissertation
Country: China
Candidate: F Feng
Full Text: PDF
GTID: 1368330620451659
Subject: Computer Science and Technology
Abstract/Summary:
Classification algorithms are widely used in pattern recognition, prediction and other fields. Traditional classification research usually rests on two assumptions: (1) the number of samples in each category is approximately the same; (2) the cost of misclassification is approximately the same for each category. In practical applications these two assumptions often do not hold, and the data obtained is usually imbalanced. When a traditional classification algorithm is applied to imbalanced data, the desired effect is not achieved; in particular, the recognition rate of minority class samples is relatively low. Yet in practice minority class samples (such as intrusion behavior or cancer patients) are very important data and must be recognized at a high rate. Papers published at the top international machine learning conferences over the last 10 years show that researchers are paying more and more attention to imbalanced data classification. Although researchers have proposed sampling algorithms, cost-sensitive algorithms and feature selection algorithms to address imbalanced classification, the problems of overfitting, category imbalance and feature redundancy remain serious, so studying the classification of imbalanced data is of great practical significance. This thesis studies imbalanced data classification algorithms based on the Monte Carlo Neural Network at three levels: data, algorithm and feature. The details are as follows:

(1) Aiming at the overfitting problem of traditional classification algorithms in imbalanced classification, the Monte Carlo Neural Network Algorithm (MCNNA) is studied. MCNNA guides its training with the strategies of Empirical Risk Minimization, Structural Risk Minimization and Design Risk Minimization to reduce overfitting. In addition, MCNNA improves its accuracy by selecting more hidden layer nodes. Generally, accuracy no longer increases once the number of hidden layer nodes reaches 10 times the number of input nodes. Experiments on the phishing data set show that MCNNA is better than the traditional classification algorithms at reducing overfitting. Compared with the Naive Bayes algorithm, MCNNA improves the overall recognition rate by 6.48%, the true positive rate by 6.78%, and the false positive rate by 6.25%. Overall, MCNNA performs better on multiple indicators than the seven other classification algorithms it is compared with.

(2) Aiming at the problems of category imbalance and feature redundancy in imbalanced classification, a hybrid oversampling and feature selection algorithm based on the Monte Carlo Neural Network is proposed: the Negative Binary General (NBG) algorithm. The NBG algorithm selects minority class samples and their nearest majority class samples and generates effective samples from them by oversampling to reduce category imbalance, extracts key features through the Binary Ant Lion Algorithm to remove redundant features, and uses MCNNA as the classification algorithm to reduce overfitting. Classification experiments are performed on seven imbalanced data sets. Compared with the traditional classification algorithms, the NBG algorithm removes redundant features more effectively, reduces class imbalance, and classifies minority class samples better. On the data sets breast_tissue, bupa, cleveland, ecoli01VS235, glass4, wisconsin and glass6, compared with MCNNA, the NBG algorithm improves the recognition rate of minority class samples by 62.5%, 17.24%, 66.67%, 24%, 33.33%, 4.17% and 33.33%, respectively.

(3) Traditional classification algorithms use the same misclassification cost for different types of samples, which results in a low recognition rate of minority class samples. To address this, Cost Sensitive Monte Carlo Adaptation (CSMCA) is proposed. The CSMCA algorithm optimizes MCNNA by selecting cost parameters for minority class samples and by extracting key features through the Binary Ant Lion Optimizer Algorithm, which addresses the problems of cost parameter selection and feature redundancy in imbalanced data classification. Classification experiments are performed on seven imbalanced data sets. Compared with traditional classification algorithms, the cost parameter selected by the CSMCA algorithm is more effective, the overall classification performance is better, and the classification performance on minority class samples is significantly improved. On the data sets breast_tissue, bupa, cleveland, ecoli01VS235, glass4, wisconsin and glass6, compared with MCNNA, the CSMCA algorithm improves the recognition rate of minority class samples by 62.5%, 17.24%, 66.67%, 28%, 33.3%, 4.17% and 33.33%, respectively.

(4) Aiming at the problems of class imbalance, feature redundancy and overfitting in high-dimensional, small-sample imbalanced classification, the Union Information Negative (UIN) algorithm and the Union Information Cost (UIC) algorithm are proposed by combining a Filter feature selection algorithm with a Wrapper feature selection algorithm. Both algorithms extract effective features by taking the union of the features selected by Information Gain and by the Gini Index. The difference is that the UIN algorithm uses the NBG algorithm as the Wrapper feature selection algorithm to reduce category imbalance and overfitting, whereas the UIC algorithm uses the CSMCA algorithm for the same purpose. The UIN and UIC algorithms are used to perform classification experiments on seven high-dimensional, small-sample imbalanced gene data sets. The results show that the two algorithms effectively reduce class imbalance, reduce the negative effects of overfitting, and extract the key features. On the seven gene data sets, the overall recognition rate and the minority class recognition rate of both the UIN algorithm and the UIC algorithm are 1, which is better than the traditional classification algorithms.
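
The indicators reported above (overall recognition rate, true positive rate, false positive rate, recognition rate of minority class samples) can be illustrated with a minimal sketch. It rests on assumptions that the abstract does not state: the minority class is taken as the positive class, "overall recognition rate" is read as accuracy, and the minority class recognition rate is read as the true positive rate. The function name `binary_rates` and the toy labels are hypothetical and are not taken from the thesis.

```python
from collections import Counter


def binary_rates(y_true, y_pred, minority_label=1):
    """Return accuracy, true positive rate and false positive rate.

    Assumption of this sketch: the minority class is the positive class.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # Count (is_actually_minority, is_predicted_minority) pairs.
    counts = Counter(
        (t == minority_label, p == minority_label)
        for t, p in zip(y_true, y_pred)
    )
    tp = counts[(True, True)]    # minority samples recognized
    fn = counts[(True, False)]   # minority samples missed
    fp = counts[(False, True)]   # majority samples flagged as minority
    tn = counts[(False, False)]  # majority samples recognized
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "tpr": tp / (tp + fn) if (tp + fn) else 0.0,
        "fpr": fp / (fp + tn) if (fp + tn) else 0.0,
    }


# Toy data: 8 majority samples (label 0) and 2 minority samples (label 1).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
always_majority = [0] * 10                   # ignores the minority class
mixed = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]       # finds one minority sample

print(binary_rates(y_true, always_majority))
print(binary_rates(y_true, mixed))
```

On this toy data both predictors reach the same accuracy, while only the second recognizes any minority sample. This is why the abstract reports the minority class recognition rate separately from the overall recognition rate.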
Keywords/Search Tags: Monte Carlo Neural Network Algorithm, imbalanced classification, oversampling, cost-sensitive learning, feature selection