Imbalanced Data Classification Based On Multi-classifier Ensemble And Semi-supervised Learning

Posted on: 2017-07-16
Degree: Master
Type: Thesis
Country: China
Candidate: X Y Xiang
GTID: 2348330482491338
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of Internet media and technology, the Internet holds a huge and continuously growing amount of information. How to retrieve the information that meets a user's needs from massive, constantly updated data has become an important subject in machine learning. At the same time, the imbalanced data classification problem has become increasingly prominent. Traditional machine learning methods are not well suited to such problems, so research on imbalanced data classification faces great challenges.

In general, there are two popular approaches to the imbalanced data classification problem: one works at the data-set level and the other at the algorithm level, and both have been studied intensively. The main work of this paper is as follows:

1. Imbalanced Data Classification Based on Multi-classifier Ensemble

Multi-classifier ensemble is one of the key techniques for solving the imbalanced data classification problem. To ensure good ensemble performance, two aspects need to be improved: the classification accuracy of each individual weak classifier, and the diversity among the classifiers. In an imbalanced data set, positive and negative samples are unevenly distributed, which gives a classifier a very low recognition rate on the rare class and degrades its overall performance. To address this, this paper proposes a new multi-classifier ensemble approach based on KPCA and RST. First, features are extracted and the most representative ones are selected to reduce the dimensionality of the imbalanced data set. Then the sample distribution of the training set is changed by a data-set reconstruction method to reduce the degree of imbalance, which improves the classification accuracy of each weak classifier. Because the reconstruction method uses random sampling, the diversity among the classifiers is also enhanced to some extent.

2. Imbalanced Data Classification Based on Semi-supervised Learning

In imbalanced data classification, rare-class samples are very scarce. To make effective use of the abundant unlabeled data in the data set, this paper applies a semi-supervised learning method to the problem. First, three different classifiers are trained collaboratively in an improved Tri-training algorithm in order to increase their diversity. Then a weighted voting scheme based on classification accuracy is used to combine the classifiers, which improves the prediction accuracy on the samples to be classified. Finally, experimental results show that the improved method proposed in this paper achieves high accuracy and recall on imbalanced data classification.
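The two mechanisms named in the abstract, rebalancing the training set by random sampling and combining classifiers by accuracy-weighted voting, can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract does not say how the data set is reconstructed, so random undersampling of the majority class is an assumption made here, and the function names `undersample` and `weighted_vote` are hypothetical.

```python
import random
from collections import defaultdict


def undersample(samples, labels, rng):
    """Randomly keep as many samples of each class as the rarest class has.

    Assumption: the abstract only says random sampling is used to reduce
    imbalance; plain undersampling is one simple way to do that.
    """
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    n_min = min(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        for x in rng.sample(xs, n_min):  # sampling without replacement
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y


def weighted_vote(predictions, accuracies):
    """Combine one predicted label per classifier, weighting each vote
    by that classifier's (e.g. validation) accuracy."""
    scores = defaultdict(float)
    for label, acc in zip(predictions, accuracies):
        scores[label] += acc
    # The label with the largest total weight wins.
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # 10 rare-class samples (label 1) against 90 majority samples (label 0).
    X = list(range(100))
    y = [1] * 10 + [0] * 90
    bx, by = undersample(X, y, random.Random(0))
    print(len(bx), by.count(0), by.count(1))

    # One accurate classifier can outvote two weak ones that disagree with it.
    print(weighted_vote([1, 0, 0], [0.9, 0.4, 0.4]))
```

Calling `undersample` with a different random seed for each base classifier gives each one a different balanced training set, which is one way the random sampling described above can also increase diversity among the classifiers.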
Keywords/Search Tags: Imbalanced data classification, Multi-classifier ensemble, Semi-supervised learning, Co-training, Tri-training