
Selection And Classification Of Unbalanced Data Based On Semi-Supervised And Integrated Learning

Posted on: 2017-05-25
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X N Fang
GTID: 1108330482493381
Subject: Management Science and Engineering
Abstract/Summary:
With the advent of the big data era, machine learning and data mining technologies face unprecedented opportunities and challenges. As one of the core research areas in machine learning, classification has drawn much attention from researchers and has prompted the emergence of many classical theories, algorithms, models and software applications. In real applications, however, the class distribution of the collected data is often uneven, which significantly reduces the classification accuracy of traditional classifiers. Such problems are known as class imbalance problems: the number of samples in one class is considerably smaller than in another class (or in several other classes). Class imbalance arises in many fields, such as telecommunications, the Internet, ecology, biology and medicine, and it is regarded as one of the most prominent issues in the data mining community.

From a learning perspective, the minority class usually carries more important classification information and incurs a higher misclassification cost. However, the minority class is often difficult to identify because a) minority samples are usually associated with unusual and important cases, and b) minority samples are relatively costly to obtain. In addition, since most traditional classification algorithms are designed for balanced training sets, they may produce unsatisfactory classifiers when faced with imbalanced datasets.

In recent years, as class imbalance problems have appeared in various application scenarios, imbalanced data classification has become one of the focal points of machine learning and data mining research. This thesis studies classification and feature selection for imbalanced data based on ensemble learning and semi-supervised learning methods.
The main contributions of this thesis are summarized as follows:

1) Web spam causes considerable trouble for search engine companies. To address it, this thesis presents a new method, SMOTERF, which combines the over-sampling method SMOTE with Random Forest. Comparative experiments on the WEBSPAM-UK2007 dataset show that this method clearly outperforms the other methods, especially in AUC. Even compared with a parameter-optimized Random Forest, it achieves a higher AUC value. The method is simple, generalizes well, and can be used to detect search engine spam pages.

2) Inspired by Rotation Forest, a recently introduced and efficient ensemble learning algorithm, this thesis proposes three improved algorithms for imbalanced web spam detection and highly imbalanced data classification. First, it uses SMOTE to balance the original class distribution of the web spam dataset and then applies an improved nested Rotation Forest algorithm for classification. Experimental results show that combining SMOTE with nested Rotation Forest significantly improves classification performance on the imbalanced web spam dataset. Second, for highly imbalanced datasets, it integrates two classical preprocessing methods for imbalanced data, random under-sampling and SMOTE, into the feature extraction process of Rotation Forest, yielding two improved algorithms: SROForest and RUROForest. Comparative experiments on 22 highly imbalanced datasets show that the proposed methods significantly improve AUC values.
Nonparametric statistical tests also show that the proposed methods, especially RUROForest, outperform the other methods compared.

3) Imbalanced class distributions and very few labeled samples often occur together in practice. To address the classification of imbalanced and under-labeled web spam datasets, this thesis presents a series of methods that combine SMOTE with self-labeling techniques and a multi-classifier model within a semi-supervised classification framework. Comparative results on the partially labeled WEBSPAM-UK2007 datasets show that the proposed methods, especially those based on the multi-classifier model, considerably improve the recall of the spam class and the overall AUC without substantially decreasing the overall accuracy. They represent an effective strategy for classification problems that have both a small number of labeled samples and class-imbalanced data.

4) For ovarian cancer diagnosis and survival prediction with high-dimensional, imbalanced microarray data, this thesis presents IFSRF, a filter-based feature selection algorithm for imbalanced data built on Random Forest. During feature selection, the algorithm uses the AUC value as its evaluation criterion, which significantly reduces the negative impact of the imbalanced class distribution on the classification system. Experimental results on three imbalanced datasets, covering ovarian cancer diagnosis, survival prediction and recurrence prediction, show that IFSRF not only significantly improves the AUC values of all classifiers, especially Random Forest, but also slightly increases the overall accuracy.
This method is simple and robust, and can be widely applied to the classification of cancer microarray datasets.

In summary, this thesis focuses on imbalanced data classification problems, including web spam detection, highly imbalanced data classification, and the diagnosis and survival prediction of ovarian cancer. Corresponding solutions are proposed from several perspectives: sample preprocessing, ensemble learning, semi-supervised learning and feature selection. The experimental results demonstrate their effectiveness, and the work can provide helpful information for future research on imbalanced classification.
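
Several of the contributions summarized above use SMOTE to rebalance the training set before an ensemble classifier is trained. The following is a minimal sketch of the SMOTE idea only, written in plain Python: each synthetic sample is placed at a random point on the line segment between a minority sample and one of its nearest minority neighbours. The function name `smote`, its parameters and the toy data are illustrative assumptions, not the thesis's implementation, and the subsequent Random Forest or Rotation Forest training step is not shown.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create synthetic minority samples by interpolating between neighbours.

    minority:    list of feature vectors (lists of floats) from the rare class
    n_synthetic: number of new samples to generate
    k:           number of nearest minority neighbours to choose from
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Place the new point at a random position on the segment
        # from the base sample to the chosen neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy data (hypothetical values): 3 minority samples against 8 majority samples.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1]]
majority = [[5.0 + 0.1 * i, 5.0 - 0.1 * i] for i in range(8)]

# Generate just enough synthetic samples to match the majority class size.
new_points = smote(minority, n_synthetic=len(majority) - len(minority), k=2)
balanced_minority = minority + new_points
print(len(balanced_minority), len(majority))  # both classes now have 8 samples
```

In a pipeline of the kind described above, the balanced training set would then be passed to the chosen classifier, and performance would be compared on AUC rather than accuracy alone, since accuracy can look high on imbalanced data even when the minority class is rarely detected.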
Keywords/Search Tags: Ensemble learning, Semi-supervised learning, Imbalanced classification, Feature selection, Web spam, Random Forest, Microarray data