
Research On Integrated Classification Algorithm Based On Rough Set Attribute Reduction

Posted on: 2017-09-15
Degree: Master
Type: Thesis
Country: China
Candidate: J Zhao
Full Text: PDF
GTID: 2358330503488917
Subject: Computer software and theory
Abstract/Summary:
The main task of automatic text classification (TC) is to assign unstructured documents to the correct categories of a classification system. TC is widely used, for example to classify news items on a news portal, recommend personalized advertisements, filter spam, and manage resources in a digital library. Ensemble learning is one of the main research directions in machine learning. Its core idea is to classify each sample with a number of base classifiers and then combine their outputs into the final result. Compared with a single classifier, an ensemble generally achieves better classification accuracy and generalization ability. This thesis applies ensemble learning to text classification, studies the relevant theory of TC and ensemble learning, and introduces the key techniques in detail. We propose a feature selection method suited to high-dimensional data and two ensemble learning algorithms. Specifically, the main work of this thesis is as follows:

1. Converting a collection of texts to the vector space model yields a high-dimensional, sparse matrix, and reducing the feature dimension with information gain or rough set reduction alone has shortcomings. This thesis therefore combines information gain with rough set reduction and proposes a two-step feature selection method based on rough set theory. The method exploits the complementary strengths of the two techniques and filters redundant features as thoroughly as possible.

2. This thesis proposes the re-sampling re-attributing ensemble classifiers (RRE_Classifiers) algorithm. Drawing on Bagging and Random Forests, it first takes repeated bootstrap samples from the original training set and then samples the features of each bootstrap set, yielding the final training sets. Because the resulting base classifiers differ from one another in both samples and features, the method can employ more classifiers than the Bagging algorithm and achieves better performance.

3. This thesis proposes the error-pool-based Bagging classifiers (EBB_Classifiers) algorithm. It maintains an error pool containing samples misclassified by earlier classifiers; some of these samples are randomly selected and added to the follow-up training sets, so that earlier results guide later training.

4. This thesis analyzes the complexity of Bagging and of the two improved algorithms, and an experiment compares the running times of the three.

5. Experiments on the TanCorp, Sogou, and Fudan University corpora indicate that the feature selection and ensemble classification algorithms proposed in this thesis perform well.
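The two-step feature selection described in item 1 could be sketched roughly as follows. This is an illustrative simplification, not the thesis's exact procedure: documents are modelled as sets of terms, step 1 ranks terms by information gain, and step 2 applies a greatly simplified rough-set-style reduction that greedily drops terms whose removal still leaves every pair of differently labelled documents discernible. All function names here are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label list (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(docs, labels, term):
    """IG of the binary feature 'term occurs in the document'."""
    with_t = [y for d, y in zip(docs, labels) if term in d]
    without_t = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    cond = len(with_t) / n * entropy(with_t) + len(without_t) / n * entropy(without_t)
    return entropy(labels) - cond

def select_by_ig(docs, labels, k):
    """Step 1: keep the k terms with the highest information gain."""
    vocab = set().union(*docs)
    ranked = sorted(vocab, key=lambda t: information_gain(docs, labels, t), reverse=True)
    return set(ranked[:k])

def is_consistent(docs, labels, feats):
    """The subset is consistent if no two differently labelled docs
    collapse to the same projection onto feats (all pairs stay discernible)."""
    seen = {}
    for d, y in zip(docs, labels):
        key = frozenset(d & feats)
        if seen.setdefault(key, y) != y:
            return False
    return True

def rough_reduct(docs, labels, feats):
    """Step 2: greedily drop terms whose removal keeps the projection consistent."""
    reduct = set(feats)
    for t in sorted(feats):
        if len(reduct) > 1 and is_consistent(docs, labels, reduct - {t}):
            reduct.discard(t)
    return reduct
```

On a toy corpus, step 1 keeps the most informative terms and step 2 then removes the ones that are redundant given the rest, which mirrors the intent of combining the two methods.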
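The two-step sampling in the RRE_Classifiers algorithm (item 2) might be sketched as below, under stated assumptions: the function names are hypothetical, and a tiny 1-nearest-neighbour learner stands in for whatever base classifier the thesis actually uses. Each base classifier gets a bootstrap sample of the rows and then a random subset of the features, and prediction is a majority vote.

```python
import random
from collections import Counter

def one_nn(Xs, ys):
    """Tiny stand-in base learner: 1-nearest-neighbour on the training rows."""
    def predict(x):
        i = min(range(len(Xs)), key=lambda i: sum((a - b) ** 2 for a, b in zip(Xs[i], x)))
        return ys[i]
    return predict

def rre_train(X, y, n_estimators, feat_frac, learner, seed=0):
    """Re-sampling re-attributing ensemble: bootstrap the samples (step 1),
    then subsample the features of each bootstrap set (step 2)."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    models = []
    for _ in range(n_estimators):
        rows = [rng.randrange(n) for _ in range(n)]                       # bootstrap rows
        feats = sorted(rng.sample(range(d), max(1, int(feat_frac * d))))  # feature subset
        Xs = [[X[i][j] for j in feats] for i in rows]
        models.append((feats, learner(Xs, [y[i] for i in rows])))
    return models

def rre_predict(models, x):
    """Majority vote over all base classifiers."""
    votes = [m([x[j] for j in feats]) for feats, m in models]
    return Counter(votes).most_common(1)[0][0]
```

Because both the rows and the columns vary per classifier, the number of distinct training sets (and hence usable base classifiers) is larger than with row resampling alone, which is the intuition behind the claim in item 2.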
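The error-pool mechanism of EBB_Classifiers (item 3) could look roughly like this. Again a hedged sketch with hypothetical names and a 1-nearest-neighbour stand-in learner: after each round, the indices a classifier gets wrong form the pool, and a random portion of the pool is appended to the next bootstrap training set so that later classifiers see more of the hard samples.

```python
import random
from collections import Counter

def one_nn(Xs, ys):
    """Tiny stand-in base learner: 1-nearest-neighbour on the training rows."""
    def predict(x):
        i = min(range(len(Xs)), key=lambda i: sum((a - b) ** 2 for a, b in zip(Xs[i], x)))
        return ys[i]
    return predict

def ebb_train(X, y, n_estimators, learner, pool_frac=0.5, seed=0):
    """Error-pool Bagging: indices misclassified by the previous classifier are
    kept in a pool, and some of them are injected into the next bootstrap set."""
    rng = random.Random(seed)
    n = len(X)
    pool, models = [], []
    for _ in range(n_estimators):
        rows = [rng.randrange(n) for _ in range(n)]          # ordinary bootstrap
        if pool:                                             # re-use some hard samples
            rows += rng.sample(pool, max(1, int(pool_frac * len(pool))))
        model = learner([X[i] for i in rows], [y[i] for i in rows])
        models.append(model)
        pool = [i for i in range(n) if model(X[i]) != y[i]]  # refresh the error pool
    return models

def ebb_predict(models, x):
    """Majority vote over all base classifiers."""
    votes = [m(x) for m in models]
    return Counter(votes).most_common(1)[0][0]
```

Unlike plain Bagging, the rounds here are no longer independent: each training set is biased toward the previous classifier's mistakes, which is how earlier results guide the follow-up training.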
Keywords/Search Tags:text mining, text classification, rough set, feature selection, ensemble classification, ensemble learning