
Research On Integrated Classification Algorithm Based On Rough Set Attribute Reduction

Posted on: 2017-09-15
Degree: Master
Type: Thesis
Country: China
Candidate: J Zhao
Full Text: PDF
GTID: 2358330503488917
Subject: Computer software and theory
Abstract/Summary:
The main task of automatic text classification (TC) is to assign unstructured documents to the correct categories of a classification system. TC is widely used, for example to classify news items on a news portal, recommend personalized advertisements, filter spam, and manage resources in a digital library. Ensemble learning is one of the main research directions in machine learning. Its core idea is to classify each sample with a number of base classifiers and then combine their outputs into the final result. Compared with a single classifier, an ensemble generally achieves better classification accuracy and generalization ability. This thesis applies ensemble learning to text classification, studies the relevant theory of TC and ensemble learning, and introduces the key techniques in detail. We propose a feature selection method suited to high-dimensional data and two ensemble learning algorithms. Specifically, the main work of this thesis is as follows:

1. Converting a collection of texts to the vector space model yields a high-dimensional, sparse matrix, and reducing the feature dimension with information gain or rough set reduction alone has shortcomings. This thesis therefore combines information gain with rough set reduction and proposes a two-step feature selection method based on rough set theory. The method exploits the complementary strengths of the two techniques and filters redundant features as thoroughly as possible.

2. This thesis proposes the re-sampling re-attributing ensemble classifiers (RRE_Classifiers) algorithm. Drawing on Bagging and Random Forests, it first takes repeated bootstrap samples from the original training set and then samples the features of each bootstrap set, yielding the final training sets. Because the resulting base classifiers differ from one another in both samples and features, the method can employ more classifiers than the Bagging algorithm and achieves better performance.

3. This thesis proposes the error-pool-based Bagging classifiers (EBB_Classifiers) algorithm. It maintains an error pool containing samples misclassified by earlier classifiers; some of these samples are randomly selected and added to the follow-up training sets, so that earlier results guide later training.

4. This thesis analyzes the complexity of Bagging and of the two improved algorithms, and an experiment compares the running times of the three.

5. Experiments on the TanCorp, Sogou, and Fudan University corpora indicate that the feature selection and ensemble classification algorithms proposed in this thesis perform well.
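The two-step feature selection described in item 1 could be sketched roughly as follows. This is an illustrative simplification, not the thesis's exact procedure: documents are modelled as sets of terms, step 1 ranks terms by information gain, and step 2 applies a greatly simplified rough-set-style reduction that greedily drops terms whose removal still leaves every pair of differently labelled documents discernible. All function names here are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label list (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(docs, labels, term):
    """IG of the binary feature 'term occurs in the document'."""
    with_t = [y for d, y in zip(docs, labels) if term in d]
    without_t = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    cond = len(with_t) / n * entropy(with_t) + len(without_t) / n * entropy(without_t)
    return entropy(labels) - cond

def select_by_ig(docs, labels, k):
    """Step 1: keep the k terms with the highest information gain."""
    vocab = set().union(*docs)
    ranked = sorted(vocab, key=lambda t: information_gain(docs, labels, t), reverse=True)
    return set(ranked[:k])

def is_consistent(docs, labels, feats):
    """The subset is consistent if no two differently labelled docs
    collapse to the same projection onto feats (all pairs stay discernible)."""
    seen = {}
    for d, y in zip(docs, labels):
        key = frozenset(d & feats)
        if seen.setdefault(key, y) != y:
            return False
    return True

def rough_reduct(docs, labels, feats):
    """Step 2: greedily drop terms whose removal keeps the projection consistent."""
    reduct = set(feats)
    for t in sorted(feats):
        if len(reduct) > 1 and is_consistent(docs, labels, reduct - {t}):
            reduct.discard(t)
    return reduct
```

On a toy corpus, step 1 keeps the most informative terms and step 2 then removes the ones that are redundant given the rest, which mirrors the intent of combining the two methods.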
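The two-step sampling in the RRE_Classifiers algorithm (item 2) might be sketched as below, under stated assumptions: the function names are hypothetical, and a tiny 1-nearest-neighbour learner stands in for whatever base classifier the thesis actually uses. Each base classifier gets a bootstrap sample of the rows and then a random subset of the features, and prediction is a majority vote.

```python
import random
from collections import Counter

def one_nn(Xs, ys):
    """Tiny stand-in base learner: 1-nearest-neighbour on the training rows."""
    def predict(x):
        i = min(range(len(Xs)), key=lambda i: sum((a - b) ** 2 for a, b in zip(Xs[i], x)))
        return ys[i]
    return predict

def rre_train(X, y, n_estimators, feat_frac, learner, seed=0):
    """Re-sampling re-attributing ensemble: bootstrap the samples (step 1),
    then subsample the features of each bootstrap set (step 2)."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    models = []
    for _ in range(n_estimators):
        rows = [rng.randrange(n) for _ in range(n)]                       # bootstrap rows
        feats = sorted(rng.sample(range(d), max(1, int(feat_frac * d))))  # feature subset
        Xs = [[X[i][j] for j in feats] for i in rows]
        models.append((feats, learner(Xs, [y[i] for i in rows])))
    return models

def rre_predict(models, x):
    """Majority vote over all base classifiers."""
    votes = [m([x[j] for j in feats]) for feats, m in models]
    return Counter(votes).most_common(1)[0][0]
```

Because both the rows and the columns vary per classifier, the number of distinct training sets (and hence usable base classifiers) is larger than with row resampling alone, which is the intuition behind the claim in item 2.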
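The error-pool mechanism of EBB_Classifiers (item 3) could look roughly like this. Again a hedged sketch with hypothetical names and a 1-nearest-neighbour stand-in learner: after each round, the indices a classifier gets wrong form the pool, and a random portion of the pool is appended to the next bootstrap training set so that later classifiers see more of the hard samples.

```python
import random
from collections import Counter

def one_nn(Xs, ys):
    """Tiny stand-in base learner: 1-nearest-neighbour on the training rows."""
    def predict(x):
        i = min(range(len(Xs)), key=lambda i: sum((a - b) ** 2 for a, b in zip(Xs[i], x)))
        return ys[i]
    return predict

def ebb_train(X, y, n_estimators, learner, pool_frac=0.5, seed=0):
    """Error-pool Bagging: indices misclassified by the previous classifier are
    kept in a pool, and some of them are injected into the next bootstrap set."""
    rng = random.Random(seed)
    n = len(X)
    pool, models = [], []
    for _ in range(n_estimators):
        rows = [rng.randrange(n) for _ in range(n)]          # ordinary bootstrap
        if pool:                                             # re-use some hard samples
            rows += rng.sample(pool, max(1, int(pool_frac * len(pool))))
        model = learner([X[i] for i in rows], [y[i] for i in rows])
        models.append(model)
        pool = [i for i in range(n) if model(X[i]) != y[i]]  # refresh the error pool
    return models

def ebb_predict(models, x):
    """Majority vote over all base classifiers."""
    votes = [m(x) for m in models]
    return Counter(votes).most_common(1)[0][0]
```

Unlike plain Bagging, the rounds here are no longer independent: each training set is biased toward the previous classifier's mistakes, which is how earlier results guide the follow-up training.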
Keywords/Search Tags:text mining, text classification, rough set, feature selection, ensemble classification, ensemble learning