Research And Application In Text Classification Based On Random Forest

Posted on:2019-12-19

Degree:Master

Type:Thesis

Country:China

Candidate:Q L Zhang

Full Text:PDF

GTID:2428330566487287

Subject:Engineering

Abstract/Summary:

With the rapid development of computer technology and the spread of the Internet, people have created a large amount of information and data online. This information is growing explosively, and we are entering the era of big data. Faced with this volume of data, people urgently need effective methods for organizing and managing it, for knowledge discovery, and for mining implicit associations. Text classification is a key part of solving these problems.

Random forest is an ensemble learning method proposed by Breiman in 2001. It builds an ensemble classifier by combining multiple decision trees, and its basic idea combines bagging with the random subspace method. Compared with other classification algorithms, random forest offers high classification accuracy, resistance to overfitting, good tolerance of noise and outliers, and easy parallelization. Because of these advantages, random forest has been widely used in text classification and has achieved good results.

However, the traditional random forest algorithm does not distinguish between base classifiers according to their classification performance. Because the base classifiers perform unevenly, this affects the performance of the forest: with equal-weight voting, the negative influence of poorly performing base classifiers cannot be reduced and the positive influence of well-performing ones cannot be strengthened, which lowers the overall classification performance of the random forest. Random forest can also perform poorly on multi-class data with high-dimensional features. Selecting features with equal probability has a larger negative impact on classification performance when the feature subset is relatively small. When the size of the feature subset is limited, it is harder to select discriminative features, which greatly reduces the classification strength of the base classifiers and thereby weakens the generalization ability of the random forest.

To address both problems, this thesis proposes a random forest algorithm based on weighted voting and weighted feature selection. In the voting phase, it strengthens the influence of highly reliable base classifiers and reduces the influence of unreliable ones; in the feature selection phase, discriminative features have a greater probability of being selected into the feature subset. Experiments show that the proposed algorithm achieves better classification performance than other random forest algorithms and other commonly used classification algorithms, at the cost of a relatively long running time.
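The two ideas named in the abstract, weighted voting and weighted feature selection, can be illustrated with a minimal Python sketch. The abstract does not say how the weights or feature scores are computed, so everything below is an assumption made for illustration: the function names `weighted_vote` and `sample_feature_subset` are hypothetical, the per-classifier weights stand in for some reliability estimate (for example, out-of-bag accuracy), and the per-feature scores stand in for some relevance measure. It is not the thesis's actual algorithm.

```python
import random
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Return the label whose supporting base classifiers have the largest total weight.

    predictions: one predicted label per base classifier.
    weights: one non-negative reliability score per base classifier
             (assumed here; the abstract does not define the weighting).
    """
    if len(predictions) != len(weights):
        raise ValueError("predictions and weights must have equal length")
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    # Ties are broken in favor of the label seen first.
    return max(totals, key=totals.get)


def sample_feature_subset(scores, k, rng=random):
    """Draw up to k distinct feature indices, favoring high-scoring features.

    scores: one non-negative relevance score per feature (assumed measure).
    rng: the random module or a random.Random instance.
    """
    remaining = list(range(len(scores)))
    # A tiny constant keeps zero-score features selectable and avoids
    # an all-zero weight vector, which random.choices rejects.
    weights = [scores[i] + 1e-9 for i in remaining]
    chosen = []
    for _ in range(min(k, len(remaining))):
        pick = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        chosen.append(remaining.pop(pick))
        weights.pop(pick)
    return chosen


if __name__ == "__main__":
    votes = ["sports", "politics", "politics"]
    reliability = [0.9, 0.3, 0.4]
    # Equal-weight voting would pick "politics" (2 votes to 1);
    # here "sports" wins because 0.9 > 0.3 + 0.4.
    print(weighted_vote(votes, reliability))
    print(sample_feature_subset([5.0, 0.0, 1.0, 3.0], k=2))
```

With equal weights the first function reduces to ordinary majority voting, and with equal scores the second reduces to the uniform feature sampling of a standard random forest, which is the baseline behavior the abstract criticizes.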

Keywords/Search Tags:

random forest, text classification, weighted voting, weighted feature selection, high dimensional feature

Related items

1	Random Forest In Application Of Text Classification
2	Research On Optimization Of Random Forest Algorithm And Its Application In Text Parallel Classification
3	Adaptive Weighted KNN Text Classification
4	Research On Random Forest Algorithm Based On Feature Selection And Diversity
5	Study On The Application Of Random Forests In Text Classification
6	Research On Feature Selection And Classification Method Based On Random Forest For Medical Datasets
7	Text detection in natural scenes through weighted majority voting of DCT high pass filters, line removal, and color consistency filtering
8	Evaluation Of Confounder-controlled Random Forest And Its Application In High Dimensional Data Analysis
9	Research On Feature Selection Method Based On Random Forest
10	Research Of Ensemble Learning For High-dimensional And Imbalanced Data Classification