Research And Application In Text Classification Based On Random Forest

Posted on: 2019-12-19
Degree: Master
Type: Thesis
Country: China
Candidate: Q L Zhang
GTID: 2428330566487287
Subject: Engineering
Abstract/Summary:
With the rapid development of computer technology and the spread of the Internet, people have created a large amount of information and data on digital networks. This information is growing explosively, and we are entering the era of big data. Faced with such a volume of data, people urgently need effective methods for organizing and managing it, for knowledge discovery, and for mining implicit associations. Text classification is a key part of solving these problems.

Random forest is an ensemble learning method proposed by Breiman in 2001. It builds an ensemble classifier by combining multiple decision trees; its basic ideas are bagging and the random subspace method. Compared with other classification algorithms, random forest has high classification accuracy, resists overfitting, tolerates noise and outliers well, and is easy to parallelize. Because of these advantages, random forest has been widely used and has achieved good results in text classification.

However, the traditional random forest algorithm does not distinguish between its base classifiers according to their classification performance, even though that performance is uneven. With equal-weight voting, the negative influence of weak base classifiers cannot be reduced and the positive influence of strong base classifiers cannot be enhanced, which lowers the overall classification performance of the random forest. Random forest also performs poorly on multi-class data with high-dimensional features. Selecting features with equal probability hurts classification performance more when the feature subset is relatively small: when the size of the feature subset is limited, it is more difficult to select discriminative features, so the classification strength of the base classifiers is much lower, which in turn affects the generalization ability of the random forest.

To address these two problems, this thesis proposes a random forest algorithm based on weighted voting and weighted feature selection. In the voting phase, it increases the influence of highly reliable base classifiers and reduces the influence of unreliable ones. In the feature selection phase, discriminative features have a greater probability of being selected into the feature subset. Experiments show that the proposed algorithm achieves better classification performance than other random forest algorithms and other commonly used classification algorithms, but its disadvantage is a relatively long running time.
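
The two ideas named in the abstract, weighted voting and weighted feature selection, can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not say how the tree weights or feature scores are computed, so the sketch assumes each tree already has a reliability weight (for example, an out-of-bag accuracy) and each feature already has a positive relevance score (for example, a chi-square value). The function names `weighted_vote` and `weighted_feature_subset` are hypothetical.

```python
import random
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Return the label with the largest total weight.

    predictions: one predicted label per tree.
    weights: one non-negative reliability weight per tree (assumed here to
    be something like out-of-bag accuracy; the abstract does not specify).
    """
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


def weighted_feature_subset(scores, k, rng):
    """Sample k distinct feature indices, favouring high-scoring features.

    scores: one positive relevance score per feature (assumed to be
    computed beforehand, e.g. chi-square or information gain).
    rng: a random.Random instance, passed in for reproducibility.
    """
    if k > len(scores):
        raise ValueError("k cannot exceed the number of features")
    candidates = list(range(len(scores)))
    chosen = []
    for _ in range(k):
        # Draw one index with probability proportional to its score,
        # then remove it so the subset has no duplicates.
        pool_weights = [scores[i] for i in candidates]
        pick = rng.choices(candidates, weights=pool_weights, k=1)[0]
        chosen.append(pick)
        candidates.remove(pick)
    return chosen


# One reliable tree outvotes two unreliable ones (0.9 > 0.4 + 0.3).
print(weighted_vote(["sports", "politics", "politics"], [0.9, 0.4, 0.3]))  # sports

# Features 0 and 2 have the highest scores, so they are the most likely picks.
print(weighted_feature_subset([5.0, 0.1, 3.0, 0.1], k=2, rng=random.Random(0)))
```

With equal weights the first call would return the simple majority label, which is the behaviour of the traditional algorithm that the abstract criticizes. Repeating the weighted draw for every tree is extra work compared with uniform sampling, which may be one source of the longer running time the abstract reports.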
Keywords: random forest, text classification, weighted voting, weighted feature selection, high-dimensional features