
Random Forest In Application Of Text Classification

Posted on: 2016-02-11
Degree: Master
Type: Thesis
Country: China
Candidate: J He
Full Text: PDF
GTID: 2308330479994835
Subject: Software engineering
Abstract/Summary:
Random forest is an ensemble classification algorithm that uses decision trees as base classifiers and combines the Bagging algorithm with the random subspace method. Since its introduction, random forest has been applied to many kinds of classification problems, including text classification, an important part of information retrieval.

This dissertation introduces and analyzes the traditional random forest algorithm and identifies the following shortcomings:
1: It cannot strengthen the influence of decision trees with excellent classification performance on the classification result, nor weaken the influence of decision trees with poor classification performance.
2: It has no effective scheme to prevent or handle the situation in which several classes receive the same highest number of votes, which makes it difficult to choose the final classification result. This dissertation calls this situation the "Draw Phenomenon".

To address these shortcomings, this dissertation improves the traditional random forest algorithm as follows:
1: The voting method used in classification is changed from simple majority voting to weighted voting, in which the weight of each decision tree in the forest is positively correlated with its out-of-bag accuracy.
2: Schemes are added to prevent and handle the draw phenomenon. The prevention scheme increases the precision of the decision tree weights. The handling scheme, when a draw occurs, computes the classification performance of every voter (decision tree) that voted for one of the classes tied for the highest number of votes, and takes the class chosen by the decision tree with the highest classification performance as the final classification result.

To validate the effectiveness and feasibility of the improved random forest algorithm, experiments were run on English and Chinese corpora created by the "Automation Discipline Innovation Thought and Scientific Method" research group at the Institute of Automation, Chinese Academy of Sciences. First, text categorization experiments compared the improved and traditional random forest algorithms on the English and Chinese corpora; the results show that the improved algorithm achieves better accuracy and F1 value than the traditional algorithm. Second, text categorization experiments compared the improved random forest algorithm with the C4.5, naive Bayes, and k-Nearest-Neighbors algorithms on the same corpora; the results show that the improved algorithm achieves better accuracy and F1 value than these algorithms. These experiments indicate that the improved random forest algorithm proposed in this dissertation is effective and feasible.
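
The weighted voting and draw handling summarized in the abstract can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the abstract only says that tree weights are positively correlated with out-of-bag accuracy and that a draw is resolved by the best-performing voter, so using the out-of-bag accuracy directly as both the weight and the performance measure is an assumption made here. The function name `weighted_vote` and its parameters are hypothetical.

```python
import math
from collections import defaultdict


def weighted_vote(predictions, oob_accuracies):
    """Combine per-tree predictions into one class label.

    predictions[i]    -- class label predicted by tree i
    oob_accuracies[i] -- out-of-bag accuracy of tree i, used directly
                         as its voting weight (an assumption)
    """
    # Weighted voting: each tree adds its weight to the class it chose.
    totals = defaultdict(float)
    for label, weight in zip(predictions, oob_accuracies):
        totals[label] += weight

    # Collect every class whose total matches the highest total.
    best_total = max(totals.values())
    tied = [label for label, total in totals.items()
            if math.isclose(total, best_total)]
    if len(tied) == 1:
        return tied[0]

    # Draw: among the trees that voted for a tied class, follow the one
    # with the highest out-of-bag accuracy (assumed performance measure).
    voters = [i for i, label in enumerate(predictions) if label in tied]
    best_tree = max(voters, key=lambda i: oob_accuracies[i])
    return predictions[best_tree]


# Two weak trees say "sports", one strong tree says "politics".
# Simple majority voting would pick "sports"; weighted voting does not.
print(weighted_vote(["sports", "sports", "politics"], [0.3, 0.3, 0.9]))
```

In this sketch a draw can only occur when the weighted totals coincide, which is why more precise weights make draws rarer, in line with the prevention scheme the abstract describes.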
Keywords/Search Tags: Random Forest, Text Classification, Weighted Voting Method, Draw Phenomenon