
Research On English Text Classification Algorithm Based On Ensemble Learning

Posted on: 2019-11-25    Degree: Master    Type: Thesis
Country: China    Candidate: S Zhang    Full Text: PDF
GTID: 2428330548959135    Subject: Computer system architecture
Abstract/Summary:
With the advent of the information age, text has become the main carrier of information on the network. Organizing and managing this text not only allows the massive volume of online text to be classified and stored by topic, but also enables users to find the information they need efficiently and conveniently. To meet these requirements, this paper surveys existing text classification algorithms, analyzes the scenarios to which they apply, and optimizes them in order to improve classification accuracy.

The paper first systematically reviews the text classification process and summarizes the techniques involved in preprocessing, feature selection, similarity calculation, text representation, and classifier design. Several classifier models commonly used in text classification are introduced: Naive Bayes, support vector machines, KNN, and neural networks. The performance metrics used to evaluate classifiers are also described.

Ensemble learning, one of the most popular machine learning approaches at present, improves the generalization performance and classification accuracy of a classifier by training multiple well-performing base classifiers on the training set and combining their outputs. It is widely used for classification, regression, feature selection, and outlier detection. According to the dependency relationships among the base classifiers, ensemble methods can be roughly divided into two kinds: sequential methods and parallel methods. This paper systematically studies the theoretical basis of the parallel method bagging and the sequential method boosting, and introduces the bagging extension "random forest" in detail.

The innovations of this paper address deficiencies of the random forest algorithm. The random forest algorithm is optimized by assigning each decision tree in the forest a weight based on its out-of-bag error rate, that is, on its individual classification performance, yielding the weighted random forest algorithm OOB-WRF. The paper also introduces the principle of AdaBoost, the most popular boosting algorithm, and proposes Ada-NB, a boosting algorithm that uses a Naive Bayes classifier as the base classifier; its advantages for text classification are described. To correct the bias of the Naive Bayes base classifier, a Bayesian text classification algorithm that rectifies class word frequencies, RCF-NB, is proposed. Finally, Ada-RCFNB, which combines the class-word-frequency correction with adaptive boosting, is proposed.

To verify the effectiveness of the three optimized algorithms proposed in this paper (OOB-WRF, Ada-NB, and Ada-RCFNB), experiments were conducted on the English Newsgroups corpus. The results show that OOB-WRF achieves better classification accuracy and F1 score than the traditional random forest algorithm. A comparison of the Naive Bayes, Ada-NB, and Ada-RCFNB classifiers shows that Ada-NB improves the accuracy of the Bayesian classifier, and that Ada-RCFNB achieves better classification accuracy than Ada-NB, further improving the accuracy of the Bayesian classifier.
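To make the weighted-voting idea behind OOB-WRF concrete, the sketch below weights each tree by its out-of-bag accuracy (one minus its OOB error rate) and combines the trees by weighted voting. This is a minimal illustration under stated assumptions, not the thesis's implementation: it assumes scikit-learn (1.2 or later, for the `estimator` keyword), uses `BaggingClassifier` over `DecisionTreeClassifier(max_features="sqrt")` as a random-forest stand-in so that per-tree bootstrap indices are exposed, and uses the public 20 Newsgroups corpus in place of the thesis's English Newsgroups data.

```python
# Illustrative sketch of OOB-weighted voting (OOB-WRF style); not the thesis's own code.
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

categories = ["sci.space", "rec.autos", "talk.politics.misc"]
train = fetch_20newsgroups(subset="train", categories=categories)
test = fetch_20newsgroups(subset="test", categories=categories)

vec = TfidfVectorizer(stop_words="english", max_features=20000)
X_train, X_test = vec.fit_transform(train.data), vec.transform(test.data)
y_train, y_test = train.target, test.target

# Bagged trees with per-split feature sampling behave like a random forest,
# and BaggingClassifier exposes the bootstrap indices drawn for every tree.
forest = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_features="sqrt"),
    n_estimators=100, bootstrap=True, random_state=0)
forest.fit(X_train, y_train)

# Weight each tree by its out-of-bag accuracy (i.e. 1 - OOB error rate).
n_samples = X_train.shape[0]
weights = []
for tree, sampled in zip(forest.estimators_, forest.estimators_samples_):
    oob = np.ones(n_samples, dtype=bool)
    oob[sampled] = False                      # rows this tree never saw
    idx = np.flatnonzero(oob)
    weights.append(tree.score(X_train[idx], y_train[idx]) if idx.size else 0.0)
weights = np.asarray(weights)
weights /= weights.sum()

# Weighted voting over the trees' class-probability estimates
# (assumes every bootstrap sample contained all classes, which holds at this corpus size).
proba = sum(w * tree.predict_proba(X_test) for w, tree in zip(weights, forest.estimators_))
pred = forest.classes_[proba.argmax(axis=1)]

print("OOB-weighted forest accuracy:", (pred == y_test).mean())
print("Plain majority-vote accuracy:", forest.score(X_test, y_test))
```

In the same spirit, an Ada-NB style baseline can be sketched by boosting a Naive Bayes learner directly; `MultinomialNB` here stands in for the thesis's Naive Bayes text classifier, and no class-word-frequency (RCF) correction is attempted.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.naive_bayes import MultinomialNB

# AdaBoost over a Naive Bayes base learner (an Ada-NB style baseline, without RCF).
ada_nb = AdaBoostClassifier(estimator=MultinomialNB(), n_estimators=50, random_state=0)
ada_nb.fit(X_train, y_train)
print("Ada-NB style accuracy:", ada_nb.score(X_test, y_test))
```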
Keywords/Search Tags: English Text Classification, Ensemble Learning, Random Forest, AdaBoost, Naive Bayes