
Research On Chinese Information Classification Based On Improved Bayesian Algorithms

Posted on: 2020-06-29
Degree: Master
Type: Thesis
Country: China
Candidate: X M Song
Full Text: PDF
GTID: 2428330572472202
Subject: Electronic and communication engineering
Abstract/Summary:
With the rapid development of the Internet, thousands of new texts appear online. Most of this data is stored as text, and it grows exponentially, which could lead to an information explosion. To manage such a large amount of text, the text classification problem urgently needs to be solved. Text classification based on naive Bayes, however, relies on the conditional independence assumption, which is inconsistent with reality. Among the many proposals to improve its accuracy by weakening the feature independence assumption, the feature weighting approach has received less attention from researchers. Moreover, existing feature weighting approaches incorporate the learned feature weights only into the classification formula of naive Bayes, and not into its conditional probability formula. Therefore, from the perspective of feature weighting, this paper proposes a Bayesian algorithm based on the term frequency-inverse document frequency (TF-IDF) feature weight and the rank factor feature weight, and applies it to Chinese text classification, where it can help manage large and complex data, help people find information quickly, and save time.

The main research contents of this paper are as follows:

(1) Naive Bayes, KNN and support vector machines are compared on text classification. The experimental results show that, of the three, naive Bayes performs best for Chinese text classification.

(2) This paper proposes a naive Bayes algorithm based on the TF-IDF feature weight and the rank factor feature weight, called the feature weighting naive Bayes algorithm. The algorithm incorporates TF-IDF into the conditional probability formula of naive Bayes, and then introduces the rank factor feature weight, which is determined by TF-IDF, into the Bayesian formula. This greatly weakens the influence of the feature independence assumption.

(3) The feature weighting naive Bayes algorithm is applied to Chinese text classification. Because corpora on the network are varied and complex, there is so far no corpus that can be used consistently for Chinese text categorization, so this paper constructs a Chinese text corpus according to a set of screening rules. Experiments show that the accuracy of the feature weighting naive Bayes algorithm in text classification is higher than that of the standard naive Bayes algorithm, which indicates that the proposed algorithm is a more effective and accurate text classification algorithm.
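The general idea of feature weighting in naive Bayes can be illustrated with a minimal sketch. It is not the thesis's algorithm: the abstract does not give the exact formulas or define the rank factor, so the sketch simply assumes that a smoothed IDF weight is used in two places, once to weight the term counts behind the conditional probabilities and once as a multiplier on each term's log-likelihood in the classification score. The names `idf_weights` and `WeightedNaiveBayes`, the smoothing constants, and the toy documents are all hypothetical, and the input is assumed to be already segmented into words.

```python
import math
from collections import Counter, defaultdict


def idf_weights(docs):
    """Smoothed IDF per term: ln((1 + N) / (1 + df)) + 1 (an assumed variant)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


class WeightedNaiveBayes:
    """Illustrative feature-weighted multinomial naive Bayes (hypothetical)."""

    def fit(self, docs, labels):
        self.idf = idf_weights(docs)
        self.vocab_size = len(self.idf)
        self.class_counts = Counter(labels)
        self.n_docs = len(docs)
        # Accumulate TF-IDF mass per (class, term) instead of raw counts,
        # so the weight enters the conditional probability estimate.
        self.mass = {c: defaultdict(float) for c in self.class_counts}
        for doc, label in zip(docs, labels):
            for term, tf in Counter(doc).items():
                self.mass[label][term] += tf * self.idf[term]
        self.total = {c: sum(m.values()) for c, m in self.mass.items()}
        return self

    def log_cond(self, term, label):
        # Laplace-smoothed P(term | class) over the weighted mass.
        num = self.mass[label].get(term, 0.0) + 1.0
        return math.log(num / (self.total[label] + self.vocab_size))

    def predict(self, doc):
        scores = {}
        for label, count in self.class_counts.items():
            score = math.log(count / self.n_docs)  # log prior
            for term, tf in Counter(doc).items():
                if term in self.idf:  # terms unseen in training are skipped
                    # The weight also scales the term's log-likelihood.
                    score += tf * self.idf[term] * self.log_cond(term, label)
            scores[label] = score
        return max(scores, key=scores.get)


# Toy, pre-segmented training documents (made up for illustration).
train_docs = [
    ["股票", "市场", "上涨"],
    ["银行", "利率", "市场"],
    ["球队", "比赛", "胜利"],
    ["球员", "比赛", "进球"],
]
train_labels = ["finance", "finance", "sports", "sports"]

model = WeightedNaiveBayes().fit(train_docs, train_labels)
print(model.predict(["市场", "利率"]))  # finance
print(model.predict(["比赛", "进球"]))  # sports
```

Setting every weight to 1 reduces the sketch to standard multinomial naive Bayes with Laplace smoothing, which gives a baseline to compare against.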
Keywords/Search Tags: naive Bayes, feature weighting, Chinese text classification, term frequency-inverse document frequency