
Research On News Classification Based On Improved Naive Bayes

Posted on: 2021-04-07  Degree: Master  Type: Thesis
Country: China  Candidate: X F Lai  Full Text: PDF
GTID: 2370330623481119  Subject: Statistics
Abstract/Summary:
With the rapid development of artificial intelligence and the continuous updating of data mining technology, text classification has become the most common application scenario in natural language processing, and it is widely used in public opinion analysis, machine translation, and chatbots. Many text classification techniques exist at this stage, but the Naive Bayes Classifier (NBC) has become one of the most commonly used classification models thanks to its solid mathematical foundation and its simple, efficient performance. The Naive Bayes classification model performs well in many fields, but it also has certain limitations, such as its reliance on the assumption that attributes are conditionally independent of each other, an assumption that is often difficult to satisfy in practice. To address this assumption, researchers have extended the model in four directions: structure extension, feature selection, feature weighting, and the combination of the Naive Bayes model with other models, and have achieved good results.

Based on previous research, this paper uses Principal Component Analysis (PCA) to improve the Naive Bayes classification model. The resulting Naive Bayes classification model based on principal component analysis is referred to as the PCA_WNBC model. In this paper, the principal components obtained from principal component analysis are mutually independent, which effectively alleviates the conditional independence assumption of Naive Bayes. The variance contribution rate of each principal component is then used as the feature weight of the corresponding attribute, which removes the defect that the same attribute takes the same weight for different categories (all weights equal to 1).

After the above analysis, this paper applies the PCA_WNBC model to news text classification. Using web crawler technology implemented in Python, news texts in ten categories were crawled from the Internet, with 1,200 articles per category, and the resulting 12,000 news texts are used as the training set. Subsets of 3,000, 6,000, 9,000, and 12,000 articles are randomly selected from the 12,000 articles as the horizontal dimension of the comparison, and NBC, PCA_WNBC, logistic regression, K-nearest neighbors, and support vector machine form the vertical dimension. The classification performance of each model on the different data sets is evaluated on four measures: accuracy, recall, F-value, and training time. The conclusions are as follows: on the different data sets, the accuracy of the PCA_WNBC model is about 5% higher than that of the NBC model; when the amount of data increases, the classification performance of the PCA_WNBC model is better than that of logistic regression, K-nearest neighbors, and support vector machines.
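The PCA_WNBC idea described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the abstract does not specify the likelihood model or the exact weighting formula, so the sketch assumes a Gaussian likelihood on each principal component and uses the variance contribution rate of each component as an exponent on that component's likelihood (a multiplier in log space). The names `fit_pca` and `PcaWeightedGaussianNB` are hypothetical, and the input `X` is assumed to be a dense numeric feature matrix (for example, document-term features that have already been extracted). Note that PCA strictly guarantees uncorrelated components, which is the sense in which it relaxes the independence assumption here.

```python
import numpy as np


def fit_pca(X, n_components):
    """Return (feature mean, projection matrix, variance contribution rates)."""
    mean = X.mean(axis=0)
    # SVD of the centered data: rows of vt are the principal directions.
    _, s, vt = np.linalg.svd(X - mean, full_matrices=False)
    variances = s ** 2 / (X.shape[0] - 1)
    ratios = variances / variances.sum()
    return mean, vt[:n_components].T, ratios[:n_components]


class PcaWeightedGaussianNB:
    """Sketch of a PCA-weighted Naive Bayes classifier (assumed Gaussian form)."""

    def __init__(self, n_components, var_floor=1e-9):
        self.n_components = n_components
        self.var_floor = var_floor  # avoids division by zero for constant components

    def _project(self, X):
        return (np.asarray(X, dtype=float) - self.mean_) @ self.components_

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        # Weights are the variance contribution rates of the kept components.
        self.mean_, self.components_, self.weights_ = fit_pca(X, self.n_components)
        Z = self._project(X)
        self.classes_ = np.unique(y)
        self.log_prior_ = np.log(np.array([(y == c).mean() for c in self.classes_]))
        # Per-class mean and variance of every principal component.
        self.mu_ = np.array([Z[y == c].mean(axis=0) for c in self.classes_])
        self.var_ = np.array([Z[y == c].var(axis=0) for c in self.classes_])
        self.var_ = self.var_ + self.var_floor
        return self

    def predict(self, X):
        Z = self._project(X)
        # Gaussian log-likelihood, shape (n_samples, n_classes, n_components).
        diff = Z[:, None, :] - self.mu_[None, :, :]
        log_lik = -0.5 * (np.log(2.0 * np.pi * self.var_)[None, :, :]
                          + diff ** 2 / self.var_[None, :, :])
        # Weighted Naive Bayes score: log prior + sum_j w_j * log p(z_j | class).
        scores = self.log_prior_[None, :] + (log_lik * self.weights_).sum(axis=2)
        return self.classes_[scores.argmax(axis=1)]
```

A model would be trained with `PcaWeightedGaussianNB(n_components=k).fit(X_train, y_train)` and applied with `.predict(X_test)`. Setting every weight to 1 recovers an unweighted Gaussian Naive Bayes on the principal components, which makes the effect of the weighting easy to compare in isolation.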
Keywords/Search Tags: Naive Bayes, Principal Component Analysis, Web Crawler, News Classification