Chinese Text Classification And Python Implementation Based On Naive Bayes

Posted on: 2019-10-31
Degree: Master
Type: Thesis
Country: China
Candidate: H Zhang
GTID: 2438330548455967
Subject: Applied Statistics
Abstract/Summary:
With the spread of computers and the rapid development of the Internet, new scientific and technological knowledge is emerging constantly, and the volume of available information is unprecedented. Information comes from an extremely broad range of sources and spreads extremely fast, so people urgently need ways to extract valuable information from this mass of data in a short time. Chinese text classification, a text data mining technique that combines statistical methods with machine learning, was developed to meet this need. It assigns each text to one of a set of user-defined categories based on attribute features of the content, such as topic words. In general, the classifier takes the feature vector of a text as input and outputs its predicted category.

This paper first introduces the research background of text classification, the state of research in China and abroad, and the practical value of the method. It then describes the theoretical workflow of Chinese text classification and the theory behind the naive Bayes classifier and the logistic regression classifier.

In the experimental stage, news data from five categories of the “Sogou Corpus” were selected, and the corpus was processed in Python, using the Anaconda distribution, following the theoretical workflow of Chinese text classification. First, word segmentation and stop-word removal were applied to the data set. Then TF-IDF was combined with N-Gram features to perform dimensionality reduction. A naive Bayes classifier and a logistic regression classifier were built to classify the Chinese texts. Cross-validation was used to obtain more reliable estimates of the classifiers' precision and recall. Finally, a search was carried out for each classifier's optimal parameters. The comparison showed that the naive Bayes classifier achieved better classification performance.
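
The naive Bayes step described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: it uses only the Python standard library, replaces word segmentation with character bigrams, and omits stop-word removal, TF-IDF weighting, cross-validation, and parameter search. The names `char_ngrams` and `NaiveBayes`, the four toy sentences, and the two labels are all assumptions made for illustration and do not come from the Sogou Corpus.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=2):
    """Return overlapping character n-grams (a stand-in for word segmentation)."""
    text = "".join(text.split())  # drop whitespace
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NaiveBayes:
    """Multinomial naive Bayes with add-alpha (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)          # documents per class
        self.feature_counts = defaultdict(Counter)   # n-gram counts per class
        self.vocab = set()
        for doc, label in zip(docs, labels):
            grams = char_ngrams(doc)
            self.feature_counts[label].update(grams)
            self.vocab.update(grams)
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + self.alpha * vocab_size
            # log prior + sum of smoothed log likelihoods
            score = math.log(n_docs / total_docs)
            for gram in char_ngrams(doc):
                if gram in self.vocab:  # ignore n-grams never seen in training
                    score += math.log((counts[gram] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data (hypothetical, for illustration only).
train_docs = [
    "球队赢得比赛冠军",
    "球员在比赛中进球",
    "股市今日大幅上涨",
    "银行下调贷款利率",
]
train_labels = ["sports", "sports", "finance", "finance"]

model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict("比赛进球"))
print(model.predict("股市利率"))
```

Scores are computed in log space so that multiplying many small probabilities does not underflow, and the smoothing term `alpha` keeps an n-gram that is absent from one class from driving that class's probability to zero.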
Keywords: Chinese text classification, TF-IDF, naive Bayes classifier, Python