
Research on News Text Classification Based on Feature Selection Methods

Posted on: 2020-04-30
Degree: Master
Type: Thesis
Country: China
Candidate: Q Q Xiao
Full Text: PDF
GTID: 2428330578973085
Subject: Applied Statistics
Abstract/Summary:
At present, with the continuing popularization of computer technology and the rapid development of the Internet, people can access a wide variety of information about different industries online, and most of these information resources appear in the form of text. However, these messages are mixed together, so obtaining valuable information in a short time has become increasingly important. To meet users' demands, Chinese text classification methods came into being. The idea of text classification is to classify texts by combining statistical methods with machine learning: according to the characteristics of the text content, the technology assigns texts to categories predefined by users, helping people find the information they need quickly and efficiently.

In this paper, after word segmentation and stop-word removal, we obtain a vocabulary of more than 50,000 words. Such a large number of words produces a high-dimensional vector space, which degrades both the performance and the efficiency of the classification algorithm. It is therefore necessary to select feature words, that is, to select the words that have the greatest influence on classification performance.

The paper improves two feature selection methods. The first is an improvement of the chi-square (CHI) statistical feature selection method. The classic method only considers the number of texts in which a feature word appears, not the frequency of the feature word; for high-frequency words, it also fails to account for words that are common across the whole text set. The paper therefore improves CHI by introducing the TF-IDF weight of each feature word.

The second is a study of XGBoost importance evaluation. This idea is commonly used in the field of risk control to explain the importance of attributes and to select them. In risk control, the model is trained by iteratively selecting attributes, but in text classification there are a great number of feature words, so they cannot be selected one by one. In response to this deficiency, the paper proposes an XGBoost feature selection method suitable for text classification. To address the low efficiency of this method, the paper proposes using word weights to pre-select features before applying XGBoost. For the calculation of the importance value, the paper uses the number of times a feature word is selected as the optimal split attribute across all trees.

The paper uses macro-average F1 score, accuracy and other indicators to comprehensively analyze the classification results. The SVM, Naive Bayes and neural network algorithms are used to train and test the models, which demonstrates the feasibility of the two methods, CHI-TFW and XG-TI.
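The CHI-TFW idea described above can be sketched as follows. This is a minimal illustration, not the thesis's actual implementation: the toy corpus, class labels, and function names are hypothetical, and the improved score is formed here by simply multiplying the classic chi-square statistic by the term's mean TF-IDF weight.

```python
import math

# Hypothetical toy corpus: each document is (class_label, token list).
docs = [
    ("sports", ["match", "team", "win"]),
    ("sports", ["team", "score", "match"]),
    ("finance", ["stock", "market", "win"]),
    ("finance", ["stock", "bank", "market"]),
]

def chi_square(term, label):
    """Classic CHI statistic chi2(t, c) from the 2x2 contingency table.

    Counts only document occurrence, not term frequency -- the weakness
    the CHI-TFW variant is meant to address.
    """
    A = sum(1 for c, toks in docs if c == label and term in toks)
    B = sum(1 for c, toks in docs if c != label and term in toks)
    C = sum(1 for c, toks in docs if c == label and term not in toks)
    D = sum(1 for c, toks in docs if c != label and term not in toks)
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - B * C) ** 2 / denom if denom else 0.0

def tf_idf_weight(term):
    """Mean TF-IDF of the term over the documents that contain it."""
    df = sum(1 for _, toks in docs if term in toks)
    if df == 0:
        return 0.0
    idf = math.log(len(docs) / df)
    tfs = [toks.count(term) / len(toks) for _, toks in docs if term in toks]
    return (sum(tfs) / len(tfs)) * idf

def chi_tfw(term, label):
    """CHI weighted by the term's TF-IDF, so frequency now matters."""
    return chi_square(term, label) * tf_idf_weight(term)

# "stock" occurs only in finance documents; "win" occurs in both classes,
# so the weighted score ranks "stock" above "win" for the finance class.
print(chi_tfw("stock", "finance") > chi_tfw("win", "finance"))
```

Feature selection then keeps the top-k terms by their maximum CHI-TFW score over all classes.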
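The XGBoost-importance criterion described above — counting how many times a feature word is chosen as the optimal split attribute across all trees — can be sketched in isolation. The tree dump below is fabricated for illustration; with a real trained model, the equivalent counts would come from the booster's own importance report.

```python
from collections import Counter

# Hypothetical dump of a trained model: each tree is listed as the
# feature indices used at its internal split nodes (leaves omitted).
trees = [
    [3, 7, 3],      # tree 0 splits on features 3, 7, 3
    [7, 1],         # tree 1
    [3, 5, 7, 7],   # tree 2
]

def split_count_importance(trees):
    """Importance of a feature word = number of times it is selected
    as the optimal split attribute, summed over all trees."""
    counts = Counter()
    for tree in trees:
        counts.update(tree)
    return counts

def select_top_k(trees, k):
    """Keep the k feature words with the highest split counts."""
    counts = split_count_importance(trees)
    return [feat for feat, _ in counts.most_common(k)]

print(select_top_k(trees, 2))  # [7, 3]: used 4 and 3 times respectively
```

The pre-selection step the abstract mentions would run before this: rank the vocabulary by a cheap word weight (e.g. TF-IDF), keep only the top slice, and train the XGBoost model on that reduced feature set so the split counting stays tractable.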
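The macro-average F1 score used to evaluate the classifiers averages per-class F1 scores with equal weight, so small categories count as much as large ones. A minimal sketch with made-up labels:

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: compute precision/recall/F1 per class,
    then average the per-class F1 scores with equal weight."""
    labels = set(y_true) | set(y_pred)
    f1s = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Hypothetical predictions: one sports document misclassified as finance.
y_true = ["sports", "sports", "finance", "finance"]
y_pred = ["sports", "finance", "finance", "finance"]
print(round(macro_f1(y_true, y_pred), 3))  # 0.733
```

Accuracy alone can hide poor performance on rare news categories; the macro average exposes it, which is presumably why the thesis reports both.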
Keywords/Search Tags: Chinese text classification, SVM, Naive Bayes, Neural Networks, XGBoost, Feature Selection, Chi-square Statistic