
Research on News Text Classification Based on Feature Selection Methods

Posted on: 2020-04-30
Degree: Master
Type: Thesis
Country: China
Candidate: Q Q Xiao
Full Text: PDF
GTID: 2428330578973085
Subject: Applied Statistics
Abstract/Summary:
At present, with the continuing popularization of computer technology and the rapid development of the Internet, people can access a wide variety of information about different industries online, and most of these information resources appear in the form of text. However, these messages are mixed together, so obtaining valuable information in a short time has become increasingly important. To meet users' demands, Chinese text classification methods came into being. The idea of text classification is to classify texts by combining statistical methods with machine learning: according to the characteristics of the text content, the technology assigns texts to categories predefined by users, helping people find the information they need quickly and efficiently.

In this paper, after word segmentation and stop-word removal, we obtain a vocabulary of more than 50,000 words. Such a large number of words produces a high-dimensional vector space, which degrades both the performance and the efficiency of the classification algorithm. It is therefore necessary to select feature words, that is, to select the words that have the greatest influence on classification performance.

The paper improves two feature selection methods. The first is an improvement of the chi-square (CHI) statistical feature selection method. The classic method only considers the number of texts in which a feature word appears, not the frequency of the feature word; for high-frequency words, it also fails to account for words that are common across the whole text set. The paper therefore improves CHI by introducing the TF-IDF weight of each feature word.

The second is a study of XGBoost importance evaluation. This idea is commonly used in the field of risk control to explain the importance of attributes and to select them. In risk control, the model is trained by iteratively selecting attributes, but in text classification there are a great number of feature words, so they cannot be selected one by one. In response to this deficiency, the paper proposes an XGBoost feature selection method suitable for text classification. To address the low efficiency of this method, the paper proposes using word weights to pre-select features before applying XGBoost. For the calculation of the importance value, the paper uses the number of times a feature word is selected as the optimal split attribute across all trees.

The paper uses macro-average F1 score, accuracy and other indicators to comprehensively analyze the classification results. The SVM, Naive Bayes and neural network algorithms are used to train and test the models, which demonstrates the feasibility of the two methods, CHI-TFW and XG-TI.
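The CHI-TFW idea described above can be sketched as follows. This is a minimal illustration, not the thesis's actual implementation: the toy corpus, class labels, and function names are hypothetical, and the improved score is formed here by simply multiplying the classic chi-square statistic by the term's mean TF-IDF weight.

```python
import math

# Hypothetical toy corpus: each document is (class_label, token list).
docs = [
    ("sports", ["match", "team", "win"]),
    ("sports", ["team", "score", "match"]),
    ("finance", ["stock", "market", "win"]),
    ("finance", ["stock", "bank", "market"]),
]

def chi_square(term, label):
    """Classic CHI statistic chi2(t, c) from the 2x2 contingency table.

    Counts only document occurrence, not term frequency -- the weakness
    the CHI-TFW variant is meant to address.
    """
    A = sum(1 for c, toks in docs if c == label and term in toks)
    B = sum(1 for c, toks in docs if c != label and term in toks)
    C = sum(1 for c, toks in docs if c == label and term not in toks)
    D = sum(1 for c, toks in docs if c != label and term not in toks)
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - B * C) ** 2 / denom if denom else 0.0

def tf_idf_weight(term):
    """Mean TF-IDF of the term over the documents that contain it."""
    df = sum(1 for _, toks in docs if term in toks)
    if df == 0:
        return 0.0
    idf = math.log(len(docs) / df)
    tfs = [toks.count(term) / len(toks) for _, toks in docs if term in toks]
    return (sum(tfs) / len(tfs)) * idf

def chi_tfw(term, label):
    """CHI weighted by the term's TF-IDF, so frequency now matters."""
    return chi_square(term, label) * tf_idf_weight(term)

# "stock" occurs only in finance documents; "win" occurs in both classes,
# so the weighted score ranks "stock" above "win" for the finance class.
print(chi_tfw("stock", "finance") > chi_tfw("win", "finance"))
```

Feature selection then keeps the top-k terms by their maximum CHI-TFW score over all classes.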
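The XGBoost-importance criterion described above — counting how many times a feature word is chosen as the optimal split attribute across all trees — can be sketched in isolation. The tree dump below is fabricated for illustration; with a real trained model, the equivalent counts would come from the booster's own importance report.

```python
from collections import Counter

# Hypothetical dump of a trained model: each tree is listed as the
# feature indices used at its internal split nodes (leaves omitted).
trees = [
    [3, 7, 3],      # tree 0 splits on features 3, 7, 3
    [7, 1],         # tree 1
    [3, 5, 7, 7],   # tree 2
]

def split_count_importance(trees):
    """Importance of a feature word = number of times it is selected
    as the optimal split attribute, summed over all trees."""
    counts = Counter()
    for tree in trees:
        counts.update(tree)
    return counts

def select_top_k(trees, k):
    """Keep the k feature words with the highest split counts."""
    counts = split_count_importance(trees)
    return [feat for feat, _ in counts.most_common(k)]

print(select_top_k(trees, 2))  # [7, 3]: used 4 and 3 times respectively
```

The pre-selection step the abstract mentions would run before this: rank the vocabulary by a cheap word weight (e.g. TF-IDF), keep only the top slice, and train the XGBoost model on that reduced feature set so the split counting stays tractable.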
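The macro-average F1 score used to evaluate the classifiers averages per-class F1 scores with equal weight, so small categories count as much as large ones. A minimal sketch with made-up labels:

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: compute precision/recall/F1 per class,
    then average the per-class F1 scores with equal weight."""
    labels = set(y_true) | set(y_pred)
    f1s = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Hypothetical predictions: one sports document misclassified as finance.
y_true = ["sports", "sports", "finance", "finance"]
y_pred = ["sports", "finance", "finance", "finance"]
print(round(macro_f1(y_true, y_pred), 3))  # 0.733
```

Accuracy alone can hide poor performance on rare news categories; the macro average exposes it, which is presumably why the thesis reports both.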
Keywords/Search Tags: Chinese text classification, SVM, Naive Bayes, Neural Networks, XGBoost, Feature Selection, Chi-square Statistic