Research And Improvement Of Text Classification

Posted on: 2023-06-21
Degree: Master
Type: Thesis
Country: China
Candidate: J Hu
Full Text: PDF
GTID: 2568306836476284
Subject: Electronic and communication engineering
Abstract/Summary:
Thanks to the rapid rise of the internet and self-publishing, everyone can be a content producer, and the volume of documents of all kinds has exploded. There is no shortage of information sources, but it is becoming increasingly difficult to find the information we need. To make navigation and search easier for users, text needs to be classified appropriately. Feature selection and feature weighting are two integral parts of text classification, so finding effective methods for both is of great theoretical and practical importance. This thesis investigates the relevant algorithms, and its main research contents are as follows.

(1) Improvement of the chi-square statistic calculation method. The feature selection module is an important part of text classification: it takes a high-dimensional feature vector as input and outputs a subset of features that are representative of the category. The traditional chi-square algorithm has three shortcomings: it exaggerates the role of low-frequency words, it does not consider the distribution of feature words within the category, and it does not consider feature words that are negatively correlated with the category. To address these, this thesis proposes an improved chi-square calculation method, IMP_CHI, which introduces a word frequency adjustment parameter, an intra-category position parameter and a negative correlation correction factor. The experimental results show that IMP_CHI achieves higher accuracy, recall and F1 value than the traditional chi-square calculation method CHI.

(2) Improvement of the TF-IDF algorithm. The feature weighting module takes the feature subset produced by the feature selection module as input and assigns different weights to the feature words according to their contribution to the text. The traditional TF-IDF algorithm has two shortcomings: it considers only the number of documents that contain a feature word rather than the number of occurrences of the word, and it considers only the distribution of feature words within a class, ignoring the distribution between classes. To address these, this thesis proposes an improved TF-IDF algorithm that adopts the TF-IWF algorithm and introduces an inter-class weighting factor. The improved TF-IDF algorithm is then combined with the improved chi-square algorithm IMP_CHI proposed in this thesis, and a support vector machine is used to complete the text classification. The experimental results show that the improved TF-IDF algorithm achieves higher accuracy, recall and F1 value than the original TF-IDF algorithm.
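For orientation, the two baselines that the abstract sets out to improve can be sketched in a few lines of Python. This is only an illustrative sketch of the standard textbook chi-square statistic and of commonly used IDF and IWF definitions; it does not reproduce IMP_CHI or the thesis's improved TF-IDF, whose formulas are not given here. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def chi_square(docs, labels, term, category):
    """Textbook chi-square score of `term` for `category` (baseline CHI, not IMP_CHI)."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_cat = label == category
        if has_term and in_cat:
            a += 1  # in the category, contains the term
        elif has_term:
            b += 1  # outside the category, contains the term
        elif in_cat:
            c += 1  # in the category, lacks the term
        else:
            d += 1  # outside the category, lacks the term
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def tf(tokens, term):
    """Relative frequency of `term` within one document."""
    return tokens.count(term) / len(tokens) if tokens else 0.0


def idf(docs, term):
    """Inverse document frequency: based on how many documents contain `term`."""
    df = sum(1 for tokens in docs if term in tokens)
    return math.log(len(docs) / df) if df else 0.0


def iwf(docs, term):
    """Inverse word frequency: based on how often `term` occurs in the whole corpus."""
    counts = Counter(tok for tokens in docs for tok in tokens)
    total = sum(counts.values())
    return math.log(total / counts[term]) if counts[term] else 0.0


# Hypothetical toy corpus: two "sport" documents and two "finance" documents.
docs = [
    ["goal", "match", "team"],
    ["team", "goal", "goal"],
    ["stock", "market", "team"],
    ["market", "price", "stock"],
]
labels = ["sport", "sport", "finance", "finance"]

if __name__ == "__main__":
    for term in ("goal", "team"):
        print(term, "chi-square vs sport:", chi_square(docs, labels, term, "sport"))
    print("goal chi-square vs finance:", chi_square(docs, labels, "goal", "finance"))
    print("TF-IDF of goal in doc 1:", tf(docs[1], "goal") * idf(docs, "goal"))
    print("TF-IWF of goal in doc 1:", tf(docs[1], "goal") * iwf(docs, "goal"))
```

Because the term `(a * d - b * c)` is squared, a word that never appears in a category scores exactly as high for that category as it does for the category it always appears in. This is the negative-correlation shortcoming that the abstract mentions. The IDF and IWF functions differ only in what they count: documents containing the word versus occurrences of the word.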
Keywords/Search Tags: Chi-Square Statistic, TF-IDF, Feature Selection, Feature Weighting, Text Classification