Research On Improvement Of Feature Engineering Algorithm Based On Text Classification

Posted on: 2023-01-01
Degree: Master
Type: Thesis
Country: China
Candidate: H J Yin
GTID: 2558307070973689
Subject: Applied statistics
Abstract/Summary:
High-precision text classification is an effective way to manage massive text data. Well-designed text feature selection and feature weighting algorithms help computers efficiently identify text content and category information. Building on a study of filter-based feature selection algorithms and the TFIDF feature weighting algorithm, this thesis improves three feature selection algorithms (the chi-square statistic, information gain, and expected cross entropy) as well as the TFIDF feature weighting algorithm. First, to further improve classification accuracy, the thesis addresses the "low-frequency word defect" of the chi-square statistic and the fact that information gain and expected cross entropy do not take word frequency information into account. From a statistical point of view, the multi-category text distribution of a feature, the dispersion coefficient of its frequency distribution, and a feature word length factor are combined to improve the chi-square statistic, information gain, and expected cross entropy. Second, because the TFIDF weighting algorithm does not consider the distribution of the texts that contain a feature, the weights it assigns to features in the text set are biased; the multi-category text distribution of the feature is therefore introduced into the TFIDF weight calculation. Finally, the public Fudan text set and a crawled news text set are used as experimental data sets for empirical analysis. The TextRank algorithm is used to pre-extract features from each individual text, and the top 25% of the features in each text are kept as candidate features, which improves the computational efficiency of the feature selection process. The improved methods are then used for feature selection and feature weighting on the two text sets, with macro-averaged precision, macro-averaged recall, and macro-averaged F1 as evaluation indicators. The experiments show that although the improvement on the crawled text set is not obvious, all evaluation indicators there exceed 96%, so accuracy is already high; on the Fudan text set, every evaluation indicator is higher after the improvement than before, indicating that the algorithm improvements are feasible. Comparing the results on the two text sets and the differences between them further verifies that the proposed method is more effective for classifying sets of longer texts.
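
To make the starting point concrete, the following is a minimal sketch of the standard, unimproved chi-square statistic and TFIDF weight that the thesis takes as baselines. It does not reproduce the thesis's improved formulas, which the abstract does not state. The function names, the toy corpus, and the particular TFIDF variant (relative term frequency times the natural log of the inverse document frequency) are assumptions made for illustration only.

```python
import math


def chi_square(docs, labels, term, category):
    """Standard chi-square statistic between a term and a category.

    Only document-level presence is counted, so how often the term occurs
    inside a document has no effect on the score.
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_category = label == category
        if has_term and in_category:
            a += 1  # in category, contains term
        elif has_term:
            b += 1  # other category, contains term
        elif in_category:
            c += 1  # in category, lacks term
        else:
            d += 1  # other category, lacks term
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denominator


def tf_idf(docs, term, doc_index):
    """One common TFIDF variant: relative term frequency times log(N / df)."""
    tokens = docs[doc_index]
    if not tokens:
        return 0.0
    tf = tokens.count(term) / len(tokens)
    df = sum(1 for doc in docs if term in doc)
    if df == 0:
        return 0.0
    return tf * math.log(len(docs) / df)


# Hypothetical toy corpus: two "sports" texts and two "finance" texts.
toy_docs = [
    ["goal", "match", "goal", "goal"],
    ["match", "team"],
    ["stock", "market"],
    ["market", "price", "stock"],
]
toy_labels = ["sports", "sports", "finance", "finance"]

# Same corpus, but "goal" now occurs only once in the first text.
toy_docs_single = [["goal", "match"]] + toy_docs[1:]

score_repeated = chi_square(toy_docs, toy_labels, "goal", "sports")
score_single = chi_square(toy_docs_single, toy_labels, "goal", "sports")
print(score_repeated == score_single)  # frequency inside a text is ignored
print(tf_idf(toy_docs, "goal", 0))
```

In this toy corpus the chi-square score for "goal" is the same whether the word occurs once or three times in a text, which is the kind of missing word frequency information the abstract describes. Likewise, the baseline TFIDF weight depends only on how many texts contain the term, not on how those texts are spread across categories.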
Keywords/Search Tags:text classification, feature selection, feature weighting, multi-category text distribution, frequency distribution, TextRank