Research On Improvement Of Feature Engineering Algorithm Based On Text Classification

Posted on:2023-01-01

Degree:Master

Type:Thesis

Country:China

Candidate:H J Yin

GTID:2558307070973689

Subject:Applied statistics

Abstract/Summary:

High-precision text classification is an effective way to manage massive amounts of text data. Well-designed feature selection and feature weighting algorithms help computers identify the content and category of a text efficiently. After a review of filter-based feature selection algorithms and the TFIDF feature weighting algorithm, this thesis improves three feature selection algorithms (the chi-square statistic, information gain, and expected cross entropy) together with the TFIDF feature weighting algorithm.

First, to further improve classification accuracy, the thesis addresses two shortcomings of filter-based feature selection: the "low frequency word defect" of the chi-square statistic, and the fact that information gain and expected cross entropy do not take word frequency information into account. From a statistical point of view, three factors are combined to improve the chi-square statistic, information gain, and expected cross entropy: the multi-category text distribution of a feature, the dispersion coefficient of its frequency distribution, and a feature word length factor.

Second, because the TFIDF weighting algorithm does not consider how the texts containing a feature are distributed, the weights it assigns to features in the text set can be biased. The multi-category text distribution of a feature is therefore introduced into the TFIDF weight calculation.

Finally, the public Fudan text set and a crawled news text set are used as experimental data sets for empirical analysis. The TextRank algorithm is used to pre-extract features from each individual text, and the top 25% of the features in each text are kept as candidate features, which improves the computational efficiency of the feature selection process. The improved methods are then used for feature selection and feature weighting on the two text sets, with macro-averaged precision, macro-averaged recall, and macro-averaged F1 as evaluation indicators. The experiments show that, although the improvement on the crawled text set is not obvious, all evaluation indicators there exceed 96%, so accuracy is already high. On the Fudan text set, every evaluation indicator is higher after the improvement than before, indicating that the algorithm improvements are feasible. Comparing the two text sets in light of these results further verifies that the proposed method is more effective for classifying text sets with longer texts.
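For orientation, the baseline chi-square statistic and the macro-averaged evaluation indicators mentioned in the abstract can be sketched as below. This is a minimal illustration of the standard, unimproved formulas only. It does not reproduce the thesis's improved methods, whose formulas are not given in the abstract. The function names `chi_square` and `macro_prf`, the toy documents, and the choice to compute macro F1 as the mean of per-class F1 values are assumptions made for this example.

```python
def chi_square(docs, labels, term, category):
    """Baseline chi-square score of `term` for `category`.

    Uses document counts only (presence or absence of the term), so how often
    the term occurs inside a document is ignored. This is commonly cited as
    the origin of the "low frequency word defect".
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_category = label == category
        if has_term and in_category:
            a += 1      # in category, contains term
        elif has_term:
            b += 1      # other category, contains term
        elif in_category:
            c += 1      # in category, lacks term
        else:
            d += 1      # other category, lacks term
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def macro_prf(y_true, y_pred):
    """Macro-averaged precision, recall, and F1 (unweighted mean over classes)."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for k in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == k and p == k)
        fp = sum(1 for t, p in pairs if t != k and p == k)
        fn = sum(1 for t, p in pairs if t == k and p != k)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    m = len(classes)
    return sum(precisions) / m, sum(recalls) / m, sum(f1s) / m


# Toy data (hypothetical): each document is a list of tokens.
docs = [["goal", "match"], ["goal", "team"], ["stock", "market"], ["stock", "goal"]]
labels = ["sports", "sports", "finance", "finance"]

stock_score = chi_square(docs, labels, "stock", "finance")
macro_p, macro_r, macro_f1 = macro_prf(["a", "a", "b", "b"], ["a", "b", "b", "b"])
```

In this toy set, "stock" appears in every finance document and in no sports document, so it receives the highest possible score for four documents, while a term absent from all documents scores zero.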

Keywords/Search Tags:

text classification, feature selection, feature weighting, multi-category text distribution, frequency distribution, TextRank

Related items

1	Research On Text Classification Based On Feature Selection And Feature Weighting Algorithm
2	Research Of Feature Selection And Weighting Algorithm In Text Classification System Based On SVM
3	Research On Feature Selection And Feature Weighting Of Text Classification
4	Research On Feature Selection And Weighting Methods Based On Terms Distribution
5	Research On Chi-square Statistic Feature Selection Method And TF-IDF Feature Weighting Method For Chinese Text Classification
6	Research And Application Of Feature Selection And Feature Weighting Algorithm Of Text Classification
7	Research On Some Problems In Text Classification
8	The Method Of Text Categorization Scheme Selection And Development Of A Prototype System
9	The Research On Feature Selection Methods For Text Classification
10	Research And Implementation Of Text Classification Algorithm