
Research On Local Feature Selection Of Chinese Text

Posted on: 2021-05-04
Degree: Master
Type: Thesis
Country: China
Candidate: H Wang
Full Text: PDF
GTID: 2428330626455450
Subject: Applied Statistics
Abstract/Summary:
With the rapid development of the Internet, large amounts of data are being generated in every field, and making full use of these data is an urgent problem. When processing document data, automatic text classification is often used to organize and store documents, and this technology still needs improvement. Reducing the feature dimension is particularly important. Local feature selection reduces the dimension of the feature space by removing redundant keywords and accurately selecting the keywords that represent each category, which improves the performance of the classifier and, in turn, the accuracy of text classification.

The traditional chi-square (CHI) statistic, when used for feature selection, considers only whether a keyword occurs in a category and is affected by negative correlation. To address these shortcomings, this thesis introduces a correction factor and a word-frequency weight to obtain an improved CHI feature selection method. Building on the improved method, a word co-occurrence matrix is introduced and a new local feature selection method is proposed. It considers both the correlation between keywords and categories and the semantic relationships among keywords, so that the selected text features contain as little redundant information as possible and the dimension of the feature vector is reduced.

First, the improved chi-square feature selection method is compared with the traditional chi-square method in simulated classification experiments. Second, XGBoost (gradient boosting) is used to calculate the importance of each keyword in the classification process. From the document-word matrix, a category-word frequency matrix is obtained, and the improved chi-square method is used to compute the category chi-square matrix. Each keyword is assigned to the category for which its chi-square value is largest, giving an initial set of representative keywords for each category. Then the local word co-occurrence strength matrix is calculated, thresholds are set, and keyword importance is compared. Keywords with low importance are eliminated to reduce redundancy. Finally, the keyword subsets obtained for each category under different local word co-occurrence information are intersected to obtain the final keyword subset for each document category, and the subsets of all categories are combined into a global keyword subset, which serves as the feature variables of the vector space used to represent the text data.

The experiments use two types of data: class-balanced text data and class-imbalanced text data. Local feature selection is applied to both data sets, several classification algorithms are used, and classification accuracy metrics are compared before and after feature selection. The results show that the local feature selection method is suitable for both class-balanced and class-imbalanced data, and that applying it to text classification yields better classification results.
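The selection pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's actual method: it uses the standard chi-square statistic rather than the improved one (the correction factor and word-frequency weight are not specified in the abstract), it handles negative correlation by simply zeroing negatively correlated scores, it measures co-occurrence strength with the Jaccard index, and it omits the final intersection step. The function names, the `importance` input (which could come from XGBoost feature importances), and the `top_k` and `threshold` parameters are all hypothetical.

```python
import numpy as np


def chi_square_scores(X, y, n_classes):
    """Standard chi-square score for every (term, class) pair.

    X: (n_docs, n_terms) document-term matrix (any positive entry = term present).
    y: (n_docs,) integer class labels in [0, n_classes).
    Negatively correlated pairs are set to 0 (an assumed, simple fix).
    """
    X = (np.asarray(X) > 0).astype(float)
    y = np.asarray(y)
    n_docs, n_terms = X.shape
    scores = np.zeros((n_terms, n_classes))
    for c in range(n_classes):
        in_c = y == c
        n11 = X[in_c].sum(axis=0)        # docs in class c that contain the term
        n10 = X[~in_c].sum(axis=0)       # docs outside c that contain the term
        n01 = in_c.sum() - n11           # docs in c without the term
        n00 = (~in_c).sum() - n10        # docs outside c without the term
        num = n_docs * (n11 * n00 - n10 * n01) ** 2
        den = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
        chi = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
        scores[:, c] = np.where(n11 * n00 > n10 * n01, chi, 0.0)
    return scores


def local_keyword_subsets(scores, top_k):
    """Assign each term to its best-scoring class, then keep top_k per class."""
    best_class = scores.argmax(axis=1)
    subsets = {}
    for c in range(scores.shape[1]):
        terms = np.flatnonzero(best_class == c)
        ranked = terms[np.argsort(-scores[terms, c], kind="stable")]
        subsets[c] = [int(t) for t in ranked[:top_k]]
    return subsets


def prune_by_cooccurrence(X, terms, importance, threshold):
    """Greedily keep terms in descending importance; drop a term whose
    Jaccard co-occurrence with an already kept term exceeds threshold."""
    B = (np.asarray(X)[:, terms] > 0).astype(float)
    co = B.T @ B                         # co[i, j] = docs containing both terms
    df = np.diag(co)                     # document frequency of each term
    union = df[:, None] + df[None, :] - co
    strength = np.divide(co, union, out=np.zeros_like(co), where=union > 0)
    order = sorted(range(len(terms)), key=lambda i: -importance[terms[i]])
    kept = []
    for i in order:
        if all(strength[i, j] <= threshold for j in kept):
            kept.append(i)
    return [terms[i] for i in sorted(kept)]


def global_keyword_subset(subsets):
    """Union of the per-class keyword subsets, as sorted term indices."""
    return sorted(set().union(*subsets.values()))
```

In this sketch, a typical call sequence would compute `chi_square_scores`, pass the result to `local_keyword_subsets`, prune each class's list with `prune_by_cooccurrence`, and merge the results with `global_keyword_subset`. The returned values are column indices into the document-term matrix.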
Keywords/Search Tags:Local feature selection, Chi-square statistics, XGBoost, Word co-occurrence analysis, Text classification