
Research And Improvement Of Feature Selection Algorithm In Chinese Text Classification

Posted on: 2019-06-11
Degree: Master
Type: Thesis
Country: China
Candidate: R L Shi
Full Text: PDF
GTID: 2428330548476396
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of information technology, the volume of text data on the Internet is growing exponentially. Text classification systems therefore suffer from the "curse of dimensionality" and from highly sparse feature spaces, both of which seriously degrade classification performance. This thesis takes feature selection in text classification as its main research object, studies the chi-square (CHI) and information gain feature selection algorithms in depth, and proposes improvements to both.

The traditional CHI algorithm ignores term frequency and inflates the weights of feature terms that are negatively correlated with a text category. To address these defects, this thesis proposes a self-adjusting feature selection method based on the traditional CHI algorithm. The method introduces an adjustment scale factor that automatically balances the weights of terms positively and negatively correlated with a text category, which eliminates the error caused by setting the scale factor manually. It also introduces a term frequency factor and the between-class variance, so that the selected terms occur frequently in one particular category and are sparsely distributed in the others, improving the precision of feature selection.

The traditional information gain algorithm likewise ignores term frequency and takes insufficient account of how dispersed a term is across the collection. Building on the traditional algorithm, this thesis introduces feature frequency ratio and dispersion information to reduce the influence of unevenly distributed terms on feature selection. It then further optimizes the algorithm with respect to the part of the traditional information gain formula that covers a term not appearing in a text category, again improving the precision of feature selection.

Experiments were designed to verify the two improved algorithms. The comparative experiments show that the improved CHI algorithm achieves better classification results on a uniform corpus, while the improved information gain method achieves good classification results on a non-uniform corpus.
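
The abstract does not reproduce the formulas, so the following is only a minimal sketch of the traditional (unimproved) CHI and information gain scores that the thesis takes as its starting point, written in their common textbook form for a single term and a single category. The function names and the document counts `a`, `b`, `c`, `d` are assumptions made for illustration; the thesis's adjustment scale factor, term frequency factor, and between-class variance are not shown, because the abstract does not define them.

```python
import math


def chi_square(a, b, c, d):
    """Traditional CHI score for one term and one category.

    Assumed document counts (illustrative names, not from the thesis):
      a: documents in the category that contain the term
      b: documents outside the category that contain the term
      c: documents in the category that lack the term
      d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    # (a*d - b*c) is squared, so the sign of the correlation is lost.
    return n * (a * d - b * c) ** 2 / denom


def _entropy(counts):
    """Shannon entropy (base 2) of a list of non-negative counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((k / total) * math.log2(k / total) for k in counts if k > 0)


def information_gain(a, b, c, d):
    """Traditional information gain of a term for a binary category split.

    Uses document presence/absence only, so term frequency is ignored.
    """
    n = a + b + c + d
    if n == 0:
        return 0.0
    h_class = _entropy([a + c, b + d])   # entropy before seeing the term
    h_present = _entropy([a, b])         # entropy given the term appears
    h_absent = _entropy([c, d])          # entropy given the term is absent
    return h_class - ((a + b) / n) * h_present - ((c + d) / n) * h_absent
```

In this sketch, `chi_square(10, 0, 0, 10)` and `chi_square(0, 10, 10, 0)` return the same score even though the term is positively correlated with the category in the first case and negatively correlated in the second. Both functions also count documents rather than occurrences, so neither uses term frequency; these are the weaknesses of the traditional algorithms that the abstract describes.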
Keywords/Search Tags: Text Classification, Feature Selection, Chi-square Statistic, Information Entropy