| Text classification is an effective way to analyze and exploit massive text data. One of its main difficulties is the high dimensionality of the feature space, which makes classification very inefficient. Dimension reduction is therefore the first problem to solve in a text classification system. As a technique for reducing the dimension of the feature vector space, the feature extraction method directly affects classification performance. Many studies show that information gain (IG) is a relatively good feature selection method. However, the IG algorithm still has limitations and room for optimization in text classification. This paper improves the algorithm in three respects: (1) To balance the influence of each feature word on the information gain score, this paper uses the sigmoid function to propose an improved information gain algorithm with a word-frequency γ regulator. (2) To reflect the relationship between how uniformly a feature word is distributed across classes and its discriminative ability, and to ensure that feature words distributed non-uniformly across classes are credited with strong discriminative ability, this paper optimizes the information gain score with a focus on the class distribution of feature words. (3) Real text data contain a large amount of imbalanced text. If the algorithm does not take into account the number of documents in each class, it will favor the feature words of the major classes and ignore those of the small classes. To avoid this, this paper introduces the chi-square test from statistics and proposes a classification method for imbalanced documents. 
As a consequence of these optimizations, the algorithm still maintains good performance when the feature dimension is small. Comparative experiments show that the improved algorithm achieves better accuracy, recall, and F1 value in each class than the traditional algorithm. The optimized IG algorithm is therefore shown to be feasible and effective. |
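
For orientation, the traditional information gain score that the improvements start from can be sketched as follows. This is a minimal illustration of the standard textbook definition, IG(t) = H(C) − [P(t)·H(C|t) + P(¬t)·H(C|¬t)], and not the improved algorithm of this paper. The function names `entropy` and `information_gain` and the toy documents are assumptions made for the example.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, term):
    """Standard IG of a term's presence/absence with respect to the class labels.

    docs   : list of token sets, one per document (hypothetical input format)
    labels : list of class labels, aligned with docs
    """
    with_term, without_term = Counter(), Counter()
    for tokens, label in zip(docs, labels):
        if term in tokens:
            with_term[label] += 1
        else:
            without_term[label] += 1

    n = len(labels)
    n_with = sum(with_term.values())
    n_without = n - n_with

    h_class = entropy(list(Counter(labels).values()))
    # Class entropy conditioned on whether the term occurs in the document.
    h_cond = (n_with / n) * entropy(list(with_term.values())) \
        + (n_without / n) * entropy(list(without_term.values()))
    return h_class - h_cond


# Toy corpus (illustrative only): "goal" separates the two classes perfectly,
# while "match" occurs equally often in both classes.
docs = [{"goal", "match"}, {"goal", "team"}, {"vote", "party"}, {"vote", "match"}]
labels = ["sport", "sport", "politics", "politics"]
for term in ("goal", "match"):
    print(term, information_gain(docs, labels, term))
```

Because this baseline uses only document-level presence or absence of a term, it ignores how often the word occurs within a document and how many documents each class contains. These are the aspects that the word-frequency regulator and the chi-square based adjustment described above are meant to address.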