Analysis And Study On Feature Selection Method In Chinese Text Categorization

Posted on: 2013-08-17
Degree: Master
Type: Thesis
Country: China
Candidate: Y W Guo
Full Text: PDF
GTID: 2248330374471784
Subject: Computer software and theory
Abstract/Summary:
In Chinese text categorization, feature selection is commonly used to reduce the dimensionality of the feature space. Building on a careful study of feature selection methods and a survey of current domestic research, this thesis addresses two problems: the information gain method loses classification efficiency on corpora with an unbalanced class distribution, and the mutual information method shows low classification performance on corpora with a balanced class distribution. The thesis proposes improvements to both methods and develops a text categorization system for Chinese. The main work is as follows:

1) Traditional information gain ignores how a feature is distributed inside a class and between classes. To address this shortcoming, the thesis introduces the Distribution Information inside Class and the Concentration Information between Classes, which are used to identify features that are strongly correlated with a class. Traditional information gain also does not combine positive and negative features well, so a proportional factor is introduced to balance the effect of a feature's presence and absence. This decreases the influence of negative features on corpora with an uneven category distribution and improves classification.

2) Traditional mutual information tends not to select high-frequency features, which leads to poor classification efficiency. The thesis introduces the co-occurrence probability P(c, k) of feature and category to select strongly relevant words. Traditional mutual information also ignores how evenly a feature is distributed across the documents within a category, so the dispersion deviation within the class is introduced to characterize how balanced the feature's distribution is inside the class, which further improves the classification efficiency of the method.

3) Based on the analysis above, the thesis implements a text categorization system for Chinese. Its feature selection stage provides four common feature selection methods together with the improved information gain and mutual information methods, and experiments compare and evaluate the classification performance of these algorithms. The system provides a platform for further study of technologies related to Chinese text categorization.
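For orientation, the two traditional baselines that the thesis sets out to improve can be sketched as follows. This is a minimal illustration using the textbook definitions of information gain and pointwise mutual information, assuming each document is represented as a set of tokens. It does not reproduce the improved variants proposed in the thesis, since the abstract does not give their formulas. The function names and the toy corpus are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def information_gain(docs, labels, term):
    """Textbook IG: H(C) - P(t) * H(C | t present) - P(not t) * H(C | t absent)."""
    n = len(docs)
    classes = sorted(set(labels))
    present = [lab for doc, lab in zip(docs, labels) if term in doc]
    absent = [lab for doc, lab in zip(docs, labels) if term not in doc]

    def class_entropy(subset):
        if not subset:
            return 0.0
        return entropy([subset.count(c) / len(subset) for c in classes])

    return (class_entropy(labels)
            - len(present) / n * class_entropy(present)
            - len(absent) / n * class_entropy(absent))


def mutual_information(docs, labels, term, cls):
    """Textbook pointwise MI: log2(P(t, c) / (P(t) * P(c))), from document counts."""
    n = len(docs)
    n_tc = sum(1 for doc, lab in zip(docs, labels) if term in doc and lab == cls)
    n_t = sum(1 for doc in docs if term in doc)
    n_c = labels.count(cls)
    if n_tc == 0:
        return float("-inf")  # term and class never co-occur
    return math.log2(n_tc * n / (n_t * n_c))


# Hypothetical toy corpus: each document is a set of already-segmented tokens.
toy_docs = [
    {"ball", "team"},
    {"ball", "score"},
    {"stock", "market"},
    {"stock", "team"},
]
toy_labels = ["sports", "sports", "finance", "finance"]

if __name__ == "__main__":
    for term in ("ball", "team"):
        print(term,
              information_gain(toy_docs, toy_labels, term),
              mutual_information(toy_docs, toy_labels, term, "sports"))
```

In this toy corpus, "ball" appears only in sports documents and scores highest on both measures, while "team" is spread evenly over both classes and scores zero. Note that the information gain sum includes the term-absent branch; this is the negative-feature contribution that the proportional factor in item 1) is meant to rebalance.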
Keywords/Search Tags:text categorization, feature selection, information gain, mutual information