
Study Of Mutual Information Feature Selection In Chinese Text Classification

Posted on: 2012-11-09
Degree: Master
Type: Thesis
Country: China
Candidate: C F Deng
Full Text: PDF
GTID: 2178330335456047
Subject: Computer software and theory

Abstract/Summary:
With the development of computer and Internet technology, people can access more and more information, most of which exists in text form. How to accurately, comprehensively, and quickly mine the information that users are interested in from such massive document collections is a key question. Text categorization, one of the key technologies for solving this problem, has become an effective means of obtaining information. Reducing the dimensionality of the high-dimensional feature set is one of the difficulties of text categorization. Feature selection has been applied effectively in text classification because of its low computational complexity, and the choice of feature selection method directly affects the result of text categorization.

Many studies show that mutual information (MI) is a good feature selection criterion. MI has two main properties that distinguish it from other dependency measures: first, the capacity to measure any kind of relationship between variables; second, its invariance under space transformations. However, the traditional mutual information approach still has the following disadvantages: (1) it considers only the document frequency of a term in the corpus, without taking into account the term frequency within each category; (2) it focuses on the correlation between terms and categories, without considering the connections between terms; (3) the number of texts in each category of the corpus also influences the value of mutual information. Some researchers have proposed enhancements to address these disadvantages. Tan Jinbo increased the weight of high-frequency words by introducing a function of the probability that a term appears in the corpus and by selecting features from each class. Qin Jin introduced a correction factor to reduce the influence of unbalanced category sizes.

To remedy the defects of the traditional mutual information method, this thesis improves the mutual information measure by introducing the feature frequency within a class and the dispersion of a feature across a class, limiting the minimum term frequency, and incorporating a minimum feature redundancy measure from the mRMR (minimum redundancy, maximum relevance) model. Another contribution of this thesis is an experimental platform built as a Chinese text classification system that supports text preprocessing, feature selection, and text classification. The system is accordingly divided into three modules, each of which is independent and exposes a unified interface.

To verify the efficiency and feasibility of the improved feature selection approach, multiple sets of experiments were conducted on the Chinese text categorization test platform. Recall, precision, and F1 are used as the evaluation indicators. The results show that the new feature selection approach reduces dimensionality more effectively than the traditional mutual information approach and several existing improved approaches, which demonstrates that the improved mutual information feature selection approach is feasible and effective.
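The abstract does not give the exact weighting formulas, so the following is only a minimal Python sketch of the baseline it critiques: document-frequency-based mutual information feature selection, with a minimum-term-frequency cutoff like the one the thesis imposes. The names (mi_score, select_features, docs, labels, min_tf) are illustrative assumptions, not the thesis's implementation; the in-class frequency, class dispersion, and mRMR redundancy terms described above are not reproduced because their definitions are not stated in the abstract.

# Minimal sketch (not the thesis's code): MI feature selection over tokenized documents.
import math
from collections import Counter, defaultdict

def mi_score(df_t_c, df_t, n_c, n_docs):
    # MI(t, c) = log( P(t, c) / (P(t) * P(c)) ), estimated from document frequencies.
    p_tc = df_t_c / n_docs
    p_t = df_t / n_docs
    p_c = n_c / n_docs
    if p_tc == 0:
        return 0.0
    return math.log(p_tc / (p_t * p_c))

def select_features(docs, labels, k=1000, min_tf=2):
    # Keep the top-k terms by their maximum MI over categories;
    # min_tf drops very rare terms, echoing the minimum-term-frequency limit above.
    n_docs = len(docs)
    df_t = Counter()                  # document frequency of each term
    df_t_c = defaultdict(Counter)     # document frequency of each term per category
    tf_t = Counter()                  # corpus-wide term frequency
    n_c = Counter(labels)             # number of documents per category
    for doc, label in zip(docs, labels):
        for t in set(doc):
            df_t[t] += 1
            df_t_c[label][t] += 1
        tf_t.update(doc)
    scores = {}
    for t, df in df_t.items():
        if tf_t[t] < min_tf:
            continue
        scores[t] = max(mi_score(df_t_c[c][t], df, n_c[c], n_docs) for c in n_c)
    return [t for t, _ in sorted(scores.items(), key=lambda x: -x[1])[:k]]

# Example usage with already-segmented Chinese documents (illustrative data):
docs = [["机器", "学习"], ["文本", "分类"], ["机器", "分类"]]
labels = ["tech", "nlp", "nlp"]
print(select_features(docs, labels, k=2, min_tf=1))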
Keywords/Search Tags: text classification, feature selection, mutual information