Research And Implementation Of Text Classification Feature Selection

Posted on: 2012-10-10
Degree: Master
Type: Thesis
Country: China
Candidate: X L Fan
GTID: 2208330332993905
Subject: Computer software and theory
Abstract/Summary:
Feature selection plays an important role in text categorization. This thesis proposes an improved mutual information feature selection method and an improved Chi-square statistic feature selection method. They address two problems: mutual information feature selection classifies poorly on corpora with unevenly distributed classes, and Chi-square statistic feature selection classifies poorly on corpora with evenly distributed classes. Based on these methods, a classification system for Chinese text categorization is designed and implemented. The main work is as follows:

1) Mutual information feature selection classifies poorly on corpora with unevenly distributed classes. After studying and analyzing the factors that affect the classification results of mutual information on such corpora, the thesis uses a balance factor to adjust the ratio of positive features to negative features and so strengthen the effect of negative features, and uses a feature distribution factor to distinguish features that are strongly related to a category. Experiments show that the improved mutual information feature selection method improves the classification results.

2) Chi-square statistic feature selection tends to choose high-frequency words and considers only the document frequency of a feature. To resolve this problem and improve the classification results, the thesis improves Chi-square statistic feature selection by adjusting the classification effect contributed by positive and negative features, and by incorporating a symmetric entropy factor, which relies on high-frequency features that distinguish documents, as a text autocorrelation factor.

3) In addition to the research above, the thesis designs and develops a classification system for Chinese text. The system implements the two improved methods proposed in this thesis and several common feature selection methods: mutual information feature selection, information gain (IG) feature selection, Chi-square (CHI) feature selection, and document frequency feature selection. Finally, their classification results are compared by experiment.
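For orientation, the two baseline scores that the thesis sets out to improve can be sketched from a per-term, per-category document count table. This is a minimal sketch of the standard textbook forms of the Chi-square statistic and (pointwise) mutual information, not the improved methods of the thesis: the abstract does not define the balance factor, the feature distribution factor, or the symmetric entropy factor, so none of them appears here. The function names and the toy corpus are assumptions made for illustration only.

```python
import math

def contingency(docs, labels, term, category):
    """Count documents by (term present?, document in category?).

    Returns (a, b, c, d):
      a = term present, in category      b = term present, not in category
      c = term absent,  in category      d = term absent,  not in category
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_cat = label == category
        if has_term and in_cat:
            a += 1
        elif has_term:
            b += 1
        elif in_cat:
            c += 1
        else:
            d += 1
    return a, b, c, d

def chi_square(a, b, c, d):
    """Standard Chi-square statistic for one term/category pair."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: term or category never varies
    return n * (a * d - c * b) ** 2 / denom

def mutual_information(a, b, c, d):
    """Pointwise mutual information log(P(t, c) / (P(t) * P(c)))."""
    n = a + b + c + d
    if a == 0:
        return float("-inf")  # term never co-occurs with the category
    return math.log(a * n / ((a + c) * (a + b)))

# Hypothetical toy corpus: each document is a set of tokens.
docs = [{"ball", "game"}, {"ball", "team"}, {"stock", "market"}, {"stock", "game"}]
labels = ["sports", "sports", "finance", "finance"]

ball = contingency(docs, labels, "ball", "sports")
game = contingency(docs, labels, "game", "sports")
print(chi_square(*ball), mutual_information(*ball))
print(chi_square(*game), mutual_information(*game))
```

In this toy corpus "ball" occurs only in sports documents and scores high on both measures, while "game" is spread evenly across both categories and scores zero on both. The Chi-square score squares the term (a*d - c*b), so a term that is strongly absent from a category (a negative feature) scores the same as one that is strongly present, which is one reason to treat positive and negative features separately as described in the points above.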
Keywords/Search Tags: text categorization, feature selection, mutual information, Chi-square statistic