
Unbalanced Text Classification Feature Selection

Posted on: 2014-04-17
Degree: Master
Type: Thesis
Country: China
Candidate: Q Wang
Full Text: PDF
GTID: 2268330401469529
Subject: Education Technology
Abstract/Summary:
This thesis focuses on the key issues of feature selection for unbalanced text classification and develops a detailed feature-selection scheme for unbalanced text categorization. It summarizes the particularities of the problem: the urgent task is to improve classification accuracy on the minority classes after feature selection, under the premise of not reducing classification performance on the entire data set.

The author analyzes in detail two main feature-selection algorithms suited to unbalanced text (the DFICF algorithm and an improved MI algorithm), summarizing the advantages of each and also pointing out their shortcomings:

1. The DFICF algorithm, on the one hand, exploits the high DF values of high-frequency terms to ensure that most high-frequency terms, which carry more textual information over the entire data set, are selected into the feature subset; on the other hand, it accounts for the small number of texts in minority classes. By introducing the ICF measure, the algorithm also favors the low-frequency terms of minority classes. DFICF thus balances the competing needs to select both high-frequency and low-frequency terms. However, DFICF is bounded by the number of classes and the class distribution of the training set: it is sensitive to changes in the class distribution and in the total number of training texts.

2. The improved MI algorithm takes into account not only the class distribution of the training samples (the class factor) but also the distribution of the training samples in which each feature appears. This reduces the influence of uneven distribution on mutual information, but the algorithm has higher computational complexity.
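The abstract does not give DFICF's exact formula. A minimal sketch of one common DF x ICF formulation in Python, assuming ICF(t) = log(|C| / CF(t)), where CF(t) is the number of classes whose documents contain term t (the function name `dficf_scores` and this exact combination are illustrative assumptions, not the thesis's definition):

```python
import math
from collections import defaultdict

def dficf_scores(docs, labels):
    """Score each term by DF * ICF (an assumed formulation).

    docs:   list of token lists, one per document
    labels: parallel list of class labels
    DF(t)  = number of documents containing t
    ICF(t) = log(|C| / CF(t)), CF(t) = number of classes containing t
    """
    df = defaultdict(int)   # document frequency per term
    cf = defaultdict(set)   # set of classes in which each term appears
    classes = set(labels)
    for tokens, label in zip(docs, labels):
        for t in set(tokens):        # count each term once per document
            df[t] += 1
            cf[t].add(label)
    # Terms appearing in every class get ICF = log(1) = 0 under this form.
    return {t: df[t] * math.log(len(classes) / len(cf[t])) for t in df}
```

Note how this variant zeroes out terms that occur in every class, which illustrates the sensitivity to class distribution the abstract mentions.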
When computing the correlation between a feature and each category, if the numbers of texts in different categories differ by orders of magnitude, this feature-selection scheme tends to sacrifice overall classification accuracy to improve local (minority-class) accuracy.

The second part of the thesis focuses on the deficiencies of existing selection algorithms and proposes improvements for the unbalanced setting. Combining three factors yields the new algorithm TIM. TIM builds on the mutual-information feature-selection algorithm and retains its tendency toward low-frequency terms; the two added factors, TF and ICF, aim to prevent mutual-information feature selection from tending excessively toward low-frequency terms. In experiments, the classification F1 of the TIM feature-selection algorithm on minority-class samples exceeds that of the DFICF algorithm, and its macro-F1 over the entire text set is clearly improved compared with the standard MI algorithm, the DFICF algorithm, and the improved MI algorithm.
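The abstract states only that TIM combines MI with TF and ICF factors, without giving the formula. A hedged sketch of one plausible combination, score(t) = TF(t) x ICF(t) x max_c MI(t, c), with +1 smoothing inside the logarithms to avoid log(0) (the function name `tim_scores` and every formula detail here are assumptions for illustration, not the thesis's definition of TIM):

```python
import math
from collections import Counter, defaultdict

def tim_scores(docs, labels):
    """Hypothetical TIM-style score: mutual information weighted by TF and ICF."""
    n = len(docs)
    tf = Counter()                  # total term frequency over the corpus
    df_tc = defaultdict(Counter)    # per-class document frequency: df_tc[t][c]
    cf = defaultdict(set)           # classes in which each term appears
    class_count = Counter(labels)   # documents per class
    for tokens, label in zip(docs, labels):
        tf.update(tokens)
        for t in set(tokens):
            df_tc[t][label] += 1
            cf[t].add(label)
    scores = {}
    for t in tf:
        icf = math.log(len(class_count) / len(cf[t]) + 1)  # smoothed ICF
        # MI(t, c) ~ log(P(t, c) / (P(t) P(c))), estimated from document counts
        mi_max = max(
            math.log(df_tc[t][c] * n
                     / (sum(df_tc[t].values()) * class_count[c]) + 1)
            for c in class_count
        )
        # TF and ICF damp MI's known bias toward rare terms
        scores[t] = tf[t] * icf * mi_max
    return scores
```

Under this sketch a frequent, class-discriminative term outscores a rare term with the same class profile, which is the direction of correction the abstract describes.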
Keywords/Search Tags:Unbalanced text classification, Feature selection, Mutual information, TIM