
Unbalanced Text Classification Feature Selection

Posted on: 2014-04-17
Degree: Master
Type: Thesis
Country: China
Candidate: Q Wang
Full Text: PDF
GTID: 2268330401469529
Subject: Education Technology
Abstract/Summary:
This thesis focuses on the key issues of feature selection for unbalanced text classification and develops a detailed feature-selection scheme for unbalanced text categorization. It summarizes the particularities of the problem: the urgent task is to improve classification accuracy on the minority classes after feature selection, under the premise of not reducing classification performance on the entire data set.

The author analyzes in detail two main feature-selection algorithms suited to unbalanced text (the DFICF algorithm and an improved MI algorithm), summarizing the advantages of each and also pointing out their shortcomings:

1. The DFICF algorithm, on the one hand, exploits the high DF values of high-frequency terms to ensure that most high-frequency terms, which carry more textual information over the entire data set, are selected into the feature subset; on the other hand, it accounts for the small number of texts in minority classes. By introducing the ICF measure, the algorithm also favors the low-frequency terms of minority classes. DFICF thus balances the competing needs to select both high-frequency and low-frequency terms. However, DFICF is bounded by the number of classes and the class distribution of the training set: it is sensitive to changes in the class distribution and in the total number of training texts.

2. The improved MI algorithm takes into account not only the class distribution of the training samples (the class factor) but also the distribution of the training samples in which each feature appears. This reduces the influence of uneven distribution on mutual information, but the algorithm has higher computational complexity.
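The abstract does not give DFICF's exact formula. A minimal sketch of one common DF x ICF formulation in Python, assuming ICF(t) = log(|C| / CF(t)), where CF(t) is the number of classes whose documents contain term t (the function name `dficf_scores` and this exact combination are illustrative assumptions, not the thesis's definition):

```python
import math
from collections import defaultdict

def dficf_scores(docs, labels):
    """Score each term by DF * ICF (an assumed formulation).

    docs:   list of token lists, one per document
    labels: parallel list of class labels
    DF(t)  = number of documents containing t
    ICF(t) = log(|C| / CF(t)), CF(t) = number of classes containing t
    """
    df = defaultdict(int)   # document frequency per term
    cf = defaultdict(set)   # set of classes in which each term appears
    classes = set(labels)
    for tokens, label in zip(docs, labels):
        for t in set(tokens):        # count each term once per document
            df[t] += 1
            cf[t].add(label)
    # Terms appearing in every class get ICF = log(1) = 0 under this form.
    return {t: df[t] * math.log(len(classes) / len(cf[t])) for t in df}
```

Note how this variant zeroes out terms that occur in every class, which illustrates the sensitivity to class distribution the abstract mentions.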
When computing the correlation between a feature and each category, if the numbers of texts in different categories differ by orders of magnitude, this feature-selection scheme tends to sacrifice overall classification accuracy to improve local (minority-class) accuracy.

The second part of the thesis focuses on the deficiencies of existing selection algorithms and proposes improvements for the unbalanced setting. Combining three factors yields the new algorithm TIM. TIM builds on the mutual-information feature-selection algorithm and retains its tendency toward low-frequency terms; the two added factors, TF and ICF, aim to prevent mutual-information feature selection from tending excessively toward low-frequency terms. In experiments, the classification F1 of the TIM feature-selection algorithm on minority-class samples exceeds that of the DFICF algorithm, and its macro-F1 over the entire text set is clearly improved compared with the standard MI algorithm, the DFICF algorithm, and the improved MI algorithm.
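The abstract states only that TIM combines MI with TF and ICF factors, without giving the formula. A hedged sketch of one plausible combination, score(t) = TF(t) x ICF(t) x max_c MI(t, c), with +1 smoothing inside the logarithms to avoid log(0) (the function name `tim_scores` and every formula detail here are assumptions for illustration, not the thesis's definition of TIM):

```python
import math
from collections import Counter, defaultdict

def tim_scores(docs, labels):
    """Hypothetical TIM-style score: mutual information weighted by TF and ICF."""
    n = len(docs)
    tf = Counter()                  # total term frequency over the corpus
    df_tc = defaultdict(Counter)    # per-class document frequency: df_tc[t][c]
    cf = defaultdict(set)           # classes in which each term appears
    class_count = Counter(labels)   # documents per class
    for tokens, label in zip(docs, labels):
        tf.update(tokens)
        for t in set(tokens):
            df_tc[t][label] += 1
            cf[t].add(label)
    scores = {}
    for t in tf:
        icf = math.log(len(class_count) / len(cf[t]) + 1)  # smoothed ICF
        # MI(t, c) ~ log(P(t, c) / (P(t) P(c))), estimated from document counts
        mi_max = max(
            math.log(df_tc[t][c] * n
                     / (sum(df_tc[t].values()) * class_count[c]) + 1)
            for c in class_count
        )
        # TF and ICF damp MI's known bias toward rare terms
        scores[t] = tf[t] * icf * mi_max
    return scores
```

Under this sketch a frequent, class-discriminative term outscores a rare term with the same class profile, which is the direction of correction the abstract describes.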
Keywords/Search Tags:Unbalanced text classification, Feature selection, Mutual information, TIM