
Study On Feature Selection And Feature Weighting Of Text Classification

Posted on: 2011-07-24    Degree: Master    Type: Thesis
Country: China    Candidate: J Jiang    Full Text: PDF
GTID: 2178360308958909    Subject: Computer system architecture
Abstract/Summary:
With the rapid development of Internet and information technology, people have access to an ever-growing volume of knowledge. How to find the desired information in a narrow field accurately, comprehensively and quickly from this huge volume has become a significant problem. Text classification, one of the key technologies for solving it, has become a research hotspot.

Text classification is a complex, systematic task that includes text preprocessing, feature dimension reduction, feature weighting, classifier training and classifier performance evaluation. After a detailed study of these stages, this thesis focuses on feature dimension reduction and feature weighting.

Reducing the dimensionality of the high-dimensional feature set is an essential part of text categorization. It not only improves the classifier's speed and saves storage space, but also filters out irrelevant attributes and reduces the interference that irrelevant information introduces into the categorization process; it therefore enhances classification accuracy and helps prevent over-fitting. Dimension reduction can be divided into two categories: feature extraction and feature selection. Feature selection has been applied effectively in text classification because it is simple, fast to compute and suitable for large-scale text data. The commonly used feature selection methods, such as document frequency, mutual information, information gain, expected cross entropy, the Chi-square statistic and weight of evidence for text, are studied in this thesis and the characteristics of each are analyzed. To overcome their deficiencies, a new feature selection approach is proposed that comprehensively considers a feature's concentration among categories, its distribution within a category and its average frequency within a category. The new approach is simple and effective: it highlights the positive correlation between features and categories, avoids the interference caused by negative correlations, and takes both the relevance between features and categories and the average in-class frequency of features into account.

Feature weighting improves the distribution of the text set in the vector space: it makes texts of the same category more compact and texts of different categories more dispersed, which simplifies the mapping from texts to categories and improves classifier performance. The classical feature weighting method, TF-IDF, is also studied in this thesis. Because TF-IDF does not take the inter-class and within-class distribution of features into account, rare features are given large weights while features that actually distinguish categories are given small weights. To make up for this defect, an improved TF-IDF formula is proposed that combines a feature's concentration among categories with its distribution within a category.
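The abstract does not give the exact scoring formula of the proposed feature selection approach. The sketch below only illustrates the idea: it assumes simple proxies for the three factors (concentration among categories, distribution within a category and average in-class frequency) and combines them by multiplication; the function name score_terms and the combination rule are assumptions, not the thesis's actual method.

from collections import defaultdict

def score_terms(docs, labels):
    # docs: list of token lists, labels: category label per document.
    # Returns one score per term; higher means the term is assumed to be
    # more useful for distinguishing categories (illustrative criterion only).
    df = defaultdict(lambda: defaultdict(int))   # df[t][c]: documents of category c containing t
    tf = defaultdict(lambda: defaultdict(int))   # tf[t][c]: total occurrences of t in category c
    n_docs = defaultdict(int)                    # documents per category
    for doc, c in zip(docs, labels):
        n_docs[c] += 1
        for t in set(doc):
            df[t][c] += 1
        for t in doc:
            tf[t][c] += 1
    scores = {}
    for t in df:
        total_df = sum(df[t].values())
        best = 0.0
        for c in n_docs:
            concentration = df[t][c] / total_df                      # share of t's documents that fall in category c
            distribution = df[t][c] / n_docs[c]                      # fraction of category-c documents containing t
            avg_freq = tf[t][c] / df[t][c] if df[t][c] else 0.0      # average in-class frequency of t
            best = max(best, concentration * distribution * avg_freq)
        scores[t] = best
    return scores

# Keep the top-k scoring terms as the reduced feature set.
scores = score_terms([["stock", "market", "stock"], ["match", "team"], ["stock", "trade"]],
                     ["finance", "sports", "finance"])
selected = sorted(scores, key=scores.get, reverse=True)[:2]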
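Likewise, the improved TF-IDF formula is only described qualitatively in the abstract. The sketch below shows classical TF-IDF and one plausible way to scale it by the two category-level factors the abstract mentions; the multiplicative form and the two ratios are assumptions made for illustration, not the thesis's exact formula.

import math

def tfidf(tf_td, df_t, n_docs):
    # Classical TF-IDF: term frequency in the document times the
    # inverse document frequency over the whole collection.
    return tf_td * math.log(n_docs / df_t)

def improved_tfidf(tf_td, df_t, n_docs, df_tc, n_docs_c):
    # df_tc: documents of the document's category that contain the term,
    # n_docs_c: number of documents in that category.
    concentration = df_tc / df_t       # concentration of the feature among categories
    distribution = df_tc / n_docs_c    # distribution of the feature within the category
    return tfidf(tf_td, df_t, n_docs) * concentration * distribution

# A term whose occurrences are concentrated in the target category keeps
# most of its TF-IDF weight, while a term spread thinly over many
# categories is down-weighted even if it is rare in the collection.
w_plain = tfidf(tf_td=3, df_t=40, n_docs=1000)
w_improved = improved_tfidf(tf_td=3, df_t=40, n_docs=1000, df_tc=35, n_docs_c=50)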
To verify the effectiveness of the new feature selection approach and the improved TF-IDF formula, several sets of experiments were carried out on a Chinese text categorization test platform, with recall, precision and F1 as the evaluation indicators. The results show that the new feature selection approach reduces dimensionality more effectively than the other mainstream feature selection methods, and that the improved TF-IDF weighting method performs better than traditional TF-IDF.
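For reference, the three indicators can be computed per category from the classifier's confusion counts; the helper below is the standard definition, not code from the thesis.

def precision_recall_f1(tp, fp, fn):
    # tp: documents correctly assigned to the category,
    # fp: documents wrongly assigned to it,
    # fn: documents of the category that were missed.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1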
Keywords/Search Tags:Text Classification, Vector Space Model, Feature Selection, Feature Weighting