
Study On Feature Selection And Feature Weighting Of Text Classification

Posted on: 2011-07-24    Degree: Master    Type: Thesis
Country: China    Candidate: J Jiang    Full Text: PDF
GTID: 2178360308958909    Subject: Computer system architecture
Abstract/Summary:
With the rapid development of Internet and information technology, people have access to an ever-growing volume of knowledge. How to find the desired information in a narrow field accurately, comprehensively and quickly from this huge volume has become a significant problem. Text classification, one of the key technologies for solving it, has become a research hotspot.

Text classification is a complex, systematic task that includes text preprocessing, feature dimension reduction, feature weighting, classifier training and classifier performance evaluation. After a detailed study of these stages, this thesis focuses on feature dimension reduction and feature weighting.

Reducing the dimensionality of the high-dimensional feature set is an essential part of text categorization. It not only improves the classifier's speed and saves storage space, but also filters out irrelevant attributes and reduces the interference that irrelevant information introduces into the categorization process; it therefore enhances classification accuracy and helps prevent over-fitting. Dimension reduction can be divided into two categories: feature extraction and feature selection. Feature selection has been applied effectively in text classification because it is simple, fast to compute and suitable for large-scale text data. The commonly used feature selection methods, such as document frequency, mutual information, information gain, expected cross entropy, the Chi-square statistic and weight of evidence for text, are studied in this thesis and the characteristics of each are analyzed. To overcome their deficiencies, a new feature selection approach is proposed that comprehensively considers a feature's concentration among categories, its distribution within a category and its average frequency within a category. The new approach is simple and effective: it highlights the positive correlation between features and categories, avoids the interference caused by negative correlations, and takes both the relevance between features and categories and the average in-class frequency of features into account.

Feature weighting improves the distribution of the text set in the vector space: it makes texts of the same category more compact and texts of different categories more dispersed, which simplifies the mapping from texts to categories and improves classifier performance. The classical feature weighting method, TF-IDF, is also studied in this thesis. Because TF-IDF does not take the inter-class and within-class distribution of features into account, rare features are given large weights while features that actually distinguish categories are given small weights. To make up for this defect, an improved TF-IDF formula is proposed that combines a feature's concentration among categories with its distribution within a category.
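The abstract does not give the exact scoring formula of the proposed feature selection approach. The sketch below only illustrates the idea: it assumes simple proxies for the three factors (concentration among categories, distribution within a category and average in-class frequency) and combines them by multiplication; the function name score_terms and the combination rule are assumptions, not the thesis's actual method.

from collections import defaultdict

def score_terms(docs, labels):
    # docs: list of token lists, labels: category label per document.
    # Returns one score per term; higher means the term is assumed to be
    # more useful for distinguishing categories (illustrative criterion only).
    df = defaultdict(lambda: defaultdict(int))   # df[t][c]: documents of category c containing t
    tf = defaultdict(lambda: defaultdict(int))   # tf[t][c]: total occurrences of t in category c
    n_docs = defaultdict(int)                    # documents per category
    for doc, c in zip(docs, labels):
        n_docs[c] += 1
        for t in set(doc):
            df[t][c] += 1
        for t in doc:
            tf[t][c] += 1
    scores = {}
    for t in df:
        total_df = sum(df[t].values())
        best = 0.0
        for c in n_docs:
            concentration = df[t][c] / total_df                      # share of t's documents that fall in category c
            distribution = df[t][c] / n_docs[c]                      # fraction of category-c documents containing t
            avg_freq = tf[t][c] / df[t][c] if df[t][c] else 0.0      # average in-class frequency of t
            best = max(best, concentration * distribution * avg_freq)
        scores[t] = best
    return scores

# Keep the top-k scoring terms as the reduced feature set.
scores = score_terms([["stock", "market", "stock"], ["match", "team"], ["stock", "trade"]],
                     ["finance", "sports", "finance"])
selected = sorted(scores, key=scores.get, reverse=True)[:2]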
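Likewise, the improved TF-IDF formula is only described qualitatively in the abstract. The sketch below shows classical TF-IDF and one plausible way to scale it by the two category-level factors the abstract mentions; the multiplicative form and the two ratios are assumptions made for illustration, not the thesis's exact formula.

import math

def tfidf(tf_td, df_t, n_docs):
    # Classical TF-IDF: term frequency in the document times the
    # inverse document frequency over the whole collection.
    return tf_td * math.log(n_docs / df_t)

def improved_tfidf(tf_td, df_t, n_docs, df_tc, n_docs_c):
    # df_tc: documents of the document's category that contain the term,
    # n_docs_c: number of documents in that category.
    concentration = df_tc / df_t       # concentration of the feature among categories
    distribution = df_tc / n_docs_c    # distribution of the feature within the category
    return tfidf(tf_td, df_t, n_docs) * concentration * distribution

# A term whose occurrences are concentrated in the target category keeps
# most of its TF-IDF weight, while a term spread thinly over many
# categories is down-weighted even if it is rare in the collection.
w_plain = tfidf(tf_td=3, df_t=40, n_docs=1000)
w_improved = improved_tfidf(tf_td=3, df_t=40, n_docs=1000, df_tc=35, n_docs_c=50)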
To verify the effectiveness of the new feature selection approach and the improved TF-IDF formula, several sets of experiments were carried out on a Chinese text categorization test platform, with recall, precision and F1 as the evaluation indicators. The results show that the new feature selection approach reduces dimensionality more effectively than the other mainstream feature selection methods, and that the improved TF-IDF weighting method performs better than traditional TF-IDF.
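For reference, the three indicators can be computed per category from the classifier's confusion counts; the helper below is the standard definition, not code from the thesis.

def precision_recall_f1(tp, fp, fn):
    # tp: documents correctly assigned to the category,
    # fp: documents wrongly assigned to it,
    # fn: documents of the category that were missed.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1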
Keywords/Search Tags:Text Classification, Vector Space Model, Feature Selection, Feature Weighting