Research On Feature Selection And Feature Weighting Of Text Classification

Posted on: 2015-09-10
Degree: Master
Type: Thesis
Country: China
Candidate: T S Du
Full Text: PDF
GTID: 2298330467463523
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of the Internet, people have access to more and more data, most of it in text form. How to find the required information in this data accurately has become an important research topic. As a key technology for organizing and processing large amounts of text data, text classification has been widely used in many fields.

Text classification is a complex piece of systems engineering. Based on an analysis of text preprocessing, feature selection, feature weighting, classification algorithms, and classification performance evaluation, this paper focuses on feature selection and feature weighting. The main work is as follows:

1. Feature selection is an important part of text classification. It not only speeds up the classifier and saves storage space, but also filters out irrelevant features and reduces the interference they cause. This paper studies the popular feature selection algorithms in detail and analyzes their characteristics. In view of the shortcomings of Expected Cross Entropy, and taking into account the concentration of a feature between classes and its distribution within classes, this paper proposes an Expected Cross Entropy based on concentration degree and distribution degree, which incorporates how uniformly a feature item is distributed between and within classes. Experiments show that the Expected Cross Entropy based on concentration degree and distribution degree improves classification accuracy.

2. Feature weighting assigns a weight to each text feature according to its contribution to classification. This paper studies the classical feature weighting algorithm TF-IDF and improves it with concentration degree and distribution degree. Experiments show that TF-IDF based on concentration degree and distribution degree improves classification accuracy.

3. This paper completes the design and implementation of a Chinese text classification experiment platform and uses it to examine the effectiveness of the Expected Cross Entropy based on concentration degree and distribution degree. Results show that the improved Expected Cross Entropy and the TF-IDF-CD algorithm outperform the traditional Expected Cross Entropy and TF-IDF.
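
For orientation, the two classical baselines named in the abstract can be sketched as follows. This is a minimal illustration of the textbook forms only, assuming document-level presence counts for Expected Cross Entropy, ECE(t) = P(t) * sum over c of P(c|t) * log(P(c|t) / P(c)), and raw term frequency with idf = log(N / df) for TF-IDF. It is not the thesis's improved method: the abstract does not define the concentration degree and distribution degree terms, so they are omitted. The function names and the input layout (lists of tokens with parallel labels) are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def expected_cross_entropy(docs, labels):
    """Score each term with the classical Expected Cross Entropy.

    docs   -- list of token lists (hypothetical input layout)
    labels -- class label for each document, in the same order
    Probabilities are estimated from document-level presence counts.
    """
    n_docs = len(docs)
    class_count = Counter(labels)
    term_count = Counter()                  # documents containing the term
    term_class_count = defaultdict(Counter)  # the same, split by class
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            term_count[term] += 1
            term_class_count[term][label] += 1

    scores = {}
    for term, n_t in term_count.items():
        p_t = n_t / n_docs
        total = 0.0
        for c, n_c in class_count.items():
            p_c_given_t = term_class_count[term][c] / n_t
            if p_c_given_t > 0:  # a zero-probability class contributes nothing
                total += p_c_given_t * math.log(p_c_given_t / (n_c / n_docs))
        scores[term] = p_t * total
    return scores


def tf_idf(docs):
    """Return one {term: weight} dict per document (raw tf * log(N / df))."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    weighted = []
    for tokens in docs:
        tf = Counter(tokens)
        weighted.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weighted
```

A term that appears in only one class receives a high Expected Cross Entropy score, while a term spread evenly over all classes scores zero. Selecting the top-scoring terms and then weighting them with TF-IDF gives the baseline pipeline that the improved variants are compared against.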
Keywords/Search Tags: text classification, feature selection, feature weighting, concentration degree, distribution degree