Research Of Feature Selection Method For Chinese Text Classification

Posted on: 2013-08-06
Degree: Master
Type: Thesis
Country: China
Candidate: J H Chen
GTID: 2248330392951372
Subject: Computer application technology
Abstract/Summary:
With the development of technology and the spread of networks, more and more data is available to people, and most of it is in the form of text. This unstructured form leads to a situation in which data is abundant but information is relatively scarce. Text mining technology provides an effective way to solve this problem. Text classification is a branch of text mining and a key technology for managing and organizing complex text data effectively. Text mining can help people organize and streamline information effectively. Two important research directions in text classification are feature selection methods and text classification algorithms.

Feature selection refers to selecting, from a high-dimensional feature term space, the feature terms that best represent the characteristics of a text. A good feature selection method reduces the dimension of the text feature space, which improves the efficiency of text classification, and it also improves classification accuracy by removing invalid feature terms. A good text classification method improves the classification result directly.

The feature selection algorithms commonly used in text categorization mostly consider only the correlation between features and classes and pay less attention to the correlation among features. In view of this, this paper proposes a combined feature selection algorithm based on category discriminating power and correlation analysis. The algorithm first uses discriminating power to extract the features that reveal larger differences among categories, which reduces the sparsity of the feature space. It then applies correlation analysis to measure the relevance between features and categories and the redundancy among features, so as to obtain a feature subset whose members are more representative and not redundant with one another. Experiments demonstrate that the proposed algorithm effectively improves the performance of the classifier.
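The two-stage idea described in the abstract can be illustrated with a small sketch. The abstract does not give the formulas, so everything concrete here is an assumption: `discriminating_power` is a hypothetical stand-in (the spread of a term's per-category document frequency), and the correlation stage uses symmetric uncertainty with an FCBF-style redundancy rule, which is the setting where the terms C-correlation and F-correlation are commonly used. The function names, thresholds, and toy data are illustrative only, and documents are assumed to be already segmented into terms and encoded as binary presence vectors.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def symmetric_uncertainty(x, y):
    """SU(x, y) = 2 * I(x; y) / (H(x) + H(y)), in the range [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    hxy = entropy(list(zip(x, y)))
    return 2.0 * (hx + hy - hxy) / (hx + hy)


def discriminating_power(column, labels):
    """Hypothetical measure (assumption, not the thesis formula):
    spread of the term's document frequency across categories."""
    rates = []
    for category in sorted(set(labels)):
        docs = [v for v, lab in zip(column, labels) if lab == category]
        rates.append(sum(docs) / len(docs))
    return max(rates) - min(rates)


def select_features(X, labels, terms, dp_threshold=0.4, su_threshold=0.0):
    columns = {t: [row[j] for row in X] for j, t in enumerate(terms)}

    # Stage 1: keep terms that differ strongly between categories.
    candidates = [t for t in terms
                  if discriminating_power(columns[t], labels) >= dp_threshold]

    # Stage 2a: C-correlation = SU between a term and the class labels.
    c_corr = {t: symmetric_uncertainty(columns[t], labels) for t in candidates}
    ranked = sorted((t for t in candidates if c_corr[t] > su_threshold),
                    key=lambda t: -c_corr[t])

    # Stage 2b: F-correlation = SU between two terms. A term is dropped
    # when an already selected term is at least as correlated with it
    # as the term is with the class (FCBF-style redundancy rule).
    selected = []
    for t in ranked:
        if all(symmetric_uncertainty(columns[t], columns[s]) < c_corr[t]
               for s in selected):
            selected.append(t)
    return selected


# Toy data (made up): 4 "sports" and 4 "finance" documents.
terms = ["goal", "match", "stock", "the"]
labels = ["sports"] * 4 + ["finance"] * 4
X = [
    [1, 1, 0, 1], [1, 1, 0, 1], [1, 1, 0, 1], [0, 0, 0, 1],
    [0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 0, 1], [0, 0, 0, 1],
]
selected = select_features(X, labels, terms)
print(selected)
```

In this toy run, "the" appears in every document and is filtered out in the first stage, while "match" duplicates "goal" and is removed as redundant in the second stage. The thresholds would need tuning on a real corpus.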
Keywords/Search Tags: text categorization, feature selection, category discriminating power, C-correlation, F-correlation, relevant independency