
Analysis And Research On Feature Selection Algorithm For Text Classification

Posted on: 2011-01-14
Degree: Master
Type: Thesis
Country: China
Candidate: B Zhang
GTID: 2178360308455392
Subject: Information security
Abstract/Summary:
Automatic text classification assigns large numbers of semi-structured and unstructured documents of unknown type to categories in a given classification system according to their content. Because of the nature of semi-structured and unstructured text, a document vector often has hundreds of thousands of dimensions, which harms text classification in two main ways. First, many statistical classifiers that perform well in low-dimensional spaces become inefficient and impractical. Second, when the number of training samples is limited, too many features make statistical estimation very difficult and reduce the generalization ability of the classifier. Finding an effective method to reduce the dimensionality of the feature space and improve the efficiency and accuracy of classification is therefore vital to text classification.
Feature reduction maps a high-dimensional space to a much lower-dimensional one, preserving as much of the original information in the data as possible while eliminating features that are redundant or irrelevant to the categories. Existing feature reduction methods fall into two groups: feature selection and feature extraction. Feature selection picks from the original feature set, according to some criterion, the features that help the classification algorithm, and removes the redundant and irrelevant ones. Feature extraction creates a new feature set of much lower dimension by transforming the original feature set. Because feature selection is efficient and well suited to large-scale data sets, our work focuses on it.
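
As an illustration of filter-style feature selection, the sketch below scores each term with the chi-square statistic (the CHI baseline named later in this abstract) and keeps the top k terms. It is a minimal sketch under stated assumptions, not the thesis's implementation: documents are assumed to be sets of terms, a term's score is assumed to be its maximum chi-square value over all categories, and the toy corpus and the names `chi_square` and `select_features` are hypothetical.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 term/category contingency table.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def select_features(docs, labels, k):
    """Rank terms by their maximum chi-square score over categories; keep the top k."""
    vocab = sorted({term for doc in docs for term in doc})
    categories = sorted(set(labels))
    scores = {}
    for term in vocab:
        best = 0.0
        for cat in categories:
            a = sum(1 for doc, y in zip(docs, labels) if term in doc and y == cat)
            b = sum(1 for doc, y in zip(docs, labels) if term in doc and y != cat)
            c = sum(1 for doc, y in zip(docs, labels) if term not in doc and y == cat)
            d = sum(1 for doc, y in zip(docs, labels) if term not in doc and y != cat)
            best = max(best, chi_square(a, b, c, d))
        scores[term] = best
    # Highest score first; ties are broken alphabetically so the result is deterministic.
    ranked = sorted(vocab, key=lambda t: (-scores[t], t))
    return ranked[:k], scores


# Hypothetical toy corpus: each document is a set of terms.
docs = [
    {"goal", "match", "team"},
    {"goal", "team", "win"},
    {"stock", "market", "team"},
    {"stock", "price", "market"},
]
labels = ["sport", "sport", "finance", "finance"]

selected, scores = select_features(docs, labels, k=3)
print(selected)
```

In this toy corpus, "team" appears in both categories and so scores lower than "goal", "market", and "stock", each of which occurs in only one category. Taking the maximum over categories is one common convention; averaging over categories is another.
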
We propose two feature selection methods: one based on association analysis and one based on correlation measured by mutual information.
The method based on mutual information computes not only the relevance between each feature and the categories but also the correlation among the features themselves. An improved mutual information measure is used to quantify the correlation between features and categories, so that irrelevant and redundant features are eliminated while as much of the original information as possible is retained. Compared with the CHI (chi-square) and IG (information gain) baselines, our experimental results show that the proposed method is effective for feature selection.
The method based on association analysis takes into account the association relationships among features, which traditional feature selection methods ignore. First, the algorithm mines the association relationships among words to find two-word sets that have a significant impact on classification. Some words in these sets might otherwise be discarded because they receive low scores from conventional feature selection methods, which would bias the classification results and lower classification accuracy. The algorithm then uses these two-word sets to reorder the features previously ranked by a traditional method. Experimental results on the Reuters-21578 and 20 Newsgroups datasets show that the proposed method is effective.
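
The idea of weighing feature-category relevance against feature-feature correlation can be sketched as a greedy selection loop. The abstract does not give the formula for the improved mutual information measure, so the sketch below substitutes plain mutual information and a "relevance minus mean redundancy" criterion in the style of mRMR. This criterion, the function names `mutual_information` and `greedy_select`, and the toy data are all assumptions for illustration, not the thesis's method.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information, in bits, between two equal-length discrete sequences."""
    n = len(xs)
    px = Counter(xs)
    py = Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), count in pxy.items():
        p_joint = count / n
        mi += p_joint * math.log2(p_joint / ((px[x] / n) * (py[y] / n)))
    return mi


def greedy_select(features, labels, k):
    """Greedily pick k features, trading relevance to labels against redundancy.

    features: dict mapping a feature name to its per-document values.
    Assumed criterion: MI(feature, labels) minus the mean MI between the
    candidate and the features already selected.
    """
    relevance = {name: mutual_information(vals, labels) for name, vals in features.items()}
    selected = []
    remaining = sorted(features)  # sorted so ties resolve alphabetically

    def score(name):
        if not selected:
            return relevance[name]
        redundancy = sum(
            mutual_information(features[name], features[s]) for s in selected
        ) / len(selected)
        return relevance[name] - redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: 8 documents in 4 categories, binary term indicators.
labels = [0, 0, 1, 1, 2, 2, 3, 3]
features = {
    "goal":   [0, 0, 0, 0, 1, 1, 1, 1],
    "goals":  [0, 0, 0, 0, 1, 1, 1, 1],  # exact duplicate of "goal" (redundant)
    "market": [0, 0, 1, 1, 0, 0, 1, 1],  # independent of "goal"
}

print(greedy_select(features, labels, k=2))
```

All three toy features are equally relevant to the labels on their own, so a ranking by relevance alone could keep both "goal" and its duplicate "goals". The redundancy term makes the second pick "market", which carries information the first pick lacks.
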
Keywords/Search Tags: text classification, feature reduction, correlation, association analysis